Fused CPU Adam performance #574
@msaroufim:
This is the main one that's throwing me off: fused Adam optimizer time using ipex.optimize, but only optimizing the optimizer, is 2.7120 seconds. To repro, replace `_` with `model` in this line: https://github.com/msaroufim/tinyoptimizer/blob/master/cpu_optimizer/ipex/class.py#L100

I'd like to understand the ballpark performance improvement I can expect from using fused CPU Adam. Is it around 10%, or closer to 2x, for my microbenchmark? And should I expect this pattern to change at larger model sizes?
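For reference, a minimal sketch of requesting a fused Adam in plain PyTorch (the model and tensor sizes below are arbitrary; `fused=True` is only honored on builds that ship the fused CPU kernels, so the sketch falls back to the default implementation otherwise):

```python
import torch

# Toy model; sizes are illustrative.
model = torch.nn.Linear(512, 512)

# Request the fused (single-kernel) Adam; fall back to the default
# for-loop/multi-tensor implementation on builds without fused CPU support.
try:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
except (RuntimeError, TypeError):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 512)
model(x).sum().backward()
opt.step()
opt.zero_grad()
```

With ipex, the analogous route is passing the optimizer to `ipex.optimize(model, optimizer=...)`, as in the linked repro.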
I don't think that's expected; my guess is something else is going on here. Do you have profiler info? Perhaps we can look into the problem with it.
If we're talking about the Adam optimizer alone, 2x makes more sense to me for the fused one, but it depends on the model size: the larger the model, the more benefit we get from fusion.
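To put a local number on the fused-vs-unfused gap, a self-contained micro-benchmark along these lines can help (parameter count, tensor sizes, and iteration count are arbitrary; this times only `optimizer.step()`, not a full training step):

```python
import time
import torch

def bench_step(fused, num_params=20, numel=100_000, iters=50):
    """Average wall time of one Adam step over a bag of parameter tensors."""
    params = [torch.randn(numel, requires_grad=True) for _ in range(num_params)]
    for p in params:
        p.grad = torch.randn_like(p)
    try:
        opt = torch.optim.Adam(params, lr=1e-3, fused=fused)
    except (RuntimeError, TypeError):
        return None  # fused CPU Adam not available in this build
    opt.step()  # warm-up: initializes optimizer state
    t0 = time.perf_counter()
    for _ in range(iters):
        opt.step()
    return (time.perf_counter() - t0) / iters

base = bench_step(fused=False)
fused = bench_step(fused=True)
print(f"for-loop Adam: {base * 1e3:.3f} ms/step")
if fused is not None:
    print(f"fused Adam:    {fused * 1e3:.3f} ms/step")
```

Scaling `num_params` and `numel` up should show the size dependence described above.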
cc @zhuhaozhe
I don't have any profile data available, but the results were reliably reproducing with the script I linked in the original message. Let me know if there's any other info I can provide to make debugging this easier.
Hi @msaroufim, upon changing … Nevertheless, we'll try to fix this regression. Thanks!
Investigating why …
Hi @msaroufim, when … This method is responsible for the speedup when … I'll figure out what precisely in this method results in a speedup. Thanks!
Rather non-intuitively, deep-copying the optimizer results in the ~10% speedup for … @jgong5 @zhuhaozhe, can you please elaborate on why that would result in a speedup? Thanks!
@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint …
Setting … I used something like this (in …
I had also preloaded Intel OpenMP (instead of GNU libgomp) and tcmalloc. Benchmarking results with torch.compile (datapoint …): @jgong5, with …
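For context, preloading Intel OpenMP and tcmalloc is typically done via `LD_PRELOAD` before launching the benchmark. The library paths and script name below are placeholders for wherever those libraries live on your system:

```shell
# Paths are illustrative; point these at your actual installs of
# libiomp5 (Intel OpenMP) and libtcmalloc.
export LD_PRELOAD="/opt/intel/oneapi/compiler/latest/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD}"
export OMP_NUM_THREADS="$(nproc)"   # pin the thread count explicitly
python benchmark.py
```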
There are graph breaks with …
Hi @msaroufim, these graph breaks are being used in PyTorch source code. As per pytorch/pytorch#104053, they will be removed when solution 3 in that ticket (…

@jgong5 @Guobing-Chen, Dynamo logs pertaining to the graph breaks are at https://gist.github.com/sanchitintel/05b19b6d162cf5cdf5dbb174c51962ec. They were collected with the environment variable …

Thanks!
Hi @msaroufim, cc @sanchitintel. BTW, we have already upstreamed fused Adam/AdamW/Adagrad/SGD into PyTorch; do you need any more help here?
I'm quite happy with the new fused eager kernel that's been upstreamed to PyTorch. Still not sure why compile is so slow, though, so I'll let @jgong5 decide where he wants to track this.
Hi @msaroufim, I have some benchmark results comparing fused/non-fused/compile: https://github.com/zhuhaozhe/Misc/blob/main/bench-fused-optimizer/bench-result.md.
OK, sounds good; will close this in favor of the issue in PyTorch.
Describe the issue
I'm trying to leverage a fast CPU Adam implementation, and I've found many ways of doing so that provide slightly different perf. One setting is downright confusing as well, so I'm opening this issue to discuss.
Repro is here
Results
Experiments were performed on …