Fused CPU Adam performance #574
@msaroufim:
This is the main one that's throwing me off: fused Adam optimizer time using ipex.optimize, but only optimizing the optimizer, is 2.7120 seconds. To repro, replace `_` with `model` in this line: https://github.com/msaroufim/tinyoptimizer/blob/master/cpu_optimizer/ipex/class.py#L100

I'd like to understand the ballpark performance improvement I can expect from using fused CPU Adam. Is it around 10%, or closer to 2x, for my microbenchmark? And should I expect this pattern to change at larger model sizes?
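For reference, a minimal sketch of requesting a fused Adam in plain PyTorch (the model and tensor sizes below are arbitrary; `fused=True` is only honored on builds that ship the fused CPU kernels, so the sketch falls back to the default implementation otherwise):

```python
import torch

# Toy model; sizes are illustrative.
model = torch.nn.Linear(512, 512)

# Request the fused (single-kernel) Adam; fall back to the default
# for-loop/multi-tensor implementation on builds without fused CPU support.
try:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
except (RuntimeError, TypeError):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 512)
model(x).sum().backward()
opt.step()
opt.zero_grad()
```

With ipex, the analogous route is passing the optimizer to `ipex.optimize(model, optimizer=...)`, as in the linked repro.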
I don't think that's expected; my guess is something else is going on here. Do you have profiler info? Perhaps we can look into the problem with it.
If we're talking about the Adam optimizer alone, 2x makes more sense to me for the fused one, but it depends on the model size: the larger the model, the more benefit we get from fusion.
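To put a local number on the fused-vs-unfused gap, a self-contained micro-benchmark along these lines can help (parameter count, tensor sizes, and iteration count are arbitrary; this times only `optimizer.step()`, not a full training step):

```python
import time
import torch

def bench_step(fused, num_params=20, numel=100_000, iters=50):
    """Average wall time of one Adam step over a bag of parameter tensors."""
    params = [torch.randn(numel, requires_grad=True) for _ in range(num_params)]
    for p in params:
        p.grad = torch.randn_like(p)
    try:
        opt = torch.optim.Adam(params, lr=1e-3, fused=fused)
    except (RuntimeError, TypeError):
        return None  # fused CPU Adam not available in this build
    opt.step()  # warm-up: initializes optimizer state
    t0 = time.perf_counter()
    for _ in range(iters):
        opt.step()
    return (time.perf_counter() - t0) / iters

base = bench_step(fused=False)
fused = bench_step(fused=True)
print(f"for-loop Adam: {base * 1e3:.3f} ms/step")
if fused is not None:
    print(f"fused Adam:    {fused * 1e3:.3f} ms/step")
```

Scaling `num_params` and `numel` up should show the size dependence described above.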
cc @zhuhaozhe
I don't have any profile data available, but the results were reliably reproducing with the script I linked in the original message. Let me know if there's any other info I can provide to make debugging this easier.
Hi @msaroufim, upon changing … Nevertheless, we'll try to fix this regression. Thanks!
Investigating why …
Hi @msaroufim, when … This method is responsible for the speedup when … I'll figure out what precisely in this method results in a speedup. Thanks!
Rather non-intuitively, deep-copying the optimizer results in the ~10% speedup for … @jgong5 @zhuhaozhe, can you please elaborate on why that would result in a speedup? Thanks!
@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint …
Setting … I used something like this (in …
I had also preloaded Intel OpenMP (instead of GNU libgomp) and tcmalloc. Benchmarking results with torch.compile (datapoint …): @jgong5, with …
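For context, preloading Intel OpenMP and tcmalloc is typically done via `LD_PRELOAD` before launching the benchmark. The library paths and script name below are placeholders for wherever those libraries live on your system:

```shell
# Paths are illustrative; point these at your actual installs of
# libiomp5 (Intel OpenMP) and libtcmalloc.
export LD_PRELOAD="/opt/intel/oneapi/compiler/latest/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD}"
export OMP_NUM_THREADS="$(nproc)"   # pin the thread count explicitly
python benchmark.py
```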
There are graph breaks with …
Hi @msaroufim, these graph breaks are being used in PyTorch source code. As per pytorch/pytorch#104053, they will be removed when solution 3 in that ticket (…

@jgong5 @Guobing-Chen, Dynamo logs pertaining to the graph breaks are at https://gist.github.com/sanchitintel/05b19b6d162cf5cdf5dbb174c51962ec. They were collected with the environment variable …

Thanks!
Hi @msaroufim, cc @sanchitintel. BTW, we have already upstreamed fused Adam/AdamW/Adagrad/SGD into PyTorch; do you need any more help here?
I'm quite happy with the new fused eager kernel that's been upstreamed to PyTorch. Still not sure why compile is so slow, though, so I'll let @jgong5 decide where he wants to track this.
Hi @msaroufim, I have some benchmark results comparing fused/non-fused/compile: https://github.com/zhuhaozhe/Misc/blob/main/bench-fused-optimizer/bench-result.md.
OK, sounds good; will close this in favor of the issue in PyTorch.
Describe the issue
I'm trying to leverage a fast CPU Adam implementation, and I've found many ways of doing so that provide slightly different perf. One setting is downright confusing as well, so I'm opening this issue to discuss.
Repro is here
Results
Experiments were performed on …