
Bad ARM peak performance (especially for Apple M1/M2) #155

Closed
tycho opened this issue Sep 11, 2022 · 5 comments
Labels
help wanted Extra attention is needed

Comments

tycho commented Sep 11, 2022

I'm not sure this calculation is accounting for everything it should:

cpufetch/src/arm/midr.c

Lines 73 to 92 in 61a1ad8

int64_t get_peak_performance(struct cpuInfo* cpu) {
  struct cpuInfo* ptr = cpu;
  // First check we have consistent data
  for(int i=0; i < cpu->num_cpus; ptr = ptr->next_cpu, i++) {
    if(get_freq(ptr->freq) == UNKNOWN_DATA) {
      return -1;
    }
  }

  int64_t flops = 0;
  ptr = cpu;
  for(int i=0; i < cpu->num_cpus; ptr = ptr->next_cpu, i++) {
    flops += ptr->topo->total_cores * (get_freq(ptr->freq) * 1000000);
  }

  if(cpu->feat->NEON) flops = flops * 4;

  return flops;
}

On my Apple M2, cpufetch claims the CPU has a peak performance of 100.8 GFLOP/s. But if I compile and run a multithreaded, CPU-only floating point n-body simulation, it easily exceeds 220 GFLOP/s (varying depending on the algorithm).

The algorithm appears to be: 1 instruction per cycle * clock frequency * core count * SIMD width

I think somehow the algorithm needs to account for the variability in that implicit constant 1 in the calculation. Unfortunately that's not trivial since this is all basically inference, and it would definitely vary a lot by vendor and model. There's also some variability depending on the relative performance of "big" vs "little" cores, as the difference between those is not usually just clock frequency.

Dr-Noob (Owner) commented Sep 11, 2022

This is a good point. I suspect that peak performance is not accurate in most ARM CPUs, especially on M1/M2.

The peak performance is computed as: ALUs * frequency (in hertz) * physical cores * SIMD width. For M1/M2 I assume a width of 4 because of NEON, and I account for the different frequencies of the big and little cores. I assume ALUs=1 because I don't know a better approximation for that value; there's very little technical information about M1/M2 performance, and if better sources exist, I wasn't able to find them. ALUs accounts for the number of functional units, which directly affects instructions per cycle. In this case, I'm assuming CPI=1.

CPI might be 0.5, but if you are getting 220 GFLOP/s (which does not surprise me, because M1/M2 appears to be pretty powerful), there must be something else I'm not accounting for. I suspect that M1/M2 has FMA capabilities, but Googling for that doesn't turn up any good results. Just out of curiosity, can you provide a link to the benchmark you used?

By the way, if you are interested in this, you might find peakperf interesting (although it only works on x86 and NVIDIA devices, not ARM).

tycho (Author) commented Sep 11, 2022

> This is a good point. I suspect that peak performance is not accurate in most ARM CPUs, especially on M1/M2.
>
> The peak performance is computed as: ALUs * frequency (in hertz) * physical cores * SIMD width. For M1/M2 I assume a width of 4 because of NEON, and I account for the different frequencies of the big and little cores. I assume ALUs=1 because I don't know a better approximation for that value; there's very little technical information about M1/M2 performance, and if better sources exist, I wasn't able to find them. ALUs accounts for the number of functional units, which directly affects instructions per cycle. In this case, I'm assuming CPI=1.

I'd be concerned that the difference between big/little is going to be more than clock frequency. I suspect the little cores have fewer execution units as well, to reduce die area and power consumption. If that's the case, the instruction throughput will be different even if big/little ran at the same fixed clock frequencies. It's hard to measure that with any microbenchmarks on macOS, though, because there is no [public?] userspace thread affinity API, and thus no way to control which CPU a thread runs on (or to prevent it from migrating). Maybe I should get Asahi Linux running so I can run other tests at some point.

> CPI might be 0.5, but if you are getting 220 GFLOP/s (which does not surprise me, because M1/M2 appears to be pretty powerful), there must be something else I'm not accounting for. I suspect that M1/M2 has FMA capabilities, but Googling for that doesn't turn up any good results.

M1 and M2 do have FMA instructions, yeah. Disassembly of the n-body benchmark has a few fmadd, fmla, and other instructions scattered about.

> Just out of curiosity, can you provide a link to the benchmark you used?

This is the benchmark: https://github.com/tycho/nbody

Here's how you can build/run it.

#
# If you want to run more than a single thread, you'll need to get GCC installed.
# Apple's Clang does not support OpenMP for some reason (which is what I use for
# managing the threading).
#
$ brew install gcc

$ env CC=gcc-12 CXX=g++-12 meson . build
$ ninja -C build

#
# Optionally limit thread count to run only on performance cores, or just leave it
# undefined to use all logical CPUs.
#
$ export OMP_NUM_THREADS="$(sysctl -n hw.perflevel0.logicalcpu)"

#
# --bodies          number of bodies, multiplied by 1024
# --no-crosscheck   compares the currently running algorithm's results to a
#                   reference algorithm (SOA) and prints the delta between
#                   the results. since this means each timestep is computed twice,
#                   this slows things down and should just be turned off
# --iterations      number of loops to execute
# --cycle-after     number of steps to execute with each algorithm before rotating
#                   to the next one
# --verbose         ensures it always prints a status line after each step
#
$ build/nbody --bodies 64 --no-crosscheck --iterations 1 --cycle-after 5 --verbose

> By the way, if you are interested in this, you might find peakperf interesting (although it only works on x86 and NVIDIA devices, not ARM).

Ah, neat! I should try that out.

Dr-Noob (Owner) commented Sep 12, 2022

> I'd be concerned that the difference between big/little is going to be more than clock frequency. I suspect the little cores have fewer execution units as well, to reduce die area and power consumption. If that's the case, the instruction throughput will be different even if big/little ran at the same fixed clock frequencies. It's hard to measure that with any microbenchmarks on macOS, though, because there is no [public?] userspace thread affinity API, and thus no way to control which CPU a thread runs on (or to prevent it from migrating). Maybe I should get Asahi Linux running so I can run other tests at some point.

Sure, but unfortunately I don't know of any good source documenting the differences between big and little cores for each microarchitecture. Same for the M1. As you said, it's also pretty difficult because of the lack of software support for low-level features.

> M1 and M2 do have FMA instructions, yeah. Disassembly of the n-body benchmark has a few fmadd, fmla, and other instructions scattered about.

I suspected it but wasn't sure, and I don't know where it comes from. It should be part of NEON, but I have to investigate further how FMA works on ARM. Improving this is something I have had in mind since last year (issue #132); I'll have a look into it one day. Also, it seems the M1 has pretty good FMA units: https://dougallj.github.io/applecpu/firestorm-simd.html

> Here's how you can build/run it.

I don't own an M1/M2. But if I did, I would love to make a peakperf version for M1/M2. I'm very curious to see the real peak performance of that chip, as it should be pretty high (especially considering its relatively low energy consumption).

Dr-Noob (Owner) commented Sep 14, 2022

I'm sure there's a page somewhere that lays out the M1's peak performance and explains everything nicely, but I was unable to find it. What I did find is a benchmark for the M1 claiming to achieve the theoretical performance I assumed in my previous post (assuming it has 4 FMA ALUs).

If I understand everything right, it gets 102 GFLOP/s using 1 big core, so that would be 409.6 GFLOP/s for the 4 big cores. But the little cores probably have fewer FMA units: according to https://dougallj.github.io/applecpu/icestorm-simd.html, they have 2 FMA units. That would give us:

3.2 GHz * 2 (FMA) * 4 (SIMD width) * 4 (FMA units) * 4 (big cores) + 2.064 GHz * 2 (FMA) * 4 (SIMD width) * 2 (FMA units) * 4 (little cores) = 541.69 GFLOP/s

We would also need to find the exact frequencies the cores run at when executing FMA ops.

Sounds good?

@Dr-Noob Dr-Noob changed the title ARM peak performance calculation accuracy Bad ARM peak performance (especially for Apple M1/M2) Sep 14, 2022
Dr-Noob added a commit that referenced this issue Sep 14, 2022
@Dr-Noob Dr-Noob added the help wanted Extra attention is needed label Sep 14, 2022
tycho (Author) commented Sep 14, 2022

That does sound better to me! Thanks for investigating this.

@tycho tycho closed this as completed Sep 14, 2022