
Bad ARM peak performance (especially for Apple M1/M2) #155

Closed
tycho opened this issue Sep 11, 2022 · 5 comments
Labels
help wanted Extra attention is needed

Comments

tycho commented Sep 11, 2022

I'm not sure this calculation is accounting for everything it should:

cpufetch/src/arm/midr.c

Lines 73 to 92 in 61a1ad8

int64_t get_peak_performance(struct cpuInfo* cpu) {
  struct cpuInfo* ptr = cpu;
  // First check we have consistent data
  for(int i=0; i < cpu->num_cpus; ptr = ptr->next_cpu, i++) {
    if(get_freq(ptr->freq) == UNKNOWN_DATA) {
      return -1;
    }
  }

  int64_t flops = 0;
  ptr = cpu;
  for(int i=0; i < cpu->num_cpus; ptr = ptr->next_cpu, i++) {
    flops += ptr->topo->total_cores * (get_freq(ptr->freq) * 1000000);
  }

  if(cpu->feat->NEON) flops = flops * 4;

  return flops;
}

On my Apple M2, cpufetch claims the CPU has a peak performance of 100.8 GFLOP/s. But if I compile and run a multithreaded, CPU-only floating point n-body simulation, it easily exceeds 220 GFLOP/s (varying depending on the algorithm).

The algorithm appears to be: 1 instruction per cycle * clock frequency * core count * SIMD width

I think somehow the algorithm needs to account for the variability in that implicit constant 1 in the calculation. Unfortunately that's not trivial since this is all basically inference, and it would definitely vary a lot by vendor and model. There's also some variability depending on the relative performance of "big" vs "little" cores, as the difference between those is not usually just clock frequency.

Dr-Noob (Owner) commented Sep 11, 2022

This is a good point. I suspect that peak performance is not accurate in most ARM CPUs, especially on M1/M2.

The peak performance is computed as: ALUs * frequency (in hertz) * physical cores * SIMD width. For M1/M2 I assume a width of 4 because of NEON, and I account for the different frequencies of the big and little cores. I assume ALUs=1 because I don't know a better approximation for that value; there's very little technical information about M1/M2 performance, and if better sources exist, I wasn't able to find them. ALUs accounts for the number of functional units, which directly affects instructions per cycle. In this case, I'm assuming CPI=1.

CPI might be 0.5, but if you are getting 220 GFLOP/s (which does not surprise me, because M1/M2 appears to be pretty powerful), there must be something else I'm not accounting for. I suspect that M1/M2 has FMA capabilities, but Googling for that doesn't turn up any good results. Just out of curiosity, can you provide a link to the benchmark you used?

By the way, if you are interested in this, you might find peakperf interesting (although it only works on x86 and NVIDIA devices, not ARM).

tycho (Author) commented Sep 11, 2022

> This is a good point. I suspect that peak performance is not accurate in most ARM CPUs, especially on M1/M2.
>
> The peak performance is computed as: ALUs * frequency (in hertz) * physical cores * SIMD width. For M1/M2 I assume a width of 4 because of NEON, and I account for the different frequencies of the big and little cores. I assume ALUs=1 because I don't know a better approximation for that value; there's very little technical information about M1/M2 performance, and if better sources exist, I wasn't able to find them. ALUs accounts for the number of functional units, which directly affects instructions per cycle. In this case, I'm assuming CPI=1.

I'd be concerned that the difference between big/little is going to be more than clock frequency. I suspect the little cores have fewer execution units as well, to reduce die area and power consumption. If that's the case, the instruction throughput will be different even if big/little ran at the same fixed clock frequencies. It's hard to measure that with any microbenchmarks on macOS, though, because there is no [public?] userspace thread affinity API, and thus no way to control which CPU a thread runs on (or to prevent it from migrating). Maybe I should get Asahi Linux running so I can run other tests at some point.

> CPI might be 0.5, but if you are getting 220 GFLOP/s (which does not surprise me, because M1/M2 appears to be pretty powerful), there must be something else I'm not accounting for. I suspect that M1/M2 has FMA capabilities, but Googling for that doesn't turn up any good results.

M1 and M2 do have FMA instructions, yeah. Disassembly of the n-body benchmark has a few fmadd, fmla, and other instructions scattered about.

> Just out of curiosity, can you provide a link to the benchmark you used?

This is the benchmark: https://github.com/tycho/nbody

Here's how you can build/run it.

#
# If you want to run more than a single thread, you'll need to get GCC installed.
# Apple's Clang does not support OpenMP for some reason (which is what I use for
# managing the threading).
#
$ brew install gcc

$ env CC=gcc-12 CXX=g++-12 meson . build
$ ninja -C build

#
# Optionally limit thread count to run only on performance cores, or just leave it
# undefined to use all logical CPUs.
#
$ export OMP_NUM_THREADS="$(sysctl -n hw.perflevel0.logicalcpu)"

#
# --bodies          number of bodies, multiplied by 1024
# --no-crosscheck   compares the currently running algorithm's results to a
#                   reference algorithm (SOA) and prints the delta between
#                   the results. since this means each timestep is computed twice,
#                   this slows things down and should just be turned off
# --iterations      number of loops to execute
# --cycle-after     number of steps to execute with each algorithm before rotating
#                   to the next one
# --verbose         ensures it always prints a status line after each step
#
$ build/nbody --bodies 64 --no-crosscheck --iterations 1 --cycle-after 5 --verbose

> By the way, if you are interested in this, you might find peakperf interesting (although it only works on x86 and NVIDIA devices, not ARM).

Ah, neat! I should try that out.

Dr-Noob (Owner) commented Sep 12, 2022

> I'd be concerned that the difference between big/little is going to be more than clock frequency. I suspect the little cores have fewer execution units as well, to reduce die area and power consumption. If that's the case, the instruction throughput will be different even if big/little ran at the same fixed clock frequencies. It's hard to measure that with any microbenchmarks on macOS, though, because there is no [public?] userspace thread affinity API, and thus no way to control which CPU a thread runs on (or to prevent it from migrating). Maybe I should get Asahi Linux running so I can run other tests at some point.

Sure, but unfortunately I don't know of any good source documenting the differences between big and little cores for each microarchitecture. Same for the M1. As you said, it's also pretty difficult because of the lack of software support for low-level features.

> M1 and M2 do have FMA instructions, yeah. Disassembly of the n-body benchmark has a few fmadd, fmla, and other instructions scattered about.

I suspected it but wasn't sure, and I don't know where it comes from. It should be part of NEON, but I have to investigate further how FMA works on ARM. Improving this is something I have had in mind since last year (issue #132); I'll have a look into it one day. Also, it seems the M1 has pretty good FMA units: https://dougallj.github.io/applecpu/firestorm-simd.html

> Here's how you can build/run it.

I don't own an M1/M2. But if I did, I would love to make a peakperf version for M1/M2. I'm very curious to see the real peak performance of that chip, as it should be pretty high (especially considering its relatively low energy consumption).

Dr-Noob (Owner) commented Sep 14, 2022

I'm sure there's a page somewhere that lays out the M1's peak performance and explains everything nicely, but I was unable to find it. What I did find is a benchmark for the M1 claiming to achieve the theoretical performance I assumed in my previous post (assuming it has 4 FMA ALUs).

If I understand everything right, it gets 102 GFLOP/s using 1 big core, so that would be 409.6 GFLOP/s for the 4 big cores. But the little cores probably have fewer FMA units: according to https://dougallj.github.io/applecpu/icestorm-simd.html, they have 2 FMA units. That would give us:

3.2 GHz * 2 (FMA) * 4 (SIMD width) * 4 (FMA units) * 4 (big cores) + 2.064 GHz * 2 (FMA) * 4 (SIMD width) * 2 (FMA units) * 4 (little cores) = 541.69 GFLOP/s

We would also need to find the exact frequencies the cores run at when executing FMA ops.

Sounds good?

@Dr-Noob Dr-Noob changed the title ARM peak performance calculation accuracy Bad ARM peak performance (especially for Apple M1/M2) Sep 14, 2022
Dr-Noob added a commit that referenced this issue Sep 14, 2022
@Dr-Noob Dr-Noob added the help wanted Extra attention is needed label Sep 14, 2022
tycho (Author) commented Sep 14, 2022

That does sound better to me! Thanks for investigating this.

@tycho tycho closed this as completed Sep 14, 2022