Bad ARM peak performance (especially for Apple M1/M2) #155
This is a good point. I suspect that the peak performance figure is not accurate on most ARM CPUs, especially on M1/M2. The peak performance is computed as shown in the snippet you quoted. CPI might be 0.5, but if you say you get 220 GFLOP/s (which does not surprise me, since the M1/M2 appears to be pretty powerful), there must be something else I'm not accounting for. I suspect that M1/M2 has FMA capabilities, but when I google that I don't get any good results. Just out of curiosity, can you provide a link to the benchmark you used? By the way, if you are interested in this, you might find peakperf interesting (although it only works on x86 and NVIDIA devices, not ARM).
I'd be concerned that the difference between big/little is more than clock frequency. I suspect the little cores also have fewer execution units, to reduce die area and power consumption. If that's the case, the instruction throughput will differ even if big/little ran at the same fixed clock frequency. That's hard to measure with any microbenchmark on macOS, though, because there is no [public?] userspace thread-affinity API and thus no way to control which CPU a thread runs on (or to prevent it from migrating). Maybe I should get Asahi Linux running so I can run other tests at some point.
M1 and M2 do have FMA instructions, yeah. Disassembly of the n-body benchmark shows a few of them.
This is the benchmark: https://github.com/tycho/nbody Here's how you can build/run it.
Ah, neat! I should try that out.
Sure, but unfortunately I don't know of any good source on the differences between small/big cores for each uarch. Same for the M1. As you said, it's also pretty difficult because of the lack of software support for low-level features.
I suspected it, but I wasn't sure. I don't know where it comes from; it should be part of NEON, but I have to investigate further how FMA works on ARM. Improving this is something I have had in mind since last year (issue #132); I'll look into it one day. Also, it seems the M1 has pretty good FMA units: https://dougallj.github.io/applecpu/firestorm-simd.html
I don't own an M1/M2, but if I did, I would love to make a peakperf version for the M1/M2. I'm very curious to see the real peak performance of that chip, as it should be pretty high (especially considering its relatively low energy consumption).
I'm sure there's a page somewhere that lays out the peak performance of the M1 and explains everything nicely, but I was unable to find it. What I did find is a benchmark for the M1 claiming to achieve the theoretical performance I assumed in my previous post (assuming it has 4 FMA ALUs). If I understand everything correctly, it gets 102 GFLOP/s using one big core, so that would be 409.6 GFLOP/s for the four big cores. But the little cores will probably have fewer FMA units. Indeed, according to https://dougallj.github.io/applecpu/icestorm-simd.html, the little cores have 2 FMA units. That would give us:
We would also need to find the exact frequencies when running FMA ops. Sounds good?
That does sound better to me! Thanks for investigating this.
I'm not sure this calculation is accounting for everything it should:
cpufetch/src/arm/midr.c, lines 73 to 92 at commit 61a1ad8
On my Apple M2, cpufetch claims the CPU has a peak performance of 100.8 GFLOP/s. But if I compile and run a multithreaded, CPU-only, floating-point n-body simulation, it easily exceeds 220 GFLOP/s (varying depending on the algorithm).
The algorithm appears to be:
1 instruction per cycle * cycle count * CPU count * SIMD width
I think the algorithm somehow needs to account for the variability in that implicit constant `1` in the calculation. Unfortunately that's not trivial, since this is all basically inference, and it would definitely vary a lot by vendor and model. There's also some variability depending on the relative performance of "big" vs "little" cores, as the difference between those is not usually just clock frequency.