Description
Describe the project you are working on
Something Godot related.
Describe the problem or limitation you are having in your project
I'm proposing a performance improvement on nearly all Intel macOS computers (2013 and newer), at the cost of binary size (or compatibility).
Describe the feature / enhancement and how it helps to overcome the problem or limitation
On macOS, Godot currently builds for x86_64 and arm64.
macOS supports another relevant type of binary slice, x86_64h. This binary slice is preferred over x86_64 on all Haswell and newer CPUs. This includes all post-2013 Macs. x86_64, for comparison, was adopted by Apple in 2006, and its supported instruction set was likely finalized then.
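The slice preference described above can be modelled roughly as follows. This is only an illustrative sketch, not how the macOS loader is actually implemented; the function name `pick_slice` and the boolean `cpu_is_haswell_or_newer` are made up for the example.

```python
def pick_slice(available_slices, cpu_is_haswell_or_newer):
    """Return the Intel slice we would expect to be preferred.

    Hypothetical model: x86_64h is chosen only when the CPU supports it
    and the binary contains it; otherwise fall back to plain x86_64.
    """
    if cpu_is_haswell_or_newer and "x86_64h" in available_slices:
        return "x86_64h"
    if "x86_64" in available_slices:
        return "x86_64"
    return None  # no Intel slice in this binary that the CPU can run


# A fat binary with all three slices works on both old and new Intel Macs.
fat = ["x86_64", "x86_64h", "arm64"]
print(pick_slice(fat, cpu_is_haswell_or_newer=True))
print(pick_slice(fat, cpu_is_haswell_or_newer=False))
```

Under this model, a binary that ships only x86_64h (and arm64) has nothing to fall back to on a pre-Haswell CPU, which is the compatibility cost mentioned at the top.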
Among others, the x86_64h slice implies Haswell or newer and therefore enables the SIMD instruction sets SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2 (and auto-vectorization for them), improving performance. @Calinou has recently, for another GOP, benchmarked the performance gain from just SSE4.2 (and the instruction sets it implies) at ~10%. It is reasonable to assume the other flags implied by this architecture change will improve performance further. I have measured up to a 2x throughput increase (high-volume float addition, 50k floats, 600ms vs 300ms), though this improvement is limited to specific applications of AVX2.
Here's a complete list of instruction set changes (reference):
diff <(g++ -Q --help=target) <(g++ -Q -march=haswell --help=target)
(Note that passing -march=haswell is not exactly the same as building for the x86_64h binary slice. For example, -mtune may be set to something other than core2, such as generic. I don't know a good way to test the exact command, though.)
❯ diff <(g++ -Q --help=target) <(g++ -Q -march=haswell --help=target)
31c31
< -march= x86-64
---
> -march= haswell
34,35c34,35
< -mavx [disabled]
< -mavx2 [disabled]
---
> -mavx [enabled]
> -mavx2 [enabled]
60,61c60,61
< -mbmi [disabled]
< -mbmi2 [disabled]
---
> -mbmi [enabled]
> -mbmi2 [enabled]
74,75c74,75
< -mcrc32 [disabled]
< -mcx16 [disabled]
---
> -mcrc32 [enabled]
> -mcx16 [enabled]
82c82
< -mf16c [disabled]
---
> -mf16c [enabled]
88c88
< -mfma [disabled]
---
> -mfma [enabled]
94c94
< -mfsgsbase [disabled]
---
> -mfsgsbase [enabled]
103c103
< -mhle [disabled]
---
> -mhle [enabled]
123c123
< -mlzcnt [disabled]
---
> -mlzcnt [enabled]
130c130
< -mmovbe [disabled]
---
> -mmovbe [enabled]
133c133
< -mmove-max= 128
---
> -mmove-max= 256
144c144
< -mno-sse4 [enabled]
---
> -mno-sse4 [disabled]
151c151
< -mpclmul [disabled]
---
> -mpclmul [enabled]
155c155
< -mpopcnt [disabled]
---
> -mpopcnt [enabled]
166c166
< -mrdrnd [disabled]
---
> -mrdrnd [enabled]
177c177
< -msahf [disabled]
---
> -msahf [enabled]
189,191c189,191
< -msse4 [disabled]
< -msse4.1 [disabled]
< -msse4.2 [disabled]
---
> -msse4 [enabled]
> -msse4.1 [enabled]
> -msse4.2 [enabled]
195c195
< -mssse3 [disabled]
---
> -mssse3 [enabled]
202c202
< -mstore-max= 128
---
> -mstore-max= 256
213c213
< -mtune= core2
---
> -mtune= haswell
226c226
< -mxsave [disabled]
---
> -mxsave [enabled]
228c228
< -mxsaveopt [disabled]
---
> -mxsaveopt [enabled]
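To get a quick summary of the diff, a small script can list the flags that end up enabled on the -march=haswell side. This is just a convenience sketch; the helper name `newly_enabled_flags` is made up, and the sample text is a shortened excerpt of the output above.

```python
import re


def newly_enabled_flags(diff_text):
    """Collect flags reported as [enabled] on the '>' side of the diff."""
    flags = []
    for line in diff_text.splitlines():
        # Matches lines such as "> -mavx2 [enabled]" or "> -msse4.1 [enabled]".
        match = re.match(r">\s*(-m[\w.\-]+)\s+\[enabled\]", line)
        if match:
            flags.append(match.group(1))
    return flags


sample = """\
34,35c34,35
< -mavx [disabled]
< -mavx2 [disabled]
---
> -mavx [enabled]
> -mavx2 [enabled]
144c144
< -mno-sse4 [enabled]
---
> -mno-sse4 [disabled]
"""
print(newly_enabled_flags(sample))  # ['-mavx', '-mavx2']
```

Value-style options such as -march=, -mtune=, -mmove-max= and -mstore-max= are intentionally ignored by this helper, since they do not carry an [enabled] marker.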
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
Fortunately it's pretty simple: detect.py would either:
- Add x86_64h as a target, increasing the universal fat binary size (by up to 50%).
  - It may be possible to add an option to the export template whether to strip x86_64h again from released games, for projects where binary size is more important than speed.
- Replace x86_64 with x86_64h, cutting support for Macs older than 2013.
  - macOS itself cut support for pre-Haswell computers in 2021 with Monterey.
  - If we do this, it's reasonable to ask whether we also want to pass -march=haswell (or similar but lower, such as nehalem) on other x86_64 targets, so that Linux and Windows can benefit from the performance changes too.
- Add the option to add x86_64h to the build, but do not modify any defaults.
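As a rough sketch of what the options above would mean for a universal build, the snippet below maps each option to the slices that would be built and to a `lipo` invocation that merges them. The mode names, the function `plan_universal_build` and the file names are assumptions for illustration only, not existing Godot build options or actual detect.py code.

```python
def plan_universal_build(mode, binary_stem="godot.macos.editor"):
    """Return (slices, lipo_command) for a hypothetical `mode` setting.

    Assumed modes (not existing Godot options):
      "default" - current behaviour, x86_64 + arm64
      "add"     - additionally build the x86_64h slice
      "replace" - build x86_64h instead of x86_64
    """
    slices = {
        "default": ["x86_64", "arm64"],
        "add": ["x86_64", "x86_64h", "arm64"],
        "replace": ["x86_64h", "arm64"],
    }[mode]
    # One per-architecture binary per slice; names are illustrative.
    inputs = [f"{binary_stem}.{arch}" for arch in slices]
    lipo_command = ["lipo", "-create", *inputs,
                    "-output", f"{binary_stem}.universal"]
    return slices, lipo_command


slices, command = plan_universal_build("add")
print(" ".join(command))
```

Assuming the slices are of roughly equal size, going from two slices to three is where the "up to 50%" size increase comes from, while the "replace" mode keeps the slice count (and therefore the size) about the same.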
Additionally, for non-universal GDExtensions, Godot would need to expose an x86_64h feature tag.
If this enhancement will not be used often, can it be worked around with a few lines of script?
It cannot.
Is there a reason why this should be core and not an add-on in the asset library?
It's core.