Support x86_64h builds in macOS, improving performance for nearly all intel macs #11150

Open
@Ivorforce

Describe the project you are working on

Something Godot related.

Describe the problem or limitation you are having in your project

I'm proposing a performance improvement for nearly all Intel Macs (2013 and newer), at the cost of either binary size or compatibility with older Macs.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

On macOS, Godot currently builds for x86_64 and arm64.

macOS supports another relevant type of binary slice, x86_64h. When present, this slice is preferred over x86_64 on all Haswell and newer CPUs, which includes all post-2013 Macs. The x86_64 slice, for comparison, was adopted by Apple in 2006, and its supported instruction set was likely finalized then.
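To see which slices a given universal binary actually carries, the fat header can be read directly. The following is a minimal sketch, not part of Godot: `list_slices` is a hypothetical helper, it only handles the 32-bit fat header (magic `0xCAFEBABE`), and the CPU type and subtype constants are the ones I believe `<mach/machine.h>` defines.

```python
import struct

FAT_MAGIC = 0xCAFEBABE          # big-endian magic of a universal (fat) header
SUBTYPE_BITS = 0x00FFFFFF       # the high byte of cpusubtype holds capability flags

# (cputype, cpusubtype) -> slice name, values assumed from <mach/machine.h>
SLICE_NAMES = {
    (0x01000007, 3): "x86_64",   # CPU_TYPE_X86_64, CPU_SUBTYPE_X86_64_ALL
    (0x01000007, 8): "x86_64h",  # CPU_TYPE_X86_64, CPU_SUBTYPE_X86_64_H
    (0x0100000C, 0): "arm64",    # CPU_TYPE_ARM64,  CPU_SUBTYPE_ARM64_ALL
}


def list_slices(data: bytes) -> list[str]:
    """Return the slice names listed in the fat header at the start of `data`."""
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a universal (fat) Mach-O file")
    names = []
    for i in range(count):
        # Each fat_arch entry is five big-endian uint32 fields (20 bytes);
        # only the first two are needed to identify the slice.
        cputype, cpusubtype = struct.unpack_from(">II", data, 8 + 20 * i)
        key = (cputype, cpusubtype & SUBTYPE_BITS)
        names.append(SLICE_NAMES.get(key, f"unknown({cputype:#x}, {cpusubtype:#x})"))
    return names
```

Usage would look like `list_slices(open(path, "rb").read(4096))`. On a Mac, `lipo -archs <binary>` reports the same information.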

Building the x86_64h slice implies Haswell or newer, and therefore enables (among others) the SIMD instruction sets SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2, along with auto-vectorization for them, improving performance. @Calinou recently benchmarked, for another proposal, the gain from enabling just SSE4.2 (and the instruction sets it implies) at ~10%. It is reasonable to assume that the other flags implied by this architecture change will improve performance further. I have measured up to a 2x throughput increase (high-volume float addition, 50k floats, 600ms vs 300ms), though an improvement of this size is limited to specific applications of AVX2.

Here's a complete list of instruction set changes (reference):

diff <(g++ -Q --help=target) <(g++ -Q -march=haswell --help=target)

(Note that passing -march=haswell is not exactly the same as building for the x86_64h binary slice. For example, -mtune may be set to something other than core2, such as generic. I don't know a good way to test the exact command, though.)

 ❯ diff <(g++ -Q --help=target) <(g++ -Q -march=haswell --help=target)
31c31
<   -march=                                     x86-64
---
>   -march=                                     haswell
34,35c34,35
<   -mavx                                       [disabled]
<   -mavx2                                      [disabled]
---
>   -mavx                                       [enabled]
>   -mavx2                                      [enabled]
60,61c60,61
<   -mbmi                                       [disabled]
<   -mbmi2                                      [disabled]
---
>   -mbmi                                       [enabled]
>   -mbmi2                                      [enabled]
74,75c74,75
<   -mcrc32                                     [disabled]
<   -mcx16                                      [disabled]
---
>   -mcrc32                                     [enabled]
>   -mcx16                                      [enabled]
82c82
<   -mf16c                                      [disabled]
---
>   -mf16c                                      [enabled]
88c88
<   -mfma                                       [disabled]
---
>   -mfma                                       [enabled]
94c94
<   -mfsgsbase                                  [disabled]
---
>   -mfsgsbase                                  [enabled]
103c103
<   -mhle                                       [disabled]
---
>   -mhle                                       [enabled]
123c123
<   -mlzcnt                                     [disabled]
---
>   -mlzcnt                                     [enabled]
130c130
<   -mmovbe                                     [disabled]
---
>   -mmovbe                                     [enabled]
133c133
<   -mmove-max=                                 128
---
>   -mmove-max=                                 256
144c144
<   -mno-sse4                                   [enabled]
---
>   -mno-sse4                                   [disabled]
151c151
<   -mpclmul                                    [disabled]
---
>   -mpclmul                                    [enabled]
155c155
<   -mpopcnt                                    [disabled]
---
>   -mpopcnt                                    [enabled]
166c166
<   -mrdrnd                                     [disabled]
---
>   -mrdrnd                                     [enabled]
177c177
<   -msahf                                      [disabled]
---
>   -msahf                                      [enabled]
189,191c189,191
<   -msse4                                      [disabled]
<   -msse4.1                                    [disabled]
<   -msse4.2                                    [disabled]
---
>   -msse4                                      [enabled]
>   -msse4.1                                    [enabled]
>   -msse4.2                                    [enabled]
195c195
<   -mssse3                                     [disabled]
---
>   -mssse3                                     [enabled]
202c202
<   -mstore-max=                                128
---
>   -mstore-max=                                256
213c213
<   -mtune=                                     core2
---
>   -mtune=                                     haswell
226c226
<   -mxsave                                     [disabled]
---
>   -mxsave                                     [enabled]
228c228
<   -mxsaveopt                                  [disabled]
---
>   -mxsaveopt                                  [enabled]
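One possible way to get closer to the real x86_64h configuration is to ask clang which feature macros it predefines when given `-arch x86_64h`. The sketch below rests on assumptions: that the host is a Mac whose `clang` accepts that `-arch` value, and that the `FEATURE_MACROS` subset I picked is representative. The helper names are hypothetical. It shows which features are enabled, not the tuning model.

```python
import shutil
import subprocess
import sys

# A hand-picked subset of feature macros to look for (an assumption, not exhaustive).
FEATURE_MACROS = ("__SSE4_2__", "__AVX__", "__AVX2__", "__FMA__", "__BMI2__")


def enabled_features(macro_dump: str) -> set[str]:
    """Return the FEATURE_MACROS that appear as #define lines in `macro_dump`."""
    defined = set()
    for line in macro_dump.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            defined.add(parts[1])
    return {macro for macro in FEATURE_MACROS if macro in defined}


def dump_macros(arch: str) -> str:
    """Ask clang for its predefined macros when targeting `arch` (macOS only)."""
    cmd = ["clang", "-arch", arch, "-dM", "-E", "-x", "c", "/dev/null"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__" and sys.platform == "darwin" and shutil.which("clang"):
    for arch in ("x86_64", "x86_64h"):
        print(arch, sorted(enabled_features(dump_macros(arch))))
```

Comparing the two printed sets would show which of the listed features differ between the x86_64 and x86_64h slices for the installed toolchain.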

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Fortunately it's pretty simple: detect.py would either:

  • Add x86_64h as a target, increasing the size of the universal (fat) binary by up to 50%.
    • It may be possible to add an export template option to strip x86_64h from released games again, for projects where binary size is more important than speed.
  • Replace x86_64 with x86_64h, cutting support for Macs older than 2013.
    • macOS itself cut support for pre-Haswell computers in 2021 with Monterey.
    • If we do this, it's reasonable to ask whether we should also pass -march=haswell (or a similar but older baseline, such as nehalem) on other x86_64 targets, so that Linux and Windows can benefit from the performance changes too.
  • Add an option to include x86_64h in the build, but do not modify any defaults.
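The three options above could be sketched as a single build setting. This is a rough illustration only: `macos_arch_flags` and its `mode` values are hypothetical and do not reflect how detect.py is currently structured. The only thing assumed about the toolchain is that Apple clang accepts repeated `-arch` flags, including `-arch x86_64h`.

```python
def macos_arch_flags(mode: str = "off") -> list[str]:
    """Return compiler -arch flags for a universal macOS build.

    mode: "off"     -> x86_64 + arm64 (what Godot builds today)
          "add"     -> also build an x86_64h slice
          "replace" -> build x86_64h instead of x86_64
    """
    archs_by_mode = {
        "off": ["x86_64", "arm64"],
        "add": ["x86_64", "x86_64h", "arm64"],
        "replace": ["x86_64h", "arm64"],
    }
    if mode not in archs_by_mode:
        raise ValueError(f"unknown x86_64h mode: {mode!r}")
    flags = []
    for arch in archs_by_mode[mode]:
        flags += ["-arch", arch]
    return flags
```

With the default left at `"off"`, this corresponds to the third option: the slice is available on request, and nothing changes for existing builds.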

Additionally, for non-universal GDExtensions, Godot would need to expose an x86_64h feature tag.

If this enhancement will not be used often, can it be worked around with a few lines of script?

It cannot.

Is there a reason why this should be core and not an add-on in the asset library?

It's core.
