[a64] Implement an ARM64 backend #2259

Wunkolo · 2024-05-15T02:01:56Z

Implements a 64-bit ARM backend that emits a64 instructions using oaknut.

Depends on #2258 and xenia-project/FFmpeg#8

Addresses #2002

Tested on a ThinkPad X13s, and uses the unit tests from #1348 as well. There is currently an ARMv8.1-A requirement due to the use of some of the newer atomic instructions, such as CASAL.

Wunkolo · 2024-05-23T02:10:52Z

Debugger support, instruction stepping, call-stack unwinding, etc. have been implemented as well.

Uses the `CNTFRQ` and `CNTVCT` system registers as a raw clock source. On my ThinkPad X13s, the raw clock source reports a tick frequency of 19,200,000 Hz, while the platform clock source (`QueryPerformanceFrequency`) reports 10,000,000 Hz. That is almost double the resolution of the platform clock!
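For illustration only, here is a minimal sketch of the kind of frequency scaling a raw tick source needs when its tick rate differs from the rate the consumer expects. `ScaleTicks` is a hypothetical helper name, not a function from this PR, and this is plain host-side C++ rather than the emitted A64 code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: rescale a tick count taken at `from_hz` to `to_hz`.
// Splitting into whole seconds plus a remainder avoids the overflow that a
// naive `ticks * to_hz / from_hz` would hit for large tick counts. It still
// assumes that `from_hz * to_hz` fits in 64 bits.
uint64_t ScaleTicks(uint64_t ticks, uint64_t from_hz, uint64_t to_hz) {
  assert(from_hz != 0);
  const uint64_t whole_seconds = ticks / from_hz;
  const uint64_t remainder = ticks % from_hz;
  return whole_seconds * to_hz + (remainder * to_hz) / from_hz;
}
```

With the frequencies quoted above, one second of raw ticks (19,200,000) scales to 10,000,000 platform ticks.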

Wunkolo · 2024-05-28T20:41:04Z

Latest iteration running Beautiful Katamari and Geometry Wars. Still some minor issues but serving gameplay now.

kata.mp4

geo.wars.mp4

`FMOV` encodes an 8-bit floating-point immediate that can be used to accelerate the loading of certain constant floating-point values with magnitudes between 0.125 and 31.0. A lot of immediates such as -1.0, 1.0, 0.5, etc. fall within this range, and this code gets lots of hits in my testing. This is much cheaper than loading a 32/64-bit value into W0/X0 and moving it into an FP register.
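As a hedged sketch of what "fits in the 8-bit immediate" means for single precision: the value must have the bit pattern `a Bbbbbbcd efgh000...0`, where `B` is the inverse of `b`. The helper name `TryEncodeFmovImm8` below is hypothetical and is not taken from this PR or from oaknut.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true and writes the imm8 field (abcdefgh) if
// `value` is representable as an A64 FMOV floating-point immediate.
bool TryEncodeFmovImm8(float value, uint8_t* imm8) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  // The low 19 mantissa bits must be zero; only 4 mantissa bits are encoded.
  if ((bits & 0x7FFFFu) != 0) return false;
  // Bits 30..25 must be either 100000 or 011111 (NOT(b) followed by bbbbb).
  const uint32_t exp_hi = (bits >> 25) & 0x3Fu;
  if (exp_hi != 0x20u && exp_hi != 0x1Fu) return false;
  const uint32_t sign = bits >> 31;           // a
  const uint32_t b = (bits >> 29) & 1u;       // b
  const uint32_t cd = (bits >> 23) & 3u;      // cd
  const uint32_t efgh = (bits >> 19) & 0xFu;  // efgh
  *imm8 = static_cast<uint8_t>((sign << 7) | (b << 6) | (cd << 4) | efgh);
  return true;
}
```

Under this check 1.0, -1.0, 0.5, 2.0 and 31.0 are encodable, while 0.0, 0.1 and 32.0 are not and would need another path.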

Wunkolo · 2024-05-29T17:56:46Z

No longer requires Armv8.1. Instructions are emitted against an Armv8.0-A baseline, and features such as FP16 and LSE are detected before they are used (and exposed in the feature-mask config, similar to x64).

`dc civac` causes an illegal-instruction exception on Windows-ARM. This is likely a security measure against cache attacks. On Linux this instruction is trapped into an EL1 kernel function. Windows does not seem to have any user-mode cache-maintenance instructions available for the data cache (only for the instruction cache, via `FlushInstructionCache`). The closest thing we can do for now is a full data synchronization barrier with `dsb ish`. Prefetches are implemented using `prfm pldl1keep, ...`.

Wunkolo added 30 commits April 27, 2024 16:45

[Build] Add Windows ARM64 support

1746177

Separates the `Windows` platform into `Windows-x86_64` and `Windows-ARM64`. Adds `--arch` argument to `build`. Removes x64 backend on non-x64 targets.

[Base] Add Windows-ARM64 exception handling

a6d9113

[CPU] Add Windows ARM64 stack-walker

1874f0c

[ImGui] Stub ARM64 host debug text

b48ec84

Marked as TODO for now

[Base] Disable AVX check on ARM64

f254848

[CPU] Disable x64 backend on ARM64

fe9c98e

[Base] Add Windows-ARM64 bit_count implementation

045441a

Uses intrinsics from https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170

[CPU] Stub ARM64 to Null CPU backend

f2b05ea

Adding the `a64` backend will be a different PR. For now it's stubbed to the null backend to allow the main executable to open without failing initialization.

[UI] Fix divide-by-zero hazard

aa4a3e0

This value currently returns `0` on ARM machines, and the resulting divide-by-zero throws an exception.

[CPU] Add ARM64 backend build target

b6355f1

Adds the new `xenia-cpu-backend-a64` build-target with linkage following the x64 backend.

[a64] Integrate oaknut submodule

071e1eb

Header-only library for emitting arm64v8 instructions. Enables C++20 only for the a64 backend for now

[Base] Add ARM64 utility functions

be845c1

Mostly element-accessors

[CPU] Implement ARM64 CPU backend

307d821

First pass framework that gets emitted ARM code executing. Based on the x64 backend, implements an ARM64 JIT backend.

[a64] Fix BYTE_SWAP_V128

50e3b9f

This just reverses the bytes of each 32-bit value rather than reversing the whole vector.
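To make the distinction concrete, here is a small host-side model of a per-lane byte swap. The names `Bytes16` and `ByteSwapLanes32` are made up for this sketch. On A64, `REV32` on byte elements performs this kind of per-word reversal, though which instruction the backend actually emits is not stated here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Reverses the byte order inside each 32-bit lane. The order of the four
// lanes themselves is left unchanged, unlike a full 128-bit reversal.
Bytes16 ByteSwapLanes32(const Bytes16& in) {
  Bytes16 out{};
  for (std::size_t lane = 0; lane < 4; ++lane) {
    for (std::size_t i = 0; i < 4; ++i) {
      out[lane * 4 + i] = in[lane * 4 + (3 - i)];
    }
  }
  return out;
}
```

Bytes `0..15` become `3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12`, whereas a whole-vector reversal would give `15..0`.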

[a64] Implement OPCODE_EXTRACT

be88f31

[a64] Implement OPCODE_SPLAT

60ea2d5

[a64] Implement OPCODE_INSERT

ca3eb1a

[a64] Implement OPCODE_LOAD_VECTOR_SHL

1ff3dad

[a64] Implement OPCODE_LOAD_VECTOR_SHR

bc84198

[a64] Implement OPCODE_PACK(D3DCOLOR)

f8ebd2d

[a64] Implement OPCODE_VECTOR_SHA

8ea2712

[a64] Implement OPCODE_{SHR,SHA}

e9c5897

[a64] Fix StackLayout

524b83e

Wrong register index and vector-register size

[a64] Fix Guest-To-Host native calls

feb8374

These calls need to preserve and restore the `lr` register. Unit tests all run now!

[a64] Fix memory address generation

c8c1171

[a64] Fix indirect and external calls

2652240

[a64] Fix overwriting of return-value registers

6d239c8

These were stomping over X0 and Q0, so input-argument register contents were being returned as the return values. Fixes some guest-to-host calls.

[a64] Implement OPCODE_VECTOR_SHL

b468c31

Vector registers are passed as pointers rather than directly in the `Qn` registers. So these functions should be taking pointer-type arguments rather than vector-register types directly. Fixes `OPCODE_VECTOR_SHL` and passes unit tests.

[a64] Remove volatile storing of X0/Q0

1c8e29e

We don't load it back, so there is no need to store it.

[a64] Implement OPCODE_VECTOR_{SHR,SHA}

1d673f1

Passes all unit tests

Wunkolo added 4 commits May 21, 2024 09:31

[a64] Optimize vector-constant generation

69b16c7

Uses MOVI to optimize some cases of constants rather than EOR. MOVI is a register-renaming idiom on many microarchitectures.

[a64] Optimize memory-address calculation

e94a7f2

The LSL can be embedded into the ADD to remove an additional instruction. What was `cset`+`lsl`+`add` should now just be `cset`+`add ... LSL 12`
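A rough C++ model of the two instruction sequences, just to show they compute the same address. The function names are hypothetical, and `flag` stands in for whatever condition `cset` materializes, which is not spelled out here.

```cpp
#include <cassert>
#include <cstdint>

// Three-step form: materialize the flag, shift it, then add.
uint64_t AddressThreeSteps(uint64_t base, bool flag) {
  const uint64_t bit = flag ? 1 : 0;  // cset
  const uint64_t offset = bit << 12;  // lsl
  return base + offset;               // add
}

// Two-step form: the shift rides along as the add's shifted-register operand.
uint64_t AddressTwoSteps(uint64_t base, bool flag) {
  const uint64_t bit = flag ? 1 : 0;  // cset
  return base + (bit << 12);          // add ..., LSL 12
}
```

Both forms add either `0` or `0x1000` to the base. The second simply drops the standalone shift.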

[a64] Optimize OPCODE_MEMSET

0036991

Use pair-stores rather than singular-stores to write 32 bytes of data at a time.

[a64] Implement OPCODE_LOAD_CLOCK clock_source_raw

1223b68

Uses the `CNTVCT_EL0`-register and applies frequency scaling

Wunkolo added 14 commits May 23, 2024 09:39

[a64] Implement OPCODE_PACK(2101010, 4202020, 8-in-16, 16-in-32)

9f950f3

[a64] Fix OPCODE_PACK saturation edge-cases

9d0c321

Passes cpu-ppc-tests

[a64] Implement OPCODE_UNPACK

da8a790

This is a very literal translation of the x64 code into ARM and may not be very optimized. Passes unit tests, save for a couple of off-by-one errors.

[a64] Implement LSE and FP16C detection

32d4f47

Adds two new flags for allowing the use of LSE and FP16C

[a64] Optimize OPCODE_{UN}PACK(float16) with F16C

8da8335

[a64] Fix OPCODE_PACK(short)

177a4db

Narrow-saturation instructions cause off-by-one rounding errors. Using min+max+shuffle passes more unit tests.

[a64] Optimize bulk VConst access with relative addressing

5c0fbcb

Load the pointer to the VConst table once, and address each constant as an offset from this base, derived from the underlying enum value. Reduces the number of instructions for each VConst memory load.
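The addressing scheme can be sketched in plain C++ as follows. `Vec128`, `ExampleVConst`, and the table contents are placeholders invented for this example and do not mirror the backend's actual constant list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct alignas(16) Vec128 {
  uint32_t u32[4];
};

// Hypothetical constant IDs; the enum value doubles as the table index.
enum ExampleVConst : uint32_t { kExZero, kExOnes, kExSignMask, kExCount };

static const Vec128 kExampleTable[kExCount] = {
    {{0x00000000u, 0x00000000u, 0x00000000u, 0x00000000u}},
    {{0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu}},
    {{0x80000000u, 0x80000000u, 0x80000000u, 0x80000000u}},
};

// Byte offset of a constant from the table base: index * 16.
constexpr std::size_t VConstOffset(ExampleVConst id) {
  return static_cast<std::size_t>(id) * sizeof(Vec128);
}

// With the base pointer already in hand, each access is base + offset.
const Vec128* LoadVConst(const Vec128* base, ExampleVConst id) {
  const auto* bytes = reinterpret_cast<const unsigned char*>(base);
  return reinterpret_cast<const Vec128*>(bytes + VConstOffset(id));
}
```

When several constants are needed in a row, the base is loaded once and each later access only supplies a small immediate offset.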

[a64] Optimize constant vector byte-splats

b3e9b58

Detect when all bytes are repeating and use `MOVI` when applicable
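A minimal sketch of the detection step, assuming the constant is available as 16 raw bytes. `IsByteSplat` is a hypothetical name. A constant that passes this check is a candidate for a single byte-element `MOVI` instead of a memory load.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// True when every byte of the 128-bit constant equals the first byte.
bool IsByteSplat(const Bytes16& v) {
  return std::all_of(v.begin(), v.end(),
                     [&v](uint8_t b) { return b == v[0]; });
}
```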

[a64] Fix OPCODE_SWIZZLE register-aliasing

6d34fc7

Indices and non-const tables were using the same scratch-register

[a64] Remove VOne constant in favor of FMOV

4dbe6c3

[a64] Add arch-agnostic documentation configurations

ffd2a1e

Missed some during the first pass. Now the config files will mention a64 differences.

[a64] Optimize zero MovMem64

c1b59d7

Read directly from the zero register (ZR) in the case that we are just storing a 64-bit or 32-bit zero.

[a64] Implement OPCODE_DID_SATURATE

e82f9ea

This directly maps to the QC bit in the FPSR. We just have to make sure that the saturating instruction is the very last instruction (which is currently the case for stuff like VECTOR_ADD and such).
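The following is a software model only, meant to show the behavior being relied on: a saturating operation sets a sticky flag that can be inspected afterwards. `SatState` and `SaturatingAddS8` are invented for this sketch, and the real backend reads the hardware flag rather than tracking it in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sticky saturation flag, in the spirit of FPSR.QC: set on saturation and
// never cleared by later operations that do not saturate.
struct SatState {
  bool qc = false;
};

// Signed 8-bit saturating add that records saturation in `state`.
int8_t SaturatingAddS8(int8_t a, int8_t b, SatState& state) {
  const int wide = static_cast<int>(a) + static_cast<int>(b);
  const int lo = std::numeric_limits<int8_t>::min();
  const int hi = std::numeric_limits<int8_t>::max();
  if (wide < lo) {
    state.qc = true;
    return static_cast<int8_t>(lo);
  }
  if (wide > hi) {
    state.qc = true;
    return static_cast<int8_t>(hi);
  }
  return static_cast<int8_t>(wide);
}
```

Because the flag is sticky, a "did saturate" query only describes the operation of interest if the flag was clear beforehand and no other saturating operation ran in between.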

Wunkolo added 2 commits May 28, 2024 14:39

[a64] Detect MOVI utilizations for vector-element splats(u8,u16,u32)

0f50d6a

The 64-bit case uses a particular replicated 8-bit immediate, so something else will have to handle that. This catches a lot of cases without having to touch memory. Does not catch cases such as `1.0` (0x3f800000).
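One way to see why `1.0` is missed: the 32-bit `MOVI` form takes an 8-bit immediate shifted left by 0, 8, 16 or 24, so at most one byte of the element may be non-zero. The check below is a sketch under that assumption. `IsShiftedImm8_32` is a hypothetical name, and the "shifting ones" (MSL) variants are deliberately ignored.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when `value` equals imm8 << shift for shift in {0, 8, 16, 24},
// and reports the immediate and shift that were found.
bool IsShiftedImm8_32(uint32_t value, uint8_t* imm8, unsigned* shift) {
  for (unsigned s = 0; s < 32; s += 8) {
    // All bits outside the candidate byte must be zero.
    if ((value & ~(UINT32_C(0xFF) << s)) == 0) {
      *imm8 = static_cast<uint8_t>(value >> s);
      *shift = s;
      return true;
    }
  }
  return false;
}
```

`0x3f800000` has two non-zero bytes (`0x3f` and `0x80`), so it fails this test and needs a different path.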

Wunkolo force-pushed the arm64-backend branch from 47d801f to 0766b7a Compare May 28, 2024 23:21

[a64] Implement armv8.0 atomic operations

d4a9a10

Uses LSE when available, but provides an armv8.0 baseline implementation.
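As a host-side analogy rather than the JIT emitter code itself, a standard C++ compare-exchange shows the same split. On AArch64, a compiler may lower it to a single `CASAL` when LSE is enabled, or to an exclusive load/store retry loop on an Armv8.0 baseline. The wrapper name `CompareExchange32` is made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sequentially consistent compare-and-swap. Returns true if `target` held
// `expected` and was replaced by `desired`; otherwise leaves it untouched.
bool CompareExchange32(std::atomic<uint32_t>& target, uint32_t expected,
                       uint32_t desired) {
  return target.compare_exchange_strong(expected, desired,
                                        std::memory_order_seq_cst);
}
```

The observable result is the same on either path. Only the instruction sequence differs.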

[a64] Remove x64 reference implementations

1076c28

Removes all comments relating to x64 implementation details

Wunkolo force-pushed the arm64-backend branch from 256bf3f to ece1e0c Compare June 2, 2024 19:03

Wunkolo force-pushed the arm64-backend branch from ece1e0c to c785488 Compare June 2, 2024 19:05

[a64] Fix out-of-bounds OPCODE_VECTOR_SHL(all-same) case

d926928

Out-of-bounds shift values are handled modulo the element size.
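A small model of the intended semantics for 8-bit elements, where every lane is shifted by the same amount. `VectorShlU8Same` is a hypothetical helper written for this sketch and is not the backend's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Shifts every 8-bit lane left by `amount`, taken modulo the lane width.
Bytes16 VectorShlU8Same(const Bytes16& in, unsigned amount) {
  const unsigned shift = amount & 7u;  // modulo 8 for 8-bit elements
  Bytes16 out{};
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = static_cast<uint8_t>(in[i] << shift);
  }
  return out;
}
```

A shift of 9 therefore behaves like a shift of 1, and a shift of 8 leaves the lanes unchanged.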
