aarch64: Use NEON when SVE width is 128 bits #338

Open
AWSjswinney wants to merge 1 commit into base: master

On AArch64 systems whose SVE implementation is 128 bits wide, the SVE code path can perform significantly worse than the equivalent NEON code, because the two implementations are optimized differently. The NEON version is unrolled 4 times, which suits the fixed 128-bit width well. The SVE version relies on variable-width operations and can match or beat NEON on systems with 256-bit or 512-bit SVE, but at 128 bits the unrolled NEON implementation is faster due to reduced overhead.

This change adds runtime detection of the SVE vector length and falls back to the optimized NEON implementation when SVE is operating at 128-bit width, so that each AArch64 configuration uses the faster of the two code paths.
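The fallback rule described above can be sketched as a small selection function. This is only an illustration, not the code from this change: the names `codec_choice`, `CODEC_NEON`, `CODEC_SVE` and `choose_codec` are hypothetical, and treating an unreadable vector length (0) the same as 128 bits is an assumption made here.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the project's identifiers. */
enum codec_choice { CODEC_NEON, CODEC_SVE };

/*
 * has_sve:  nonzero if the CPU reports SVE support.
 * sve_bits: detected SVE vector length in bits, or 0 if it could not be read.
 */
static enum codec_choice choose_codec(int has_sve, int sve_bits)
{
    /* Architecturally, SVE vector lengths are multiples of 128 bits. */
    assert(sve_bits >= 0 && sve_bits % 128 == 0);

    /* Without SVE, NEON is the only option. */
    if (!has_sve)
        return CODEC_NEON;

    /* 128-bit SVE: prefer the 4x unrolled NEON code.
     * Assumption: an unreadable length (0) is treated the same way. */
    if (sve_bits <= 128)
        return CODEC_NEON;

    /* Wider SVE (256-bit, 512-bit, ...): use the SVE code path. */
    return CODEC_SVE;
}
```

With this sketch, `choose_codec(1, 128)` selects NEON, while `choose_codec(1, 256)` and `choose_codec(1, 512)` select SVE.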

The vector length is checked with an intrinsic if the compiler supports it (which works on Apple as well); otherwise the code falls back to using prctl.
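A minimal sketch of such a two-tier check follows. It is not the code from this change. It assumes the intrinsic in question is ACLE's `svcntb()` (available when the translation unit is built with SVE enabled, which defines `__ARM_FEATURE_SVE`) and that the prctl fallback is Linux's `PR_SVE_GET_VL`. The helper name `sve_vector_bits` is hypothetical.

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>     /* ACLE SVE intrinsics, e.g. svcntb() */
#elif defined(__linux__) && defined(__aarch64__)
#include <sys/prctl.h>   /* prctl(), PR_SVE_GET_VL on recent kernels */
#endif

/*
 * Hypothetical helper: returns the SVE vector length in bits,
 * or 0 if it cannot be determined on this platform or build.
 */
static int sve_vector_bits(void)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntb() yields the vector length in bytes. It executes an SVE
     * instruction, so call this only after confirming SVE support. */
    return (int)svcntb() * 8;
#elif defined(__linux__) && defined(__aarch64__) && defined(PR_SVE_GET_VL)
    /* On success the low 16 bits hold the length in bytes; the
     * upper bits carry flags. A negative result means no SVE. */
    int ret = prctl(PR_SVE_GET_VL);
    if (ret < 0)
        return 0;
    return (ret & PR_SVE_VL_LEN_MASK) * 8;
#else
    /* Assumption: no portable way to query the length here. */
    return 0;
#endif
}
```

A caller would compare the result against 128 to decide between the NEON and SVE code paths. On builds or platforms where neither method is available, the sketch reports 0 rather than guessing.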

This optimization ensures that systems benefit from:

  • 4x unrolled NEON code on 128-bit SVE systems
  • Variable-width SVE optimizations on wider SVE implementations
  • Maintained compatibility across different AArch64 configurations

Performance improvement on systems with 128-bit SVE:

  • Encode: 7509.80 MB/s → 8995.59 MB/s (+19.8%)
  • Decode: 9383.67 MB/s → 12272.38 MB/s (+30.8%)
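
The percentages above follow from the usual relative-gain formula, (after - before) / before. A small sketch for checking them, where the helper name `percent_gain` is hypothetical:

```c
/* Relative throughput gain in percent: (after - before) / before * 100. */
static double percent_gain(double before_mbps, double after_mbps)
{
    return (after_mbps - before_mbps) / before_mbps * 100.0;
}
```

Applied to the figures above, `percent_gain(7509.80, 8995.59)` comes out at roughly 19.8 and `percent_gain(9383.67, 12272.38)` at roughly 30.8.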
