Skip to content

Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension #2409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

izard
Copy link

@izard izard commented Apr 1, 2025

There are already external/aor/memset and memcpy available.
This PR adds AOR memchr, patched to use SHA512 when available for ~2x performance boost on large strings character search.
This helps some velox benchmarks and presumably workloads. There is a cost of lower performance on very short strings.
build/folly/memchr_bench

[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

std_memchr: size=1 1.51ns 662.53M
std_memchr: size=2 1.51ns 662.53M
std_memchr: size=4 1.51ns 662.32M
std_memchr: size=8 1.51ns 662.32M
std_memchr: size=16 1.51ns 662.53M
std_memchr: size=32 1.65ns 605.18M
std_memchr: size=64 2.13ns 470.25M
std_memchr: size=128 3.24ns 308.85M
std_memchr: size=256 5.87ns 170.31M
std_memchr: size=512 11.77ns 84.95M
std_memchr: size=1024 23.67ns 42.25M
std_memchr: size=2048 48.25ns 20.72M
std_memchr: size=4096 97.41ns 10.27M
std_memchr: size=8192 195.94ns 5.10M
std_memchr: size=16384 399.43ns 2.50M
std_memchr: size=32768 794.17ns 1.26M

folly_memchr: size=1 1.80ns 555.14M
folly_memchr: size=2 1.80ns 555.14M
folly_memchr: size=4 1.80ns 555.14M
folly_memchr: size=8 1.80ns 555.14M
folly_memchr: size=16 1.80ns 555.14M
folly_memchr: size=32 2.10ns 475.49M
folly_memchr: size=64 2.41ns 415.15M
folly_memchr: size=128 3.32ns 301.45M
folly_memchr: size=256 4.22ns 236.82M
folly_memchr: size=512 6.93ns 144.27M
folly_memchr: size=1024 12.06ns 82.89M
folly_memchr: size=2048 22.83ns 43.81M
folly_memchr: size=4096 44.89ns 22.28M
folly_memchr: size=8192 87.97ns 11.37M
folly_memchr: size=16384 174.78ns 5.72M
folly_memchr: size=32768 352.62ns 2.84M

…racter search on ARM devices that support SHA512 instruction
izard added a commit to izard/velox that referenced this pull request Apr 1, 2025
… implementation based on optimized memchr from folly. This provides performance boost to two Like subbenchmarks.

Depends on Folly PR: facebook/folly#2409
@facebook-github-bot
Copy link
Contributor

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


#include <cstring>

#if !defined(__AVX2__) && !(defined(__linux__) && defined(__aarch64__))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an AVX2 implementation somewhere? If not, why are we disabling the fallback when defined(__AVX2__)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cut&paste error, will remove the AVX fallback. I can no longer work on a better AVX2 implementation :)

if (size < 128) {
// Libc AARCH64 memchr is more efficient for short strings.
// This dispatch is less efficient than including it as a case for small strings,
// but can't do that because it of GPL
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to fix comment typo

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait a second. are you saying that we can't patch AOR because of GPL?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fastest code would be if I cut small string handling from GPLed GLIBC and paste into Apache AOR.
But as I can't copy GPLed source code into Apache source code, I have to call it explicitly for small strings.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me talk to some folks internally and figure out how we move forward here.

@embg
Copy link

embg commented Apr 1, 2025

The regression for strings up to 32 bytes is pretty concerning. Are you confident it's real, not noise? I don't think we can merge this with the regression, because it will cause a large fraction of memchr in our workloads to use the slower version. I think a lot of strings in our workloads are less than 32 bytes.

Can we patch the AOR implementations to get rid of this regression?

@izard
Copy link
Author

izard commented Apr 2, 2025

The regression for strings up to 32 bytes is pretty concerning. Are you confident it's real, not noise?

It is certainly not noise, libc memcpy takes ~5 CPU cycles for short strings, and proposed new folly memcpy takes ~6 CPU cycles, which first seems like a huge 20% performance regression. However, using execution latency difference of 1 CPU cycle for performance analysis is misleading on OOO core. The real code sequence containing short string search will have few hundreds of u-ops in flight with OOO core utilizing ILP, and this extra CPU cycle might stay, might be masked or even multiplied with some probability, but the performance difference will be still within ~1%.

To back this up, I changed the benchmark in folly to interleave memchr with short memcpy calls, and average performance regression on small strings went down to 4%, and when I added second memchr it went down to 1%, while large strings still shown ~2.2x performance uplift.

I don't think we can merge this with the regression, because it will cause a large fraction of memchr in our workloads to use the slower version. I think a lot of strings in our workloads are less than 32 bytes.

Are there any benchmarks in folly and/or velox that I can try that could demonstrate the effect on smaller strings? So far I ran all velox benchmarks with this change and I only see ~20% and ~25% improvement on two Like sub-benchmarks that were lagging on ARM, and no other difference.

Can we patch the AOR implementations to get rid of this regression?

Technically yes, but I don't like pasting GPL code into Apache code, even if it is just a sequence of 10 assembly instructions.

@embg
Copy link

embg commented Apr 2, 2025

So I discussed this with the cabal internally. We are really worried about regressing the short string case, because that's a really large chunk of memchr usage in our fleet.

We propose adding a new function folly::memchr_long in folly which will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86. Your Velox PR can then call this folly::memchr_long function.

Is this acceptable from your side?

@embg
Copy link

embg commented Apr 2, 2025

Long term, we would urge Nvidia / Arm to upstream the long string optimizations from AOR into glibc.

@izard
Copy link
Author

izard commented Apr 2, 2025

We propose adding a new function folly::memchr_long in folly which will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86. Your Velox PR can then call this folly::memchr_long function.
Is this acceptable from your side?

Thank would work, I'll change the PRs.

Long term, we would urge Nvidia / Arm to upstream the long string optimizations from AOR into glibc.

Makes sense!

…h will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86
izard added a commit to izard/velox that referenced this pull request Apr 2, 2025
@izard
Copy link
Author

izard commented Apr 2, 2025

Changed the Folly and Velox respective PRs based on the conversation above: now folly::memchr_long is the recommended way to search for a character in a string equal or longer than 128 bytes. As there is no dispatch one string length any more, perf difference for 1-32 bytes strings worsened:

============================================================================
[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

std_memchr: size=1 1.53ns 653.88M
std_memchr: size=2 1.53ns 653.27M
std_memchr: size=4 1.53ns 654.31M
std_memchr: size=8 1.53ns 654.51M
std_memchr: size=16 1.53ns 653.26M
std_memchr: size=32 2.23ns 448.24M
std_memchr: size=64 2.68ns 373.11M
std_memchr: size=128 3.80ns 263.31M
std_memchr: size=256 6.34ns 157.83M
std_memchr: size=512 12.16ns 82.24M
std_memchr: size=1024 24.59ns 40.66M
std_memchr: size=2048 49.45ns 20.22M
std_memchr: size=4096 99.31ns 10.07M
std_memchr: size=8192 199.18ns 5.02M
std_memchr: size=16384 405.30ns 2.47M
std_memchr: size=32768 805.16ns 1.24M

folly_memchr: size=1 2.60ns 385.05M
folly_memchr: size=2 2.60ns 385.05M
folly_memchr: size=4 2.60ns 385.12M
folly_memchr: size=8 2.60ns 385.05M
folly_memchr: size=16 2.60ns 385.05M
folly_memchr: size=32 2.60ns 385.05M
folly_memchr: size=64 2.75ns 363.65M
folly_memchr: size=128 3.21ns 311.63M
folly_memchr: size=256 4.28ns 233.74M
folly_memchr: size=512 7.27ns 137.56M
folly_memchr: size=1024 12.66ns 78.99M
folly_memchr: size=2048 23.49ns 42.58M
folly_memchr: size=4096 45.94ns 21.77M
folly_memchr: size=8192 89.72ns 11.15M
folly_memchr: size=16384 177.46ns 5.63M
folly_memchr: size=32768 358.80ns 2.79M

CMakeLists.txt Outdated
set_property(
SOURCE
${AOR_FILE}
APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp"
APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp" "-march=armv8.2-a+crypto+sve2-sha3"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean older Arm chips will not benefit from AOR implementations?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8.2A is overkill, it is safe to change to 8A. I am not aware about older ARM CPUs with AARCH64 support than 8 family.

@@ -0,0 +1,29 @@
/*
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want a different file path. The other headers under folly/ don't follow the pattern Folly[Name].h.

@yfeldblum can you advise on the right path for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h. Reorganization later might be appropriate.

Alternatively, we can place this within folly/lang/CString.h. That would be a suitable place for ports and variants of things which libc string.h ships (like this large-buffer variant of memchr and the existing port of BSD's strlcpy).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can change the path once decided. This comes from copying FollyMemset.h

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, we shouldn't have merged that :) might be too late to change now...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfeldblum I vote for folly/lang/CString.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg We can always change where eg memset lives if we want to.

Copy link
Contributor

@yfeldblum yfeldblum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR overall. Requesting some tweaks around the edges.

* devices with h/w SHA512 instruction support. MTE compatibility TBD
*/

ENTRY (__folly_memchr_aarch64_simd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include a description of the variant in the name - this is a long-string variant - __folly_memchr_long_aarch64_simd.

@@ -0,0 +1,29 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h. Reorganization later might be appropriate.

Alternatively, we can place this within folly/lang/CString.h. That would be a suitable place for ports and variants of things which libc string.h ships (like this large-buffer variant of memchr and the existing port of BSD's strlcpy).

return __folly_memchr_aarch64;
}

[[gnu::ifunc("__folly_detail_memchr_long_resolve")]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for @embg - the PR would need an edit internally to hook up the Buck build rules.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait wait wait, why do we need ifunc magic for this PR? I don't see how memchr_long is different than any other simd kernel, like a math kernel, where we don't use ifunc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, ifunc magic is no longer required as memchr is no longer redefined, I'll fix tomorrow my morning time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually after second thought and reading your comment regarding older ARMs, this runtime dispatch still selects between SHA512 and generic AOR version, where SHA512 is not available.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be implemented using ifunc. It can be either compile-time dispatch (using macro guards) or runtime dispatch (using cpuid).

ifunc is solving a very specific problem where other translation units call memset and we want to replace that with folly memset.

In this case you have the ability to write a wrapper which dispatches correctly, so ifunc isn't needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally we do compile with sha3 enabled, so compile-time is fine for us. If runtime dispatch isn't completely trivial to implement, I think compile-time dispatch is fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed ifunc, added compile time dispatch.

decltype(&__folly_memchr_aarch64) __folly_detail_memchr_long_resolve(
uint64_t hwcaps, const void* arg2) {
#if defined(_IFUNC_ARG_HWCAP)
if (hwcaps & HWCAP_SHA3) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for @embg - the PR would need a pass of clang-format (via arc f internally).

@embg
Copy link

embg commented Apr 7, 2025

I will take another pass looking at the new changes tomorrow, if it looks good I will pull this in and work on merging.

@facebook-github-bot
Copy link
Contributor

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@embg
Copy link

embg commented Apr 8, 2025

There's a very simple issue which was caught by internal CI. Should be trivial to fix.

folly/folly/FollyMemchr.cpp:23:21: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
   23 |   return std::memchr(ptr, ch, count);
      |          ~~~~~~~~~~~^~~~~~~~~~~~~~~~
      |                     |
      |                     const void*
folly/folly/FollyMemchr.cpp: At global scope:
folly/folly/FollyMemchr.cpp:41:1: error: expected declaration before ‘}’ token
   41 | } // namespace folly
folly/folly/FollyMemchr.cpp:23:10: error: cannot initialize return object of type 'void *' with an rvalue of type 'const void *'
   23 |   return std::memchr(ptr, ch, count);
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
folly/folly/FollyMemchr.cpp:41:1: error: extraneous closing brace ('}')
   41 | } // namespace folly
      | ^

Edit: I got someone to approve the workflows on this PR and they are now failing in OSS CI as well (see below).


extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) {
#ifdef FOLLY_ARM_FEATURE_SHA3
return __folly_memchr_long_aarch64_sha512(ptr, ch, count);
Copy link

@embg embg Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHA-512 usually refers to a variant of SHA-2. You can ctrl-f for "SHA-512" on this page to see what I mean: https://en.wikipedia.org/wiki/SHA-2

That's a bit confusing because we are talking about Arm's SHA-3 instructions, not SHA-2 instructions.

Since you have to fix the CI failure anyway, please change all SHA-512 references to SHA-3. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg I notice that the implementation uses the sha512h instruction and the FEAT_SHA512 feature-set. I would have assumed that this actually refers to sha512 as vs sha3?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I’m confused why we are gating with FOLLY_ARM_FEATURE_SHA3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to check for FOLLY_ARM_FEATURE_SHA3. Originally I experimented with BCAX and SHA1C which are part of SHA3, and later switched to SHA512 which is most efficient for the cause, but forgot to change the feature set. All ARMs I know that have SHA3 also have SHA2, but I am not sure if older Arms with SHA2 only support would have the 2/1 latency/throughput retiring for this instruction. I mean it will work on older Arms, but probably not as fast.

@embg embg changed the title Added AOR memchr implementation that uses SHA512 optimization Added AOR memchr implementation that uses instructions from aarch64's SHA3 ISA extension Apr 8, 2025
@@ -0,0 +1,29 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg We can always change where eg memset lives if we want to.


extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) {
#ifdef FOLLY_ARM_FEATURE_SHA3
return __folly_memchr_long_aarch64_sha512(ptr, ch, count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embg I notice that the implementation uses the sha512h instruction and the FEAT_SHA512 feature-set. I would have assumed that this actually refers to sha512 as vs sha3?

Comment on lines +74 to +78
volatile static uint64_t result_offset = 0;
static FOLLY_NOINLINE void* std_memchr(const void *s, int c, size_t l) {
result_offset = (uint64_t)std::memchr((void *)s, c, l)-(uint64_t)s;
return (void*)((uint64_t)s + result_offset);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally use folly::compiler_must_not_predict and folly::compiler_must_not_elide for this purpose. Also wrapped as folly::doNotPredict and folly::doNotElide in the benchmark library.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarification note: instead of volatile

@embg
Copy link

embg commented Apr 9, 2025

I think this is very close to being ready, I am going to pull it in and will handle any remaining nits myself.

@embg
Copy link

embg commented Apr 9, 2025

@izard Last request I have from you. Could you take a look at why CMake is failing?

CMake Error at C:/Program Files/CMake/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Boost (missing: Boost_INCLUDE_DIR context filesystem
  program_options regex system thread) (Required is at least version
  "1.51.0")

Example failing job: https://github.com/facebook/folly/actions/runs/14361949326/job/40268841547?pr=2409

I see that you did touch CMakeLists.txt. Is it possible that your change there broke CMake somehow?

@izard
Copy link
Author

izard commented Apr 9, 2025

@izard Last request I have from you. Could you take a look at why CMake is failing?
CMake Error at C:/Program Files/CMake/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
Could NOT find Boost .. (Required is at least version "1.51.0")
Example failing job: https://github.com/facebook/folly/actions/runs/14361949326/job/40268841547?pr=2409
I see that you did touch CMakeLists.txt. Is it possible that your change there broke CMake somehow?

I don't have Windows to test, only Mac and Linux. But in both my envs I just manually installed Boost to satisfy this requirement. I did not add the Boost requirement, my only changes to Cmake are adding memchr test, benchmark, and source files; and changing asm compiler flags, this should not affect boost. So I think master version of folly won't compile in your environment either.

@izard izard changed the title Added AOR memchr implementation that uses instructions from aarch64's SHA3 ISA extension Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension Apr 10, 2025
@facebook-github-bot
Copy link
Contributor

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +234 to +236
L(zero_length):
mov result, #0
ret
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting this 2-instruction / 8-byte sequence, and replacing each of 6 occurrences with a 1-instruction / 4-byte sequence, reduces the size of the function by 4 instructions / 16 bytes overall.

However, this 2-instruction sequence is definitely in a different cacheline from each of the occurrences it replaces.

Has this tradeoff been analyzed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting observation. This only would matter for performance on very small inputs, and this function is only called from memchr_long.

10% perf improvement from re-arranging SHA512 daisy chain tree which enables using second half of result, sparing horizontal sum instruction.
@facebook-github-bot
Copy link
Contributor

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants