Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension #2409

izard · 2025-04-01T12:23:46Z

There are already external/aor/memset and memcpy available.
This PR adds AOR memchr, patched to use SHA512 when available for ~2x performance boost on large strings character search.
This helps some velox benchmarks and presumably workloads. There is a cost of lower performance on very short strings.
build/folly/memchr_bench

[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

std_memchr: size=1 1.51ns 662.53M
std_memchr: size=2 1.51ns 662.53M
std_memchr: size=4 1.51ns 662.32M
std_memchr: size=8 1.51ns 662.32M
std_memchr: size=16 1.51ns 662.53M
std_memchr: size=32 1.65ns 605.18M
std_memchr: size=64 2.13ns 470.25M
std_memchr: size=128 3.24ns 308.85M
std_memchr: size=256 5.87ns 170.31M
std_memchr: size=512 11.77ns 84.95M
std_memchr: size=1024 23.67ns 42.25M
std_memchr: size=2048 48.25ns 20.72M
std_memchr: size=4096 97.41ns 10.27M
std_memchr: size=8192 195.94ns 5.10M
std_memchr: size=16384 399.43ns 2.50M
std_memchr: size=32768 794.17ns 1.26M

folly_memchr: size=1 1.80ns 555.14M
folly_memchr: size=2 1.80ns 555.14M
folly_memchr: size=4 1.80ns 555.14M
folly_memchr: size=8 1.80ns 555.14M
folly_memchr: size=16 1.80ns 555.14M
folly_memchr: size=32 2.10ns 475.49M
folly_memchr: size=64 2.41ns 415.15M
folly_memchr: size=128 3.32ns 301.45M
folly_memchr: size=256 4.22ns 236.82M
folly_memchr: size=512 6.93ns 144.27M
folly_memchr: size=1024 12.06ns 82.89M
folly_memchr: size=2048 22.83ns 43.81M
folly_memchr: size=4096 44.89ns 22.28M
folly_memchr: size=8192 87.97ns 11.37M
folly_memchr: size=16384 174.78ns 5.72M
folly_memchr: size=32768 352.62ns 2.84M

…racter search on ARM devices that support SHA512 instruction

… implementation based on optimized memchr from folly. This provides performance boost to two Like subbenchmarks. Depends on Folly PR: facebook/folly#2409

facebook-github-bot · 2025-04-01T15:56:12Z

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

embg · 2025-04-01T16:05:26Z

folly/FollyMemchr.cpp

+
+#include <cstring>
+
+#if !defined(__AVX2__) && !(defined(__linux__) && defined(__aarch64__))


Is there an AVX2 implementation somewhere? If not, why are we disabling the fallback when defined(__AVX2__)?

cut&paste error, will remove the AVX fallback. I can no longer work on a better AVX2 implementation :)

izard · 2025-04-01T16:09:04Z

folly/memchr_select_aarch64.cpp

+    if (size < 128) {
+    // Libc AARCH64 memchr is more efficient for short strings. 
+    // This dispatch is less efficient than including it as a case for small strings,
+    // but can't do that because it of GPL


to fix comment typo

wait a second. are you saying that we can't patch AOR because of GPL?

The fastest code would be if I cut small string handling from GPLed GLIBC and paste into Apache AOR.
But as I can't copy GPLed source code into Apache source code, I have to call it explicitly for small strings.

Let me talk to some folks internally and figure out how we move forward here.

embg · 2025-04-01T16:11:07Z

The regression for strings up to 32 bytes is pretty concerning. Are you confident it's real, not noise? I don't think we can merge this with the regression, because it will cause a large fraction of memchr in our workloads to use the slower version. I think a lot of strings in our workloads are less than 32 bytes.

Can we patch the AOR implementations to get rid of this regression?

izard · 2025-04-02T09:36:02Z

The regression for strings up to 32 bytes is pretty concerning. Are you confident it's real, not noise?

It is certainly not noise, libc memcpy takes ~5 CPU cycles for short strings, and proposed new folly memcpy takes ~6 CPU cycles, which first seems like a huge 20% performance regression. However, using execution latency difference of 1 CPU cycle for performance analysis is misleading on OOO core. The real code sequence containing short string search will have few hundreds of u-ops in flight with OOO core utilizing ILP, and this extra CPU cycle might stay, might be masked or even multiplied with some probability, but the performance difference will be still within ~1%.

To back this up, I changed the benchmark in folly to interleave memchr with short memcpy calls, and average performance regression on small strings went down to 4%, and when I added second memchr it went down to 1%, while large strings still shown ~2.2x performance uplift.

I don't think we can merge this with the regression, because it will cause a large fraction of memchr in our workloads to use the slower version. I think a lot of strings in our workloads are less than 32 bytes.

Are there any benchmarks in folly and/or velox that I can try that could demonstrate the effect on smaller strings? So far I ran all velox benchmarks with this change and I only see ~20% and ~25% improvement on two Like sub-benchmarks that were lagging on ARM, and no other difference.

Can we patch the AOR implementations to get rid of this regression?

Technically yes, but I don't like pasting GPL code into Apache code, even if it is just a sequence of 10 assembly instructions.

…able in this PR

embg · 2025-04-02T11:55:22Z

So I discussed this with the cabal internally. We are really worried about regressing the short string case, because that's a really large chunk of memchr usage in our fleet.

We propose adding a new function folly::memchr_long in folly which will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86. Your Velox PR can then call this folly::memchr_long function.

Is this acceptable from your side?

embg · 2025-04-02T11:57:52Z

Long term, we would urge Nvidia / Arm to upstream the long string optimizations from AOR into glibc.

izard · 2025-04-02T12:34:17Z

We propose adding a new function folly::memchr_long in folly which will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86. Your Velox PR can then call this folly::memchr_long function.
Is this acceptable from your side?

Thank would work, I'll change the PRs.

Long term, we would urge Nvidia / Arm to upstream the long string optimizations from AOR into glibc.

Makes sense!

…h will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86

…n revised facebook/folly#2409

izard · 2025-04-02T14:28:20Z

Changed the Folly and Velox respective PRs based on the conversation above: now folly::memchr_long is the recommended way to search for a character in a string equal or longer than 128 bytes. As there is no dispatch one string length any more, perf difference for 1-32 bytes strings worsened:

============================================================================
[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

std_memchr: size=1 1.53ns 653.88M
std_memchr: size=2 1.53ns 653.27M
std_memchr: size=4 1.53ns 654.31M
std_memchr: size=8 1.53ns 654.51M
std_memchr: size=16 1.53ns 653.26M
std_memchr: size=32 2.23ns 448.24M
std_memchr: size=64 2.68ns 373.11M
std_memchr: size=128 3.80ns 263.31M
std_memchr: size=256 6.34ns 157.83M
std_memchr: size=512 12.16ns 82.24M
std_memchr: size=1024 24.59ns 40.66M
std_memchr: size=2048 49.45ns 20.22M
std_memchr: size=4096 99.31ns 10.07M
std_memchr: size=8192 199.18ns 5.02M
std_memchr: size=16384 405.30ns 2.47M
std_memchr: size=32768 805.16ns 1.24M

folly_memchr: size=1 2.60ns 385.05M
folly_memchr: size=2 2.60ns 385.05M
folly_memchr: size=4 2.60ns 385.12M
folly_memchr: size=8 2.60ns 385.05M
folly_memchr: size=16 2.60ns 385.05M
folly_memchr: size=32 2.60ns 385.05M
folly_memchr: size=64 2.75ns 363.65M
folly_memchr: size=128 3.21ns 311.63M
folly_memchr: size=256 4.28ns 233.74M
folly_memchr: size=512 7.27ns 137.56M
folly_memchr: size=1024 12.66ns 78.99M
folly_memchr: size=2048 23.49ns 42.58M
folly_memchr: size=4096 45.94ns 21.77M
folly_memchr: size=8192 89.72ns 11.15M
folly_memchr: size=16384 177.46ns 5.63M
folly_memchr: size=32768 358.80ns 2.79M

embg · 2025-04-02T15:03:51Z

CMakeLists.txt

    set_property(
      SOURCE
      ${AOR_FILE}
-      APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp"
+      APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp" "-march=armv8.2-a+crypto+sve2-sha3"


Does this mean older Arm chips will not benefit from AOR implementations?

8.2A is overkill, it is safe to change to 8A. I am not aware about older ARM CPUs with AARCH64 support than 8 family.

embg · 2025-04-02T15:06:06Z

folly/FollyMemchr.h

@@ -0,0 +1,29 @@
+/*


I think we may want a different file path. The other headers under folly/ don't follow the pattern Folly[Name].h.

@yfeldblum can you advise on the right path for this?

@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h. Reorganization later might be appropriate.

Alternatively, we can place this within folly/lang/CString.h. That would be a suitable place for ports and variants of things which libc string.h ships (like this large-buffer variant of memchr and the existing port of BSD's strlcpy).

Sure, I can change the path once decided. This comes from copying FollyMemset.h

Oh man, we shouldn't have merged that :) might be too late to change now...

@yfeldblum I vote for folly/lang/CString.h.

@embg We can always change where eg memset lives if we want to.

yfeldblum

Nice PR overall. Requesting some tweaks around the edges.

yfeldblum · 2025-04-02T16:30:18Z

folly/external/aor/memchr-advsimd.S

+ * devices with h/w SHA512 instruction support. MTE compatibility TBD
+ */
+
+ENTRY (__folly_memchr_aarch64_simd)


Let's include a description of the variant in the name - this is a long-string variant - __folly_memchr_long_aarch64_simd.

yfeldblum · 2025-04-02T16:31:34Z

folly/FollyMemchr.h

@@ -0,0 +1,29 @@
+/*


@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h. Reorganization later might be appropriate.

Alternatively, we can place this within folly/lang/CString.h. That would be a suitable place for ports and variants of things which libc string.h ships (like this large-buffer variant of memchr and the existing port of BSD's strlcpy).

folly/external/aor/memchr-armv8.S

folly/test/MemchrBenchmark.cpp

folly/test/MemchrTest.cpp

yfeldblum · 2025-04-02T16:45:23Z

folly/memchr_select_aarch64.cpp

+  return __folly_memchr_aarch64;
+}
+
+[[gnu::ifunc("__folly_detail_memchr_long_resolve")]]


Note for @embg - the PR would need an edit internally to hook up the Buck build rules.

Wait wait wait, why do we need ifunc magic for this PR? I don't see how memchr_long is different than any other simd kernel, like a math kernel, where we don't use ifunc.

You are right, ifunc magic is no longer required as memchr is no longer redefined, I'll fix tomorrow my morning time.

Actually after second thought and reading your comment regarding older ARMs, this runtime dispatch still selects between SHA512 and generic AOR version, where SHA512 is not available.

This doesn't need to be implemented using ifunc. It can be either compile-time dispatch (using macro guards) or runtime dispatch (using cpuid).

ifunc is solving a very specific problem where other translation units call memset and we want to replace that with folly memset.

In this case you have the ability to write a wrapper which dispatches correctly, so ifunc isn't needed.

Internally we do compile with sha3 enabled, so compile-time is fine for us. If runtime dispatch isn't completely trivial to implement, I think compile-time dispatch is fine.

Removed ifunc, added compile time dispatch.

yfeldblum · 2025-04-02T16:46:33Z

folly/memchr_select_aarch64.cpp

+decltype(&__folly_memchr_aarch64) __folly_detail_memchr_long_resolve(
+    uint64_t hwcaps, const void* arg2) {
+#if defined(_IFUNC_ARG_HWCAP)
+    if (hwcaps & HWCAP_SHA3) {


Note for @embg - the PR would need a pass of clang-format (via arc f internally).

folly/memchr_select_aarch64.cpp

…hr versions as discussed in the PR comments

embg · 2025-04-07T21:57:49Z

I will take another pass looking at the new changes tomorrow, if it looks good I will pull this in and work on merging.

facebook-github-bot · 2025-04-07T22:04:14Z

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

embg · 2025-04-08T16:18:57Z

There's a very simple issue which was caught by internal CI. Should be trivial to fix.

folly/folly/FollyMemchr.cpp:23:21: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
   23 |   return std::memchr(ptr, ch, count);
      |          ~~~~~~~~~~~^~~~~~~~~~~~~~~~
      |                     |
      |                     const void*
folly/folly/FollyMemchr.cpp: At global scope:
folly/folly/FollyMemchr.cpp:41:1: error: expected declaration before ‘}’ token
   41 | } // namespace folly

folly/folly/FollyMemchr.cpp:23:10: error: cannot initialize return object of type 'void *' with an rvalue of type 'const void *'
   23 |   return std::memchr(ptr, ch, count);
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
folly/folly/FollyMemchr.cpp:41:1: error: extraneous closing brace ('}')
   41 | } // namespace folly
      | ^

Edit: I got someone to approve the workflows on this PR and they are now failing in OSS CI as well (see below).

embg · 2025-04-08T19:33:16Z

folly/FollyMemchr.cpp

+
+extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) {
+#ifdef FOLLY_ARM_FEATURE_SHA3
+  return __folly_memchr_long_aarch64_sha512(ptr, ch, count);


SHA-512 usually refers to a variant of SHA-2. You can ctrl-f for "SHA-512" on this page to see what I mean: https://en.wikipedia.org/wiki/SHA-2

That's a bit confusing because we are talking about Arm's SHA-3 instructions, not SHA-2 instructions.

Since you have to fix the CI failure anyway, please change all SHA-512 references to SHA-3. Thanks.

@embg I notice that the implementation uses the sha512h instruction and the FEAT_SHA512 feature-set. I would have assumed that this actually refers to sha512 as vs sha3?

Then I’m confused why we are gating with FOLLY_ARM_FEATURE_SHA3

Fixed to check for FOLLY_ARM_FEATURE_SHA3. Originally I experimented with BCAX and SHA1C which are part of SHA3, and later switched to SHA512 which is most efficient for the cause, but forgot to change the feature set. All ARMs I know that have SHA3 also have SHA2, but I am not sure if older Arms with SHA2 only support would have the 2/1 latency/throughput retiring for this instruction. I mean it will work on older Arms, but probably not as fast.

yfeldblum · 2025-04-08T20:59:51Z

folly/FollyMemchr.h

@@ -0,0 +1,29 @@
+/*


@embg We can always change where eg memset lives if we want to.

yfeldblum · 2025-04-08T21:03:03Z

folly/FollyMemchr.cpp

+
+extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) {
+#ifdef FOLLY_ARM_FEATURE_SHA3
+  return __folly_memchr_long_aarch64_sha512(ptr, ch, count);


@embg I notice that the implementation uses the sha512h instruction and the FEAT_SHA512 feature-set. I would have assumed that this actually refers to sha512 as vs sha3?

folly/test/MemchrBenchmark.cpp

folly/test/MemchrTest.cpp

yfeldblum · 2025-04-09T16:16:34Z

folly/test/MemchrBenchmark.cpp

+volatile static uint64_t result_offset = 0;
+static FOLLY_NOINLINE void* std_memchr(const void *s, int c, size_t l) {
+  result_offset =  (uint64_t)std::memchr((void *)s, c, l)-(uint64_t)s;
+  return (void*)((uint64_t)s + result_offset);
+}


We generally use folly::compiler_must_not_predict and folly::compiler_must_not_elide for this purpose. Also wrapped as folly::doNotPredict and folly::doNotElide in the benchmark library.

clarification note: instead of volatile

embg · 2025-04-09T17:06:24Z

I think this is very close to being ready, I am going to pull it in and will handle any remaining nits myself.

embg · 2025-04-09T17:11:11Z

@izard Last request I have from you. Could you take a look at why CMake is failing?

CMake Error at C:/Program Files/CMake/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Boost (missing: Boost_INCLUDE_DIR context filesystem
  program_options regex system thread) (Required is at least version
  "1.51.0")

Example failing job: https://github.com/facebook/folly/actions/runs/14361949326/job/40268841547?pr=2409

I see that you did touch CMakeLists.txt. Is it possible that your change there broke CMake somehow?

izard · 2025-04-09T17:23:17Z

@izard Last request I have from you. Could you take a look at why CMake is failing?
CMake Error at C:/Program Files/CMake/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
Could NOT find Boost .. (Required is at least version "1.51.0")
Example failing job: https://github.com/facebook/folly/actions/runs/14361949326/job/40268841547?pr=2409
I see that you did touch CMakeLists.txt. Is it possible that your change there broke CMake somehow?

I don't have Windows to test, only Mac and Linux. But in both my envs I just manually installed Boost to satisfy this requirement. I did not add the Boost requirement, my only changes to Cmake are adding memchr test, benchmark, and source files; and changing asm compiler flags, this should not affect boost. So I think master version of folly won't compile in your environment either.

facebook-github-bot · 2025-04-15T18:25:16Z

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

yfeldblum · 2025-05-12T15:59:58Z

folly/external/aor/memchr-advsimd.S

+L(zero_length):
+	mov	result, #0
+	ret


Extracting this 2-instruction / 8-byte sequence, and replacing each of 6 occurrences with a 1-instruction / 4-byte sequence, reduces the size of the function by 4 instructions / 16 bytes overall.

However, this 2-instruction sequence is definitely in a different cacheline from each of the occurrences it replaces.

Has this tradeoff been analyzed?

Very interesting observation. This only would matter for performance on very small inputs, and this function is only called from memchr_long.

10% perf improvement from re-arranging SHA512 daisy chain tree which enables using second half of result, sparing horizontal sum instruction.

facebook-github-bot · 2025-05-19T19:37:56Z

@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Added AOR memchr implementation, changed to use SHA512 for faster cha…

a425170

…racter search on ARM devices that support SHA512 instruction

facebook-github-bot added the CLA Signed label Apr 1, 2025

izard mentioned this pull request Apr 1, 2025

Replaces 2x string.find(..) with custom one using optimized folly:memchr facebookincubator/velox#12879

Draft

embg reviewed Apr 1, 2025

View reviewed changes

izard commented Apr 1, 2025

View reviewed changes

Cut&paste error fixed, AVX2 asm implementation of memchr is not avail…

ab8f0e7

…able in this PR

Changed the PR adding a new function folly::memchr_long in folly whic…

4091ac8

…h will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86

izard added a commit to izard/velox that referenced this pull request Apr 2, 2025

Changed the PR to explicitly use new folly::memchr_long, introduced i…

48d5135

…n revised facebook/folly#2409

embg reviewed Apr 2, 2025

View reviewed changes

yfeldblum reviewed Apr 2, 2025

View reviewed changes

izard added 8 commits April 3, 2025 13:21

Resolving code review comments, minor refactoring.

e33869b

Changed ARM8.2A to ARM8A requirement based on code review conversation

ed5d704

Minor code cleanup after review

1e2c894

Removed ifunc, added compile-time dispatch for AARCH64 optimized memc…

69f54ee

…hr versions as discussed in the PR comments

Tiny optimization for small string/not found case, saving few CPU cycles

f626acc

Resolved coding style review comments

ca394e6

Resolved coding style review comments

b4c51cd

Using folly prevPow2 as suggested in code review

be66e49

embg reviewed Apr 8, 2025

View reviewed changes

embg changed the title ~~Added AOR memchr implementation that uses SHA512 optimization~~ Added AOR memchr implementation that uses instructions from aarch64's SHA3 ISA extension Apr 8, 2025

yfeldblum reviewed Apr 8, 2025

View reviewed changes

izard and others added 6 commits April 9, 2025 08:37

Addressed further code review comments in MemchrBenchmark.cpp

a62b779

Addressed last comment from code review.

358fd08

SHA512 instruction is part of SHA2, not SHA3

5bd669d

Fixed a linter error from code review

112fe02

Fixed Linux compilation bug, moved } out of #ifdef

5e534c9

Replaced sha3 with sha2 in cmake to reflect source change

18fb83a

yfeldblum reviewed Apr 9, 2025

View reviewed changes

izard changed the title ~~Added AOR memchr implementation that uses instructions from aarch64's SHA3 ISA extension~~ Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension Apr 10, 2025

yfeldblum reviewed May 12, 2025

View reviewed changes

10% perf improvement from re-arranging SHA512

7492140

10% perf improvement from re-arranging SHA512 daisy chain tree which enables using second half of result, sparing horizontal sum instruction.


		#include <cstring>

		#if !defined(__AVX2__) && !(defined(__linux__) && defined(__aarch64__))

Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension #2409

Are you sure you want to change the base?

Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension #2409

Conversation

izard commented Apr 1, 2025

[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

facebook-github-bot commented Apr 1, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

embg commented Apr 1, 2025 • edited Loading

izard commented Apr 2, 2025

embg commented Apr 2, 2025

embg commented Apr 2, 2025 • edited Loading

izard commented Apr 2, 2025 • edited Loading

izard commented Apr 2, 2025

============================================================================ [...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yfeldblum left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

embg commented Apr 7, 2025

facebook-github-bot commented Apr 7, 2025

embg commented Apr 8, 2025 • edited Loading

embg Apr 8, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

embg commented Apr 9, 2025 • edited Loading

embg commented Apr 9, 2025

izard commented Apr 9, 2025 • edited Loading

facebook-github-bot commented Apr 15, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facebook-github-bot commented May 19, 2025

embg commented Apr 1, 2025 •

edited

Loading

embg commented Apr 2, 2025 •

edited

Loading

izard commented Apr 2, 2025 •

edited

Loading

============================================================================
[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s

embg commented Apr 8, 2025 •

edited

Loading

embg Apr 8, 2025 •

edited

Loading

embg commented Apr 9, 2025 •

edited

Loading

izard commented Apr 9, 2025 •

edited

Loading