-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Added AOR memchr implementation that uses instructions from aarch64's SHA2 ISA extension #2409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…racter search on ARM devices that support SHA512 instruction
… implementation based on optimized memchr from folly. This provides performance boost to two Like subbenchmarks. Depends on Folly PR: facebook/folly#2409
@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
folly/FollyMemchr.cpp
Outdated
|
||
#include <cstring> | ||
|
||
#if !defined(__AVX2__) && !(defined(__linux__) && defined(__aarch64__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an AVX2 implementation somewhere? If not, why are we disabling the fallback when defined(__AVX2__)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cut&paste error, will remove the AVX fallback. I can no longer work on a better AVX2 implementation :)
folly/memchr_select_aarch64.cpp
Outdated
if (size < 128) { | ||
// Libc AARCH64 memchr is more efficient for short strings. | ||
// This dispatch is less efficient than including it as a case for small strings, | ||
// but can't do that because it of GPL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to fix comment typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait a second. are you saying that we can't patch AOR because of GPL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fastest code would be if I cut small string handling from GPLed GLIBC and paste into Apache AOR.
But as I can't copy GPLed source code into Apache source code, I have to call it explicitly for small strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me talk to some folks internally and figure out how we move forward here.
The regression for strings up to 32 bytes is pretty concerning. Are you confident it's real, not noise? I don't think we can merge this with the regression, because it will cause a large fraction of Can we patch the AOR implementations to get rid of this regression? |
It is certainly not noise, libc memcpy takes ~5 CPU cycles for short strings, and proposed new folly memcpy takes ~6 CPU cycles, which first seems like a huge 20% performance regression. However, using execution latency difference of 1 CPU cycle for performance analysis is misleading on OOO core. The real code sequence containing short string search will have few hundreds of u-ops in flight with OOO core utilizing ILP, and this extra CPU cycle might stay, might be masked or even multiplied with some probability, but the performance difference will be still within ~1%. To back this up, I changed the benchmark in folly to interleave memchr with short memcpy calls, and average performance regression on small strings went down to 4%, and when I added second memchr it went down to 1%, while large strings still shown ~2.2x performance uplift.
Are there any benchmarks in folly and/or velox that I can try that could demonstrate the effect on smaller strings? So far I ran all velox benchmarks with this change and I only see ~20% and ~25% improvement on two Like sub-benchmarks that were lagging on ARM, and no other difference.
Technically yes, but I don't like pasting GPL code into Apache code, even if it is just a sequence of 10 assembly instructions. |
So I discussed this with the cabal internally. We are really worried about regressing the short string case, because that's a really large chunk of We propose adding a new function Is this acceptable from your side? |
Long term, we would urge Nvidia / Arm to upstream the long string optimizations from AOR into glibc. |
Thank would work, I'll change the PRs.
Makes sense! |
…h will call __folly_memchr_aarch64_long on aarch64 and std::memchr on x86
Changed the Folly and Velox respective PRs based on the conversation above: now folly::memchr_long is the recommended way to search for a character in a string equal or longer than 128 bytes. As there is no dispatch one string length any more, perf difference for 1-32 bytes strings worsened: ============================================================================
|
CMakeLists.txt
Outdated
set_property( | ||
SOURCE | ||
${AOR_FILE} | ||
APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp" | ||
APPEND PROPERTY COMPILE_OPTIONS "-x" "assembler-with-cpp" "-march=armv8.2-a+crypto+sve2-sha3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean older Arm chips will not benefit from AOR implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8.2A is overkill, it is safe to change to 8A. I am not aware about older ARM CPUs with AARCH64 support than 8 family.
@@ -0,0 +1,29 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may want a different file path. The other headers under folly/
don't follow the pattern Folly[Name].h
.
@yfeldblum can you advise on the right path for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h
. Reorganization later might be appropriate.
Alternatively, we can place this within folly/lang/CString.h
. That would be a suitable place for ports and variants of things which libc string.h
ships (like this large-buffer variant of memchr
and the existing port of BSD's strlcpy
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can change the path once decided. This comes from copying FollyMemset.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man, we shouldn't have merged that :) might be too late to change now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfeldblum I vote for folly/lang/CString.h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg We can always change where eg memset lives if we want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR overall. Requesting some tweaks around the edges.
folly/external/aor/memchr-advsimd.S
Outdated
* devices with h/w SHA512 instruction support. MTE compatibility TBD | ||
*/ | ||
|
||
ENTRY (__folly_memchr_aarch64_simd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include a description of the variant in the name - this is a long-string variant - __folly_memchr_long_aarch64_simd
.
@@ -0,0 +1,29 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg Thanks for flagging. For now this is alright initially, and aligns with folly/FollyMemcpy.h
. Reorganization later might be appropriate.
Alternatively, we can place this within folly/lang/CString.h
. That would be a suitable place for ports and variants of things which libc string.h
ships (like this large-buffer variant of memchr
and the existing port of BSD's strlcpy
).
folly/memchr_select_aarch64.cpp
Outdated
return __folly_memchr_aarch64; | ||
} | ||
|
||
[[gnu::ifunc("__folly_detail_memchr_long_resolve")]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for @embg - the PR would need an edit internally to hook up the Buck build rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait wait wait, why do we need ifunc magic for this PR? I don't see how memchr_long
is different than any other simd kernel, like a math kernel, where we don't use ifunc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, ifunc magic is no longer required as memchr is no longer redefined, I'll fix tomorrow my morning time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually after second thought and reading your comment regarding older ARMs, this runtime dispatch still selects between SHA512 and generic AOR version, where SHA512 is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be implemented using ifunc. It can be either compile-time dispatch (using macro guards) or runtime dispatch (using cpuid).
ifunc is solving a very specific problem where other translation units call memset
and we want to replace that with folly memset.
In this case you have the ability to write a wrapper which dispatches correctly, so ifunc isn't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally we do compile with sha3
enabled, so compile-time is fine for us. If runtime dispatch isn't completely trivial to implement, I think compile-time dispatch is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed ifunc, added compile time dispatch.
folly/memchr_select_aarch64.cpp
Outdated
decltype(&__folly_memchr_aarch64) __folly_detail_memchr_long_resolve( | ||
uint64_t hwcaps, const void* arg2) { | ||
#if defined(_IFUNC_ARG_HWCAP) | ||
if (hwcaps & HWCAP_SHA3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for @embg - the PR would need a pass of clang-format
(via arc f
internally).
…hr versions as discussed in the PR comments
I will take another pass looking at the new changes tomorrow, if it looks good I will pull this in and work on merging. |
@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There's a very simple issue which was caught by internal CI. Should be trivial to fix.
Edit: I got someone to approve the workflows on this PR and they are now failing in OSS CI as well (see below). |
|
||
extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) { | ||
#ifdef FOLLY_ARM_FEATURE_SHA3 | ||
return __folly_memchr_long_aarch64_sha512(ptr, ch, count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHA-512 usually refers to a variant of SHA-2. You can ctrl-f for "SHA-512" on this page to see what I mean: https://en.wikipedia.org/wiki/SHA-2
That's a bit confusing because we are talking about Arm's SHA-3 instructions, not SHA-2 instructions.
Since you have to fix the CI failure anyway, please change all SHA-512 references to SHA-3. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg I notice that the implementation uses the sha512h
instruction and the FEAT_SHA512
feature-set. I would have assumed that this actually refers to sha512 as vs sha3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I’m confused why we are gating with FOLLY_ARM_FEATURE_SHA3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to check for FOLLY_ARM_FEATURE_SHA3. Originally I experimented with BCAX and SHA1C which are part of SHA3, and later switched to SHA512 which is most efficient for the cause, but forgot to change the feature set. All ARMs I know that have SHA3 also have SHA2, but I am not sure if older Arms with SHA2 only support would have the 2/1 latency/throughput retiring for this instruction. I mean it will work on older Arms, but probably not as fast.
@@ -0,0 +1,29 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg We can always change where eg memset lives if we want to.
|
||
extern "C" void* memchr_long(const void* ptr, int ch, std::size_t count) { | ||
#ifdef FOLLY_ARM_FEATURE_SHA3 | ||
return __folly_memchr_long_aarch64_sha512(ptr, ch, count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embg I notice that the implementation uses the sha512h
instruction and the FEAT_SHA512
feature-set. I would have assumed that this actually refers to sha512 as vs sha3?
volatile static uint64_t result_offset = 0; | ||
static FOLLY_NOINLINE void* std_memchr(const void *s, int c, size_t l) { | ||
result_offset = (uint64_t)std::memchr((void *)s, c, l)-(uint64_t)s; | ||
return (void*)((uint64_t)s + result_offset); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally use folly::compiler_must_not_predict
and folly::compiler_must_not_elide
for this purpose. Also wrapped as folly::doNotPredict
and folly::doNotElide
in the benchmark library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarification note: instead of volatile
I think this is very close to being ready, I am going to pull it in and will handle any remaining nits myself. |
@izard Last request I have from you. Could you take a look at why CMake is failing?
Example failing job: https://github.com/facebook/folly/actions/runs/14361949326/job/40268841547?pr=2409 I see that you did touch CMakeLists.txt. Is it possible that your change there broke CMake somehow? |
I don't have Windows to test, only Mac and Linux. But in both my envs I just manually installed Boost to satisfy this requirement. I did not add the Boost requirement, my only changes to Cmake are adding memchr test, benchmark, and source files; and changing asm compiler flags, this should not affect boost. So I think master version of folly won't compile in your environment either. |
@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
L(zero_length): | ||
mov result, #0 | ||
ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting this 2-instruction / 8-byte sequence, and replacing each of 6 occurrences with a 1-instruction / 4-byte sequence, reduces the size of the function by 4 instructions / 16 bytes overall.
However, this 2-instruction sequence is definitely in a different cacheline from each of the occurrences it replaces.
Has this tradeoff been analyzed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting observation. This only would matter for performance on very small inputs, and this function is only called from memchr_long.
10% perf improvement from re-arranging SHA512 daisy chain tree which enables using second half of result, sparing horizontal sum instruction.
@embg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There are already external/aor/memset and memcpy available.
This PR adds AOR memchr, patched to use SHA512 when available for ~2x performance boost on large strings character search.
This helps some velox benchmarks and presumably workloads. There is a cost of lower performance on very short strings.
build/folly/memchr_bench
[...]/folly/folly/test/MemchrBenchmark.cpp relative time/iter iters/s
std_memchr: size=1 1.51ns 662.53M
std_memchr: size=2 1.51ns 662.53M
std_memchr: size=4 1.51ns 662.32M
std_memchr: size=8 1.51ns 662.32M
std_memchr: size=16 1.51ns 662.53M
std_memchr: size=32 1.65ns 605.18M
std_memchr: size=64 2.13ns 470.25M
std_memchr: size=128 3.24ns 308.85M
std_memchr: size=256 5.87ns 170.31M
std_memchr: size=512 11.77ns 84.95M
std_memchr: size=1024 23.67ns 42.25M
std_memchr: size=2048 48.25ns 20.72M
std_memchr: size=4096 97.41ns 10.27M
std_memchr: size=8192 195.94ns 5.10M
std_memchr: size=16384 399.43ns 2.50M
std_memchr: size=32768 794.17ns 1.26M
folly_memchr: size=1 1.80ns 555.14M
folly_memchr: size=2 1.80ns 555.14M
folly_memchr: size=4 1.80ns 555.14M
folly_memchr: size=8 1.80ns 555.14M
folly_memchr: size=16 1.80ns 555.14M
folly_memchr: size=32 2.10ns 475.49M
folly_memchr: size=64 2.41ns 415.15M
folly_memchr: size=128 3.32ns 301.45M
folly_memchr: size=256 4.22ns 236.82M
folly_memchr: size=512 6.93ns 144.27M
folly_memchr: size=1024 12.06ns 82.89M
folly_memchr: size=2048 22.83ns 43.81M
folly_memchr: size=4096 44.89ns 22.28M
folly_memchr: size=8192 87.97ns 11.37M
folly_memchr: size=16384 174.78ns 5.72M
folly_memchr: size=32768 352.62ns 2.84M