add aarch64 SIMD implementation of Teddy #129
Merged
Conversation
I botched the memchr 2.6 MSRV because it actually requires Rust 1.61 and not Rust 1.60. This crate's MSRV is Rust 1.60, so pin memchr to a version that works on Rust 1.60 (for x86-64 at least). Ref rust-lang/regex#1081
A subsequent commit will remove the Criterion benchmarks. This essentially makes benchmarking homogeneous across the regex, aho-corasick and memchr crates. For now, we start with the benchmarks that were defined for Criterion and will refine them over time. We also include a naive multi-substring algorithm and the daachorse crate.
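As a point of reference for what "naive" means here, the baseline is simply trying every needle at every haystack position. A minimal, purely illustrative sketch (this is not the actual benchmark harness code):

```rust
// Illustrative only: the simplest possible multi-substring search, used as
// a mental model for the "naive" baseline. Returns (pattern index, offset)
// for every occurrence of every needle, including overlapping ones.
fn naive_find_all(haystack: &[u8], needles: &[&[u8]]) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for start in 0..haystack.len() {
        for (pid, needle) in needles.iter().enumerate() {
            if haystack[start..].starts_with(needle) {
                matches.push((pid, start));
            }
        }
    }
    matches
}
```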
We have switched to rebar.
We're going to replace it with a Vector trait that can be implemented for multiple vector types. Similar to what was done with memchr[1]. Before doing that, we move the old stuff aside. [1]: https://github.com/BurntSushi/memchr/blob/f6188dbbca2b529100852e4509d48d2d002a674a/src/vector.rs
BurntSushi force-pushed the ag/aarch64 branch 2 times, most recently from 1c71ca6 to 7134d5c on September 18, 2023 at 00:10
While this does technically rewrite Teddy, we don't really do any core changes to how it previously worked. We mostly just shuffle and re-organize code so that it's written to use a generic vector type instead of explicitly specialized to __m128i and __m256i. We also use this opportunity to introduce a sprinkling of const generics, which helps reduce code duplication even more. We also switch from an enum for dispatching between Teddy variants to dynamic dispatch via a trait. Benchmarks suggest there really isn't any meaningful difference here, and I kind of prefer the dynamic dispatch route for difficult to explain reasons. But I might waffle on this. And of course, the point of the exercise, we introduce an implementation of the Vector trait for `u8x16_t` on `aarch64`. Kudos to the sse2neon[1] project for making that port much faster than it would have been. [1]: https://github.com/DLTcollab/sse2neon
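To make the shape of that refactor concrete, here is a rough sketch of what a generic `Vector` trait with an aarch64 NEON implementation can look like. This is an illustrative assumption, not the crate's actual internal API: the trait and method names are invented here (loosely modeled on memchr's `src/vector.rs` referenced earlier), and the standard library's NEON vector type is `uint8x16_t` in `core::arch::aarch64`.

```rust
// Hypothetical sketch only: a 128-bit vector abstraction of the kind the
// commit message describes, implemented for aarch64 NEON. Method names and
// the trait itself are illustrative, not aho-corasick's real internals.
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::*;

trait Vector: Copy {
    /// The number of bytes (lanes) in one vector.
    const BYTES: usize;

    /// Load a vector from `ptr`, which must be valid for `Self::BYTES` bytes.
    unsafe fn load_unaligned(ptr: *const u8) -> Self;
    /// Broadcast a single byte into every lane.
    unsafe fn splat(byte: u8) -> Self;
    /// Lane-wise bitwise AND.
    unsafe fn and(self, other: Self) -> Self;
    /// Use each byte of `self` as an index into `table` (the pshufb/tbl
    /// style lookup that Teddy's nibble fingerprints are built on).
    unsafe fn shuffle_bytes(self, table: Self) -> Self;
    /// Returns true if any lane is non-zero.
    unsafe fn any_nonzero(self) -> bool;
}

#[cfg(target_arch = "aarch64")]
impl Vector for uint8x16_t {
    const BYTES: usize = 16;

    #[inline(always)]
    unsafe fn load_unaligned(ptr: *const u8) -> Self {
        vld1q_u8(ptr)
    }

    #[inline(always)]
    unsafe fn splat(byte: u8) -> Self {
        vdupq_n_u8(byte)
    }

    #[inline(always)]
    unsafe fn and(self, other: Self) -> Self {
        vandq_u8(self, other)
    }

    #[inline(always)]
    unsafe fn shuffle_bytes(self, table: Self) -> Self {
        // vqtbl1q_u8 indexes `table` by each byte of `self` and produces 0
        // for out-of-range indices, which is close enough to x86's pshufb
        // for nibble-indexed (0..=15) lookups.
        vqtbl1q_u8(table, self)
    }

    #[inline(always)]
    unsafe fn any_nonzero(self) -> bool {
        // Horizontal max over all 16 lanes: non-zero iff any lane is set.
        vmaxvq_u8(self) != 0
    }
}
```

Roughly speaking, Teddy's hot loop then splits each haystack byte into its low and high nibbles, runs that shuffle against precomputed fingerprint tables, and ANDs the results to find candidate positions, so the same generic code can drive `__m128i`, `__m256i`, or NEON.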
BurntSushi force-pushed the ag/aarch64 branch from 7134d5c to 6e1aeab on September 18, 2023 at 00:20
Out of curiosity, why not use something like the

@itamarst a few reasons:
BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request on Sep 18, 2023
This brings in aarch64 SIMD support for Teddy[1]. In effect, it means searches where multiple (but a small number of) literals are extracted will likely get much faster on aarch64 (i.e., Apple silicon). For example, from the PR, on my M2 mac mini:

    $ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055
    real    8.196
    user    7.726
    sys     0.469
    maxmem  5728 MB
    faults  17

    $ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055
    real    1.127
    user    0.701
    sys     0.425
    maxmem  4880 MB
    faults  13

w00t.

[1]: BurntSushi/aho-corasick#129
ink-splatters pushed a commit to ink-splatters/ripgrep that referenced this pull request on Oct 25, 2023
Up until this point, Teddy was explicitly written using `x86-64` SIMD routines. Specifically, ones from SSSE3 and AVX2. This PR shuffles Teddy's main implementation into code that is generic over a new `Vector` trait, and provides implementations of that `Vector` trait for `x86-64`'s `__m128i` and `__m256i`, in addition to `aarch64`'s `u8x16_t` vector type. In effect, this greatly speeds up searches for a small number of patterns automatically on `aarch64` (i.e., on Apple's new M1 and M2 chips).

An ad hoc ripgrep benchmark is worth a thousand words. On my M2 mac mini:

This PR also drops Criterion in favor of rebar for benchmarking, which is specialized to the task of regex/substring searching. In that vein, we can look at top-level `AhoCorasick` benchmarks before and after:

Basically, there are 2-10x improvements across the board. These primarily apply to throughput, where you expect matches to occur relatively rarely with respect to the size of the haystack.

For `x86_64`, there might be some small latency improvements, and there were a few tweaks to the various prefilter heuristics used. But one should generally expect performance comparable to what came before this PR. If you notice any meaningful regressions, please open a new issue with enough detail for me to reproduce the problem.

This PR also makes it possible for Teddy to be pretty easily ported to other vector types as well. I took a look at wasm and it's not obvious that it has the right routines to make it work, but I probably spent all of 10 minutes doing a quick skim. I'm not a wasm expert, so if anyone has a good handle on wasm32 SIMD, you might try your hand at implementing the `Vector` trait. (If you need help, please open an issue.)

(I do hope to get a new ripgrep release out soon with this improvement and an analogous improvement to `aarch64` in the `memchr` crate.)
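For users of the crate, nothing about the public API changes; multi-pattern searches simply get faster on aarch64 when the literal prefilter (Teddy) kicks in. A small usage sketch (the patterns and haystack here are made up for illustration):

```rust
use aho_corasick::AhoCorasick;

fn main() {
    // Searching for a handful of literal patterns is exactly the case that
    // can be served by the Teddy prefilter, now accelerated on aarch64 too.
    let patterns = &["Sherlock", "Holmes", "Watson"];
    let haystack = "Sherlock Holmes consults Dr. Watson.";

    let ac = AhoCorasick::new(patterns).unwrap();
    let found: Vec<&str> = ac
        .find_iter(haystack)
        .map(|m| patterns[m.pattern().as_usize()])
        .collect();
    assert_eq!(found, vec!["Sherlock", "Holmes", "Watson"]);
}
```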