
add aarch64 SIMD implementation of Teddy #129

Merged

merged 7 commits from ag/aarch64 into master on Sep 18, 2023

Conversation

@BurntSushi (Owner) commented Sep 17, 2023

Up until this point, Teddy was explicitly written using x86-64 SIMD routines, specifically ones from SSSE3 and AVX2. This PR shuffles Teddy's main implementation into code that is generic over a new Vector trait, and provides implementations of that Vector trait for x86-64's __m128i and __m256i, in addition to aarch64's uint8x16_t vector type. In effect, this greatly speeds up searches for a small number of patterns automatically on aarch64 (i.e., on Apple's new M1 and M2 chips).
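To make the abstraction concrete, here is a toy model of the Vector-trait idea, runnable anywhere. The method names are illustrative, not the crate's actual API; a u64 treated as 8 byte lanes (SWAR-style) stands in for a real SIMD register such as __m128i or uint8x16_t.

```rust
// Toy Vector trait in the spirit of the PR's abstraction.
// Names are illustrative; the real trait has more operations.
trait Vector: Copy {
    const BYTES: usize;
    fn splat(byte: u8) -> Self;
    fn load(data: &[u8]) -> Self;
    fn cmpeq(self, other: Self) -> Self;
    fn movemask(self) -> u64;
}

// Scalar stand-in: u64 as 8 lanes of u8, so the sketch runs on any arch.
impl Vector for u64 {
    const BYTES: usize = 8;

    fn splat(byte: u8) -> u64 {
        u64::from_ne_bytes([byte; 8])
    }

    fn load(data: &[u8]) -> u64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&data[..8]);
        u64::from_ne_bytes(buf)
    }

    fn cmpeq(self, other: u64) -> u64 {
        // byte-wise equality: a lane becomes 0xFF on match, 0x00 otherwise
        let (a, b) = (self.to_ne_bytes(), other.to_ne_bytes());
        let mut out = [0u8; 8];
        for i in 0..8 {
            out[i] = if a[i] == b[i] { 0xFF } else { 0x00 };
        }
        u64::from_ne_bytes(out)
    }

    fn movemask(self) -> u64 {
        // collect the high bit of each lane into a bitmask
        let bytes = self.to_ne_bytes();
        let mut mask = 0u64;
        for (i, &b) in bytes.iter().enumerate() {
            if b & 0x80 != 0 {
                mask |= 1 << i;
            }
        }
        mask
    }
}

fn main() {
    let chunk = u64::load(b"xxSherlo");
    let eq = chunk.cmpeq(u64::splat(b'S'));
    let mask = eq.movemask();
    // 'S' sits in lane 2 of the chunk
    assert_eq!(mask.trailing_zeros(), 2);
    println!("first candidate at lane {}", mask.trailing_zeros());
}
```

The search core is then written once against the trait, and each architecture only has to supply these primitive operations.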

An ad hoc ripgrep benchmark is worth a thousand words. On my M2 mac mini:

$ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    8.196
user    7.726
sys     0.469
maxmem  5728 MB
faults  17

$ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    1.127
user    0.701
sys     0.425
maxmem  4880 MB
faults  13

This PR also drops criterion in favor of rebar for benchmarking, which is specialized to the task of regex/substring searching. In that vein, we can look at top-level AhoCorasick benchmarks before and after:

benchmark                      rust/old-aho-corasick/default/leftmost-first  rust/aho-corasick/default/leftmost-first
---------                      --------------------------------------------  ----------------------------------------
curated/sherlock-en            23.7 GB/s (1.00x)                             23.8 GB/s (1.00x)
curated/sherlock-casei-en      659.7 MB/s (12.21x)                           7.9 GB/s (1.00x)
curated/sherlock-ru            24.3 GB/s (1.00x)                             24.3 GB/s (1.00x)
curated/sherlock-casei-ru      4.5 GB/s (1.43x)                              6.5 GB/s (1.00x)
curated/sherlock-zh            28.9 GB/s (1.00x)                             28.9 GB/s (1.00x)
curated/alt-sherlock-en        659.7 MB/s (12.05x)                           7.8 GB/s (1.00x)
curated/alt-sherlock-casei-en  640.0 MB/s (4.36x)                            2.7 GB/s (1.00x)
curated/alt-sherlock-ru        659.8 MB/s (7.92x)                            5.1 GB/s (1.00x)
curated/alt-sherlock-casei-ru  604.0 MB/s (2.48x)                            1497.8 MB/s (1.00x)
curated/alt-sherlock-zh        663.1 MB/s (13.47x)                           8.7 GB/s (1.00x)
curated/dictionary-15          172.7 MB/s (1.05x)                            181.1 MB/s (1.00x)
sherlock/name-alt1             29.4 GB/s (1.01x)                             29.8 GB/s (1.00x)
sherlock/name-alt2             10.7 GB/s (1.00x)                             9.8 GB/s (1.08x)
sherlock/name-alt3             652.7 MB/s (10.47x)                           6.7 GB/s (1.00x)
sherlock/name-alt4             11.1 GB/s (1.00x)                             10.8 GB/s (1.03x)
sherlock/name-alt5             6.9 GB/s (1.01x)                              7.0 GB/s (1.00x)
sherlock/name-alt6             83.1 GB/s (1.00x)                             83.1 GB/s (1.00x)
sherlock/name-alt7             46.3 GB/s (1.00x)                             46.3 GB/s (1.00x)
sherlock/name-nocase1          637.2 MB/s (2.55x)                            1623.7 MB/s (1.00x)
sherlock/name-nocase2          643.7 MB/s (9.10x)                            5.7 GB/s (1.00x)
sherlock/name-nocase3          641.9 MB/s (3.21x)                            2.0 GB/s (1.00x)
sherlock/words5000             145.8 MB/s (1.01x)                            147.7 MB/s (1.00x)

Basically, there are 2-10x improvements across the board. These primarily apply to throughput where you expect matches to occur relatively rarely with respect to the size of the haystack.

For x86-64, there might be some small latency improvements, and there were a few tweaks to the various prefilter heuristics used. But one should generally expect comparable performance to what came before this PR. If you notice any meaningful regressions, please open a new issue with enough detail for me to reproduce the problem.

This PR also makes it possible for Teddy to be pretty easily ported to other vector types as well. I took a look at wasm and it's not obvious that it has the right routines to make it work, but I probably spent all of 10 minutes doing a quick skim. I'm not a wasm expert, so if anyone has a good handle on wasm32 SIMD, you might try your hand at implementing the Vector trait. (If you need help, please open an issue.)

(I do hope to get a new ripgrep release out soon with this improvement and an analogous improvement to aarch64 in the memchr crate.)

I botched the memchr 2.6 MSRV because it actually requires Rust 1.61
and not Rust 1.60. This crate's MSRV is Rust 1.60, so pin memchr to a
version that works on Rust 1.60 (for x86-64 at least).

Ref rust-lang/regex#1081
A subsequent commit will remove the Criterion benchmarks.

This essentially makes benchmarking homogeneous now between the regex,
aho-corasick and memchr crates.

For now, we start with the benchmarks that were defined for Criterion;
we will refine these over time. We also include a naive multi-substring
algorithm and the daachorse crate as additional engines.
We have switched to rebar.

We're going to replace it with a Vector trait that can be implemented
for multiple vector types. Similar to what was done with memchr[1].

Before doing that, we move the old stuff aside.

[1]: https://github.com/BurntSushi/memchr/blob/f6188dbbca2b529100852e4509d48d2d002a674a/src/vector.rs
@BurntSushi force-pushed the ag/aarch64 branch 2 times, most recently from 1c71ca6 to 7134d5c on September 18, 2023 at 00:10
While this does technically rewrite Teddy, we don't really do any core
changes to how it previously worked. We mostly just shuffle and
re-organize code so that it's written to use a generic vector type
instead of explicitly specialized to __m128i and __m256i. We also use
this opportunity to introduce a sprinkling of const generics, which
helps reduce code duplication even more.
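As a hypothetical illustration of the kind of deduplication const generics enable (the type and method names below are made up, not the crate's actual items), one generic definition can replace what would otherwise be per-variant copies. This sketch also hints at Teddy's nybble-indexed bucket masks:

```rust
// One generic definition instead of separate 2-bucket/8-bucket copies.
// `BUCKETS` is the Teddy bucket count; names are illustrative only.
struct Mask<const BUCKETS: usize> {
    // one bitset per low-nybble value; bit b set means "bucket b may match"
    lo: [u8; 16],
}

impl<const BUCKETS: usize> Mask<BUCKETS> {
    fn new() -> Self {
        assert!(BUCKETS <= 8, "a u8 bitset holds at most 8 buckets");
        Mask { lo: [0; 16] }
    }

    fn add(&mut self, bucket: usize, byte: u8) {
        assert!(bucket < BUCKETS);
        self.lo[(byte & 0xF) as usize] |= 1 << bucket;
    }

    fn candidates(&self, byte: u8) -> u8 {
        self.lo[(byte & 0xF) as usize]
    }
}

fn main() {
    // The same definition works for any bucket count; previously each
    // variant might have been written out separately.
    let mut m: Mask<2> = Mask::new();
    m.add(0, b'S');
    m.add(1, b's');
    // 'S' (0x53) and 's' (0x73) share the low nybble 0x3, so both
    // buckets are candidates for either byte.
    assert_eq!(m.candidates(b'S'), 0b11);
    println!("candidate buckets for b'S': {:#04b}", m.candidates(b'S'));
}
```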

We also switch from an enum for dispatching between Teddy variants to
dynamic dispatch via a trait. Benchmarks suggest there really isn't any
meaningful difference here, and I kind of prefer the dynamic dispatch
route for difficult to explain reasons. But I might waffle on this.
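The enum-to-trait-object switch might look roughly like this (a minimal sketch with made-up names; the real variants are the SSSE3/AVX2/NEON Teddy implementations, not the trivial scanners used here):

```rust
// Illustrative sketch: pick an implementation once at construction time
// behind a trait object, instead of matching on an enum at every search.
trait TeddySearcher {
    fn find(&self, haystack: &[u8]) -> Option<usize>;
}

struct Slim; // stand-in for, say, a 128-bit Teddy variant
struct Fat;  // stand-in for a 256-bit variant

impl TeddySearcher for Slim {
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == b'S')
    }
}

impl TeddySearcher for Fat {
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == b'S')
    }
}

fn build(use_wide: bool) -> Box<dyn TeddySearcher> {
    // in the real crate the choice depends on CPU features and pattern
    // characteristics; a boolean stands in for that decision here
    if use_wide { Box::new(Fat) } else { Box::new(Slim) }
}

fn main() {
    let searcher = build(true);
    assert_eq!(searcher.find(b"xxSherlock"), Some(2));
    println!("found at {:?}", searcher.find(b"xxSherlock"));
}
```

The virtual call happens once per search rather than once per byte, which is one reason the dispatch strategy tends not to show up in benchmarks.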

And of course, the point of the exercise: we introduce an
implementation of the Vector trait for `uint8x16_t` on `aarch64`. Kudos
to the sse2neon[1] project for making that port much faster than it
would have been otherwise.

[1]: https://github.com/DLTcollab/sse2neon
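For a sense of what sse2neon-style mapping involves: one operation Teddy relies on is x86's `_mm_alignr_epi8` (palignr), which stitches adjacent 16-byte blocks together; on NEON the corresponding instruction is `vextq_u8`. A portable model of the operation's semantics (plain arrays instead of real vector registers):

```rust
// Portable model of palignr: concatenate b (low half) and a (high half)
// into 32 bytes, then take the 16-byte window starting at `shift`.
// On x86-64 this is _mm_alignr_epi8(a, b, shift); on NEON, vextq_u8(b, a, shift).
fn palignr(a: [u8; 16], b: [u8; 16], shift: usize) -> [u8; 16] {
    assert!(shift <= 16);
    let mut concat = [0u8; 32];
    concat[..16].copy_from_slice(&b);
    concat[16..].copy_from_slice(&a);
    let mut out = [0u8; 16];
    out.copy_from_slice(&concat[shift..shift + 16]);
    out
}

fn main() {
    let prev = *b"ABCDEFGHIJKLMNOP"; // previous 16-byte block
    let cur = *b"QRSTUVWXYZ012345"; // current 16-byte block
    // shift of 15 yields the window straddling the block boundary:
    // the last byte of `prev` followed by the first 15 bytes of `cur`.
    let window = palignr(cur, prev, 15);
    assert_eq!(&window, b"PQRSTUVWXYZ01234");
    println!("{}", String::from_utf8_lossy(&window));
}
```

This is what lets Teddy check fingerprints that span the boundary between one vector's worth of haystack and the next.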
@itamarst commented:
Out of curiosity, why not use something like the wide crate?

@BurntSushi (Owner, Author) commented Sep 18, 2023

@itamarst a few reasons:

  1. I don't take dependencies lightly. It has to be extremely compelling. (My standard here is very intentionally impossibly high for core crates like aho-corasick.)
  2. From a quick glance, the wide crate does not provide the operations necessary for Teddy. The critical ops are shuffles and palignr.
  3. The wide crate works via safe_arch and that in turn only uses compile-time knowledge of what SIMD instructions are available. This is critically inappropriate in pretty much all cases except for when you own the compile step for all your users. It would mean, for example, that most users of ripgrep wouldn't benefit from SIMD optimizations such as Teddy. In this PR (and before), SIMD support is detected at runtime, regardless of what options you compile the crate with. (Now technically, NEON is part of aarch64, so safe_arch would be appropriate in that specific case, but this doesn't mitigate (1) and (2) above. And since (3) applies to x86-64, there's no real benefit to using wide even if this was the only concern.)
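The runtime-detection approach in point (3) can be sketched as follows (the tier labels are illustrative; the crate's actual dispatch considers more than CPU features):

```rust
// Sketch of runtime (not compile-time) SIMD selection: a binary built
// for baseline x86-64 still uses AVX2 on CPUs that have it, because the
// check happens when the searcher is constructed, not when it's compiled.
fn choose_impl() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("ssse3") {
            return "ssse3";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is part of the aarch64 baseline, so no runtime check is
        // needed there -- the case where safe_arch would have sufficed.
        return "neon";
    }
    "scalar fallback"
}

fn main() {
    println!("selected: {}", choose_impl());
}
```

Compile-time selection (the safe_arch model) would instead bake the lowest common denominator into the binary unless the builder opts in with `-C target-feature` or similar flags.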

@BurntSushi BurntSushi merged commit 0be6fe4 into master Sep 18, 2023
12 checks passed
@BurntSushi BurntSushi deleted the ag/aarch64 branch September 18, 2023 13:20
BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request Sep 18, 2023
This brings in aarch64 SIMD support for Teddy[1]. In effect, searches
from which multiple (but a small number of) literals are extracted will
likely get much faster on aarch64 (i.e., Apple silicon). For
example, from the PR, on my M2 mac mini:

    $ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055

    real    8.196
    user    7.726
    sys     0.469
    maxmem  5728 MB
    faults  17

    $ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055

    real    1.127
    user    0.701
    sys     0.425
    maxmem  4880 MB
    faults  13

w00t.

[1]: BurntSushi/aho-corasick#129
ink-splatters pushed a commit to ink-splatters/ripgrep that referenced this pull request Oct 25, 2023