
add aarch64 SIMD implementation of Teddy #129

Merged

merged 7 commits from ag/aarch64 into master on Sep 18, 2023

Conversation

@BurntSushi (Owner) commented Sep 17, 2023

Up until this point, Teddy was explicitly written using x86-64 SIMD routines, specifically ones from SSSE3 and AVX2. This PR shuffles Teddy's main implementation into code that is generic over a new Vector trait, and provides implementations of that Vector trait for x86-64's __m128i and __m256i, in addition to aarch64's uint8x16_t vector type. In effect, this greatly speeds up searches for a small number of patterns automatically on aarch64 (i.e., on Apple's new M1 and M2 chips).
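To make the abstraction concrete, here is a toy model of the Vector-trait idea, runnable anywhere. The method names are illustrative, not the crate's actual API; a u64 treated as 8 byte lanes (SWAR-style) stands in for a real SIMD register such as __m128i or uint8x16_t.

```rust
// Toy Vector trait in the spirit of the PR's abstraction.
// Names are illustrative; the real trait has more operations.
trait Vector: Copy {
    const BYTES: usize;
    fn splat(byte: u8) -> Self;
    fn load(data: &[u8]) -> Self;
    fn cmpeq(self, other: Self) -> Self;
    fn movemask(self) -> u64;
}

// Scalar stand-in: u64 as 8 lanes of u8, so the sketch runs on any arch.
impl Vector for u64 {
    const BYTES: usize = 8;

    fn splat(byte: u8) -> u64 {
        u64::from_ne_bytes([byte; 8])
    }

    fn load(data: &[u8]) -> u64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&data[..8]);
        u64::from_ne_bytes(buf)
    }

    fn cmpeq(self, other: u64) -> u64 {
        // byte-wise equality: a lane becomes 0xFF on match, 0x00 otherwise
        let (a, b) = (self.to_ne_bytes(), other.to_ne_bytes());
        let mut out = [0u8; 8];
        for i in 0..8 {
            out[i] = if a[i] == b[i] { 0xFF } else { 0x00 };
        }
        u64::from_ne_bytes(out)
    }

    fn movemask(self) -> u64 {
        // collect the high bit of each lane into a bitmask
        let bytes = self.to_ne_bytes();
        let mut mask = 0u64;
        for (i, &b) in bytes.iter().enumerate() {
            if b & 0x80 != 0 {
                mask |= 1 << i;
            }
        }
        mask
    }
}

fn main() {
    let chunk = u64::load(b"xxSherlo");
    let eq = chunk.cmpeq(u64::splat(b'S'));
    let mask = eq.movemask();
    // 'S' sits in lane 2 of the chunk
    assert_eq!(mask.trailing_zeros(), 2);
    println!("first candidate at lane {}", mask.trailing_zeros());
}
```

The search core is then written once against the trait, and each architecture only has to supply these primitive operations.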

An ad hoc ripgrep benchmark is worth a thousand words. On my M2 mac mini:

$ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    8.196
user    7.726
sys     0.469
maxmem  5728 MB
faults  17

$ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    1.127
user    0.701
sys     0.425
maxmem  4880 MB
faults  13

This PR also drops criterion in favor of rebar for benchmarking, which is specialized to the task of regex/substring searching. In that vein, we can look at top-level AhoCorasick benchmarks before and after:

benchmark                      rust/old-aho-corasick/default/leftmost-first  rust/aho-corasick/default/leftmost-first
---------                      --------------------------------------------  ----------------------------------------
curated/sherlock-en            23.7 GB/s (1.00x)                             23.8 GB/s (1.00x)
curated/sherlock-casei-en      659.7 MB/s (12.21x)                           7.9 GB/s (1.00x)
curated/sherlock-ru            24.3 GB/s (1.00x)                             24.3 GB/s (1.00x)
curated/sherlock-casei-ru      4.5 GB/s (1.43x)                              6.5 GB/s (1.00x)
curated/sherlock-zh            28.9 GB/s (1.00x)                             28.9 GB/s (1.00x)
curated/alt-sherlock-en        659.7 MB/s (12.05x)                           7.8 GB/s (1.00x)
curated/alt-sherlock-casei-en  640.0 MB/s (4.36x)                            2.7 GB/s (1.00x)
curated/alt-sherlock-ru        659.8 MB/s (7.92x)                            5.1 GB/s (1.00x)
curated/alt-sherlock-casei-ru  604.0 MB/s (2.48x)                            1497.8 MB/s (1.00x)
curated/alt-sherlock-zh        663.1 MB/s (13.47x)                           8.7 GB/s (1.00x)
curated/dictionary-15          172.7 MB/s (1.05x)                            181.1 MB/s (1.00x)
sherlock/name-alt1             29.4 GB/s (1.01x)                             29.8 GB/s (1.00x)
sherlock/name-alt2             10.7 GB/s (1.00x)                             9.8 GB/s (1.08x)
sherlock/name-alt3             652.7 MB/s (10.47x)                           6.7 GB/s (1.00x)
sherlock/name-alt4             11.1 GB/s (1.00x)                             10.8 GB/s (1.03x)
sherlock/name-alt5             6.9 GB/s (1.01x)                              7.0 GB/s (1.00x)
sherlock/name-alt6             83.1 GB/s (1.00x)                             83.1 GB/s (1.00x)
sherlock/name-alt7             46.3 GB/s (1.00x)                             46.3 GB/s (1.00x)
sherlock/name-nocase1          637.2 MB/s (2.55x)                            1623.7 MB/s (1.00x)
sherlock/name-nocase2          643.7 MB/s (9.10x)                            5.7 GB/s (1.00x)
sherlock/name-nocase3          641.9 MB/s (3.21x)                            2.0 GB/s (1.00x)
sherlock/words5000             145.8 MB/s (1.01x)                            147.7 MB/s (1.00x)

Basically, there are 2-10x improvements across the board. These primarily apply to throughput where you expect matches to occur relatively rarely with respect to the size of the haystack.

For x86-64, there might be some small latency improvements, and there were a few tweaks to the various prefilter heuristics used. But one should generally expect comparable performance to what came before this PR. If you notice any meaningful regressions, please open a new issue with enough detail for me to reproduce the problem.

This PR also makes it possible for Teddy to be pretty easily ported to other vector types as well. I took a look at wasm and it's not obvious that it has the right routines to make it work, but I probably spent all of 10 minutes doing a quick skim. I'm not a wasm expert, so if anyone has a good handle on wasm32 SIMD, you might try your hand at implementing the Vector trait. (If you need help, please open an issue.)

(I do hope to get a new ripgrep release out soon with this improvement and an analogous improvement to aarch64 in the memchr crate.)

I botched the memchr 2.6 MSRV because it actually requires Rust 1.61
and not Rust 1.60. This crate's MSRV is Rust 1.60, so pin memchr to a
version that works on Rust 1.60 (for x86-64 at least).

Ref rust-lang/regex#1081
A subsequent commit will remove the Criterion benchmarks.

This essentially makes benchmarking homogeneous now between the regex,
aho-corasick and memchr crates.

For now, we start with the benchmarks that were defined for Criterion;
we will refine these over time. We also include a naive multi-substring
algorithm and the daachorse crate as additional engines.
We have switched to rebar.

We're going to replace it with a Vector trait that can be implemented
for multiple vector types. Similar to what was done with memchr[1].

Before doing that, we move the old stuff aside.

[1]: https://github.com/BurntSushi/memchr/blob/f6188dbbca2b529100852e4509d48d2d002a674a/src/vector.rs
@BurntSushi force-pushed the ag/aarch64 branch 2 times, most recently from 1c71ca6 to 7134d5c on September 18, 2023 at 00:10
While this does technically rewrite Teddy, we don't really do any core
changes to how it previously worked. We mostly just shuffle and
re-organize code so that it's written to use a generic vector type
instead of explicitly specialized to __m128i and __m256i. We also use
this opportunity to introduce a sprinkling of const generics, which
helps reduce code duplication even more.
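As a hypothetical illustration of the kind of deduplication const generics enable (the type and method names below are made up, not the crate's actual items), one generic definition can replace what would otherwise be per-variant copies. This sketch also hints at Teddy's nybble-indexed bucket masks:

```rust
// One generic definition instead of separate 2-bucket/8-bucket copies.
// `BUCKETS` is the Teddy bucket count; names are illustrative only.
struct Mask<const BUCKETS: usize> {
    // one bitset per low-nybble value; bit b set means "bucket b may match"
    lo: [u8; 16],
}

impl<const BUCKETS: usize> Mask<BUCKETS> {
    fn new() -> Self {
        assert!(BUCKETS <= 8, "a u8 bitset holds at most 8 buckets");
        Mask { lo: [0; 16] }
    }

    fn add(&mut self, bucket: usize, byte: u8) {
        assert!(bucket < BUCKETS);
        self.lo[(byte & 0xF) as usize] |= 1 << bucket;
    }

    fn candidates(&self, byte: u8) -> u8 {
        self.lo[(byte & 0xF) as usize]
    }
}

fn main() {
    // The same definition works for any bucket count; previously each
    // variant might have been written out separately.
    let mut m: Mask<2> = Mask::new();
    m.add(0, b'S');
    m.add(1, b's');
    // 'S' (0x53) and 's' (0x73) share the low nybble 0x3, so both
    // buckets are candidates for either byte.
    assert_eq!(m.candidates(b'S'), 0b11);
    println!("candidate buckets for b'S': {:#04b}", m.candidates(b'S'));
}
```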

We also switch from an enum for dispatching between Teddy variants to
dynamic dispatch via a trait. Benchmarks suggest there really isn't any
meaningful difference here, and I kind of prefer the dynamic dispatch
route for difficult to explain reasons. But I might waffle on this.
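The enum-to-trait-object switch might look roughly like this (a minimal sketch with made-up names; the real variants are the SSSE3/AVX2/NEON Teddy implementations, not the trivial scanners used here):

```rust
// Illustrative sketch: pick an implementation once at construction time
// behind a trait object, instead of matching on an enum at every search.
trait TeddySearcher {
    fn find(&self, haystack: &[u8]) -> Option<usize>;
}

struct Slim; // stand-in for, say, a 128-bit Teddy variant
struct Fat;  // stand-in for a 256-bit variant

impl TeddySearcher for Slim {
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == b'S')
    }
}

impl TeddySearcher for Fat {
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == b'S')
    }
}

fn build(use_wide: bool) -> Box<dyn TeddySearcher> {
    // in the real crate the choice depends on CPU features and pattern
    // characteristics; a boolean stands in for that decision here
    if use_wide { Box::new(Fat) } else { Box::new(Slim) }
}

fn main() {
    let searcher = build(true);
    assert_eq!(searcher.find(b"xxSherlock"), Some(2));
    println!("found at {:?}", searcher.find(b"xxSherlock"));
}
```

The virtual call happens once per search rather than once per byte, which is one reason the dispatch strategy tends not to show up in benchmarks.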

And of course, the point of the exercise: we introduce an
implementation of the Vector trait for `uint8x16_t` on `aarch64`. Kudos
to the sse2neon[1] project for making that port much faster than it
would have been otherwise.

[1]: https://github.com/DLTcollab/sse2neon
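For a sense of what sse2neon-style mapping involves: one operation Teddy relies on is x86's `_mm_alignr_epi8` (palignr), which stitches adjacent 16-byte blocks together; on NEON the corresponding instruction is `vextq_u8`. A portable model of the operation's semantics (plain arrays instead of real vector registers):

```rust
// Portable model of palignr: concatenate b (low half) and a (high half)
// into 32 bytes, then take the 16-byte window starting at `shift`.
// On x86-64 this is _mm_alignr_epi8(a, b, shift); on NEON, vextq_u8(b, a, shift).
fn palignr(a: [u8; 16], b: [u8; 16], shift: usize) -> [u8; 16] {
    assert!(shift <= 16);
    let mut concat = [0u8; 32];
    concat[..16].copy_from_slice(&b);
    concat[16..].copy_from_slice(&a);
    let mut out = [0u8; 16];
    out.copy_from_slice(&concat[shift..shift + 16]);
    out
}

fn main() {
    let prev = *b"ABCDEFGHIJKLMNOP"; // previous 16-byte block
    let cur = *b"QRSTUVWXYZ012345"; // current 16-byte block
    // shift of 15 yields the window straddling the block boundary:
    // the last byte of `prev` followed by the first 15 bytes of `cur`.
    let window = palignr(cur, prev, 15);
    assert_eq!(&window, b"PQRSTUVWXYZ01234");
    println!("{}", String::from_utf8_lossy(&window));
}
```

This is what lets Teddy check fingerprints that span the boundary between one vector's worth of haystack and the next.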
@itamarst commented:
Out of curiosity, why not use something like the wide crate?

@BurntSushi (Owner, Author) commented Sep 18, 2023

@itamarst a few reasons:

  1. I don't take dependencies lightly. It has to be extremely compelling. (My standard here is very intentionally impossibly high for core crates like aho-corasick.)
  2. From a quick glance, the wide crate does not provide the operations necessary for Teddy. The critical ops are shuffles and palignr.
  3. The wide crate works via safe_arch and that in turn only uses compile-time knowledge of what SIMD instructions are available. This is critically inappropriate in pretty much all cases except for when you own the compile step for all your users. It would mean, for example, that most users of ripgrep wouldn't benefit from SIMD optimizations such as Teddy. In this PR (and before), SIMD support is detected at runtime, regardless of what options you compile the crate with. (Now technically, NEON is part of aarch64, so safe_arch would be appropriate in that specific case, but this doesn't mitigate (1) and (2) above. And since (3) applies to x86-64, there's no real benefit to using wide even if this was the only concern.)
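The runtime-detection approach in point (3) can be sketched as follows (the tier labels are illustrative; the crate's actual dispatch considers more than CPU features):

```rust
// Sketch of runtime (not compile-time) SIMD selection: a binary built
// for baseline x86-64 still uses AVX2 on CPUs that have it, because the
// check happens when the searcher is constructed, not when it's compiled.
fn choose_impl() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("ssse3") {
            return "ssse3";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is part of the aarch64 baseline, so no runtime check is
        // needed there -- the case where safe_arch would have sufficed.
        return "neon";
    }
    "scalar fallback"
}

fn main() {
    println!("selected: {}", choose_impl());
}
```

Compile-time selection (the safe_arch model) would instead bake the lowest common denominator into the binary unless the builder opts in with `-C target-feature` or similar flags.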

@BurntSushi BurntSushi merged commit 0be6fe4 into master Sep 18, 2023
12 checks passed
@BurntSushi BurntSushi deleted the ag/aarch64 branch September 18, 2023 13:20
BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request Sep 18, 2023
This brings in aarch64 SIMD support for Teddy[1]. In effect, searches
from which multiple (but a small number of) literals are extracted will
likely get much faster on aarch64 (i.e., Apple silicon). For
example, from the PR, on my M2 mac mini:

    $ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055

    real    8.196
    user    7.726
    sys     0.469
    maxmem  5728 MB
    faults  17

    $ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
    3055

    real    1.127
    user    0.701
    sys     0.425
    maxmem  4880 MB
    faults  13

w00t.

[1]: BurntSushi/aho-corasick#129
ink-splatters pushed a commit to ink-splatters/ripgrep that referenced this pull request Oct 25, 2023