how does this crate compare to stringzilla? #159

BurntSushi · 2024-08-24T14:53:37Z

BurntSushi
Aug 24, 2024
Maintainer

StringZilla is an impressive project that provides many interesting string primitives with SIMD acceleration. Certainly much more than this crate. The scope of StringZilla is a fair bit broader than memchr. There are a number of claims made in the StringZilla README and related benchmarks. I wanted to go through them and provide some responses that contextualize the claims.

First, in the README (commit 07e0a2a):

"Unlike memchr, the throughput of stringzilla is high in both normal and reverse-order searches. It also provides no constraints on the size of the character set, while memchr allows only 1, 2, or 3 characters."

I agree: the memchr crate specifically does not optimize the reverse case and provides only limited operations based on the size of the character set. The limitations on the character set size are related to the types of SIMD algorithms used. Once you get to ~3-5 bytes, at least using memchr's SIMD technique, performance tends to drop off.

With regard to reverse substring searches, I made a very intentional decision not to optimize that case with SIMD inside of memchr. The specific reason was that it's somewhat more niche to need fast reverse substring search. The cost of adding it is a lot more code. This is something I'd be open to adding if someone requested it, but I believe nobody has yet. If it were added, it would probably be opt-in to avoid the likely sizeable hit to compile times that would result from it. Note that memchr functions in this crate do support reverse SIMD acceleration since that's a bit more common (e.g., finding the extents of a line in a text file).
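To make the "extents of a line" use case concrete, here is a minimal sketch of why a reverse single-byte search is useful alongside a forward one. It uses only std iterator methods as stand-ins so that it compiles without dependencies; with the memchr crate, the two searches would be calls to `memchr::memrchr` and `memchr::memchr` instead. The function name `line_extents` is hypothetical and not part of any crate.

```rust
/// Returns the half-open byte range `(start, end)` of the line containing
/// `pos`, excluding the line terminator. `line_extents` is a hypothetical
/// helper written for this illustration.
fn line_extents(haystack: &[u8], pos: usize) -> (usize, usize) {
    assert!(pos <= haystack.len());
    // Reverse search: the last `\n` strictly before `pos` ends the previous
    // line, so the current line starts one byte after it.
    let start = haystack[..pos]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    // Forward search: the first `\n` at or after `pos` ends the current line.
    let end = haystack[pos..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(haystack.len(), |i| pos + i);
    (start, end)
}
```

Given a match offset inside a large buffer, the reverse search finds the start of the line and the forward search finds its end, which is the pattern a line-oriented search tool needs on every match.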

Unlike StringZilla though, the memchr crate has better worst case time complexity, including for reverse searches. StringZilla documents its library as providing O(h * n) worst case time for searching, but memchr provides O(h + n) worst case time complexity. This can matter quite a lot depending on your search queries.
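To illustrate what the gap between the two bounds can mean in practice, here is a small sketch, assuming h is the haystack length and n is the needle length. It is a textbook naive search instrumented with a comparison counter. It is not StringZilla's algorithm, nor memchr's; it only shows the kind of input on which an O(h * n) bound is actually reached.

```rust
/// Naive substring search that also reports how many byte comparisons it
/// performed. Illustrative only: not taken from StringZilla or memchr.
fn naive_find_counting(haystack: &[u8], needle: &[u8]) -> (Option<usize>, usize) {
    let mut comparisons = 0;
    if needle.len() > haystack.len() {
        return (None, comparisons);
    }
    for start in 0..=haystack.len() - needle.len() {
        let mut matched = 0;
        while matched < needle.len() {
            comparisons += 1;
            if haystack[start + matched] != needle[matched] {
                break;
            }
            matched += 1;
        }
        if matched == needle.len() {
            return (Some(start), comparisons);
        }
    }
    (None, comparisons)
}
```

With a haystack of 1000 `a` bytes and a needle of 99 `a` bytes followed by one `b`, every one of the 901 candidate positions costs 100 comparisons before failing, for 90,100 comparisons in total. An algorithm with an O(h + n) guarantee is bounded by a constant multiple of 1,100 on the same input.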

As linked above, StringZilla also provides a separate repository for a targeted benchmark between StringZilla and the memchr crate. I find the headline of this repository to be remarkably misleading. "Up to 7x faster" gives a very false impression of the results. Moreover, what the benchmark is actually measuring is perhaps not what one would expect. Namely, it isn't just measuring "how long does it take to find a needle in a haystack." It's measuring how long it takes to find a collection of needles in the same haystack, and crucially, including searcher construction for each of those needles. So if, say, a substring implementation spent a lot more work up-front trying to build a fast searcher, then that could easily dominate the benchmark and mask the typical difference in throughput.

That's not to say that measuring searcher construction is invalid. But it should be one dimension of a good benchmark and it should absolutely be disclosed in the discussion of results.

Why does the benchmark include measurement of searcher construction and no benchmarks without it? That's hard to say precisely, but one possible answer is that StringZilla actually doesn't support it! If you look at its Rust API, it doesn't provide a way to build a searcher with a needle independent of the haystack. (Its MatcherType looks close, but it just dispatches to find(haystack, needle) internally.) This is in contrast to the memchr crate, which provides a memmem::Finder type that one can build and then re-use. I don't know if this is just because the Rust API is incomplete or if it's a fundamental limitation of StringZilla's API itself, but in my view, it's an API design shortcoming to not permit reusing searchers.
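The distinction between a prebuilt searcher and a oneshot search can be sketched as follows. This is a std-only illustration under stated assumptions: `HorspoolSearcher` and `find_oneshot` are hypothetical names, and Boyer-Moore-Horspool is swapped in purely because its construction step is easy to see. It is not the algorithm used by memchr's `memmem::Finder` or by StringZilla.

```rust
/// Hypothetical searcher: all needle-dependent work happens in `new`.
struct HorspoolSearcher {
    needle: Vec<u8>,
    /// shift[b] = how far the window may advance when its last byte is `b`.
    shift: [usize; 256],
}

impl HorspoolSearcher {
    /// Construction cost, paid once per needle.
    fn new(needle: &[u8]) -> HorspoolSearcher {
        let n = needle.len();
        // Bytes absent from the needle allow a full-length shift.
        let mut shift = [n.max(1); 256];
        // Every needle byte except the last gets a shorter shift.
        for (i, &b) in needle.iter().enumerate().take(n.saturating_sub(1)) {
            shift[b as usize] = n - 1 - i;
        }
        HorspoolSearcher { needle: needle.to_vec(), shift }
    }

    /// Search cost only: reuses the table built in `new`.
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        let n = self.needle.len();
        if n == 0 {
            return Some(0);
        }
        let mut start = 0;
        while start + n <= haystack.len() {
            if &haystack[start..start + n] == self.needle.as_slice() {
                return Some(start);
            }
            start += self.shift[haystack[start + n - 1] as usize];
        }
        None
    }
}

/// Oneshot search: rebuilds the searcher on every call, so construction
/// cost is included in every measurement of this function.
fn find_oneshot(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    HorspoolSearcher::new(needle).find(haystack)
}
```

When the same needle is searched in many haystacks (for example, once per line of a file), building the searcher once and calling `find` repeatedly amortizes the construction cost. A benchmark that only ever calls the equivalent of `find_oneshot` cannot show that benefit.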

I discussed this a bit in a Reddit comment as well, which includes interactions with the author of StringZilla where I bring this issue up. But from my perspective, this criticism was not well received.

Now of course, memchr has its own benchmark suite. It is of considerable size and measures all sorts of different workloads. It is built inside the rebar harness. To run the benchmarks locally, install rebar and then build the memchr and stringzilla runner programs. From the root of this repository:

$ rebar build -e '^rust/memchr/memchr/' -e stringzilla

To test measurements before capturing them, run:

$ rebar measure -e 'stringzilla|rust/memchr/memmem/(oneshot|prebuilt)' -t

This should complete successfully in reasonable time. If it fails, then something has gone wrong that needs to be debugged. Otherwise, run measurements and capture the results:

$ rebar measure -e 'stringzilla|rust/memchr/memmem/(oneshot|prebuilt)' | tee results.csv

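Now we can rank them overall via the geometric mean of speed ratios recorded for each benchmark. (The commands below read my results from tmp/stringzilla.csv; substitute the path where you saved yours.)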

$ rebar rank tmp/stringzilla.csv --intersection
Engine                       Version  Geometric mean of speed ratios  Benchmark count
------                       -------  ------------------------------  ---------------
rust/memchr/memmem/prebuilt  2.7.4    1.19                            54
stringzilla/memmem/oneshot   3.9.3    1.42                            54
rust/memchr/memmem/oneshot   2.7.4    1.52                            54

The --intersection flag ensures we only include benchmarks for which all three engines have measurements.
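For readers unfamiliar with the ranking metric, here is a small sketch of the arithmetic. It assumes that a benchmark's speed ratio for an engine is the best throughput on that benchmark divided by that engine's throughput, so that 1.00 marks the fastest engine, which is consistent with the "(1.00x)" annotations in the comparison tables. The function names are hypothetical and this is not rebar's code.

```rust
/// Speed ratio of each engine on one benchmark, relative to the fastest
/// engine on that benchmark (assumed definition; 1.0 means fastest).
fn speed_ratios(throughputs: &[f64]) -> Vec<f64> {
    let best = throughputs.iter().cloned().fold(f64::MIN, f64::max);
    throughputs.iter().map(|t| best / t).collect()
}

/// Geometric mean of a non-empty slice of positive ratios, computed in
/// log space: exp(mean(ln(r))).
fn geometric_mean(ratios: &[f64]) -> f64 {
    let log_sum: f64 = ratios.iter().map(|r| r.ln()).sum();
    (log_sum / ratios.len() as f64).exp()
}
```

For example, throughputs of 4.4 GB/s and 27.2 GB/s on one benchmark give ratios of roughly 6.18 and 1.00 (the tables report ratios computed from unrounded measurements, so they can differ slightly). The geometric mean is used instead of the arithmetic mean so that a single extreme ratio does not dominate the ranking: ratios of 1.0 and 4.0 average to 2.0 rather than 2.5.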

But this result is already revealing: if one uses a prebuilt searcher from memchr, then it is on average faster than StringZilla. While one cannot always prebuild a searcher, there are lots of important cases where you can and where it can make a big difference.

We can also look at the benchmark results in more detail:

$ rebar cmp tmp/stringzilla.csv -e prebuilt -e zilla -t 1.2
benchmark                                                   rust/memchr/memmem/prebuilt  stringzilla/memmem/oneshot
---------                                                   ---------------------------  --------------------------
memmem/byterank/binary                                      4.4 GB/s (6.20x)             27.2 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  49.6 GB/s (1.00x)            31.8 GB/s (1.56x)
memmem/code/rust-library-never-fn-strength-paren            49.3 GB/s (1.00x)            36.8 GB/s (1.34x)
memmem/code/rust-library-never-fn-quux                      50.8 GB/s (1.00x)            38.5 GB/s (1.32x)
memmem/code/rust-library-common-fn                          26.4 GB/s (1.00x)            18.4 GB/s (1.44x)
memmem/code/rust-library-common-let                         20.1 GB/s (1.00x)            14.3 GB/s (1.41x)
memmem/pathological/md5-huge-last-hash                      47.3 GB/s (1.00x)            35.7 GB/s (1.33x)
memmem/pathological/rare-repeated-huge-match                1934.0 MB/s (1.00x)          563.7 MB/s (3.43x)
memmem/pathological/rare-repeated-small-tricky              28.3 GB/s (1.00x)            22.2 GB/s (1.27x)
memmem/pathological/rare-repeated-small-match               1920.8 MB/s (1.00x)          623.9 MB/s (3.08x)
memmem/pathological/defeat-simple-vector-alphabet           4.3 GB/s (10.28x)            44.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      20.6 GB/s (1.00x)            2.4 GB/s (8.51x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  1231.1 MB/s (36.17x)         43.5 GB/s (1.00x)
memmem/subtitles/common/huge-en-that                        36.2 GB/s (1.00x)            24.8 GB/s (1.46x)
memmem/subtitles/common/huge-en-you                         15.6 GB/s (1.00x)            10.6 GB/s (1.47x)
memmem/subtitles/common/huge-ru-that                        33.4 GB/s (1.00x)            18.3 GB/s (1.82x)
memmem/subtitles/common/huge-ru-not                         15.7 GB/s (1.00x)            3.5 GB/s (4.55x)
memmem/subtitles/common/huge-zh-that                        31.4 GB/s (1.00x)            24.6 GB/s (1.28x)
memmem/subtitles/common/huge-zh-do-not                      18.0 GB/s (1.00x)            14.2 GB/s (1.27x)
memmem/subtitles/never/huge-en-john-watson                  40.5 GB/s (1.00x)            31.9 GB/s (1.27x)
memmem/subtitles/never/huge-en-all-common-bytes             46.4 GB/s (1.00x)            38.0 GB/s (1.22x)
memmem/subtitles/never/teeny-en-john-watson                 1668.9 MB/s (1.00x)          1027.0 MB/s (1.62x)
memmem/subtitles/never/teeny-en-all-common-bytes            1780.2 MB/s (1.00x)          953.7 MB/s (1.87x)
memmem/subtitles/never/huge-ru-john-watson                  32.3 GB/s (1.31x)            42.3 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  49.9 GB/s (1.00x)            31.1 GB/s (1.61x)
memmem/subtitles/never/teeny-zh-john-watson                 1970.9 MB/s (1.00x)          1055.9 MB/s (1.87x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.4 GB/s (1.00x)            40.9 GB/s (1.26x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              1570.8 MB/s (1.00x)          953.7 MB/s (1.65x)
memmem/subtitles/rare/teeny-en-sherlock                     1335.1 MB/s (1.00x)          953.7 MB/s (1.40x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              2.2 GB/s (1.00x)             1251.7 MB/s (1.78x)
memmem/subtitles/rare/teeny-ru-sherlock                     1741.5 MB/s (1.00x)          785.4 MB/s (2.22x)
memmem/subtitles/rare/huge-zh-sherlock                      49.0 GB/s (1.00x)            34.1 GB/s (1.44x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1137.1 MB/s (1.00x)          844.7 MB/s (1.35x)
memmem/subtitles/rare/teeny-zh-sherlock                     1231.8 MB/s (1.00x)          657.0 MB/s (1.88x)

Notice how this crate is actually quite a bit faster than StringZilla on almost every benchmark when the searcher is prebuilt. (In many of these benchmarks, prebuilding the searcher doesn't matter because the haystack is so big. But we'll compare oneshot searching next.) The main cases where StringZilla is faster are pathological.

A oneshot comparison is more apples-to-apples, but like StringZilla's benchmark, it omits the speed improvements that come from prebuilding the searcher when that's possible:

$ rebar cmp tmp/stringzilla.csv -e oneshot -e zilla -t 1.2
benchmark                                                   rust/memchr/memmem/oneshot  stringzilla/memmem/oneshot
---------                                                   --------------------------  --------------------------
memmem/byterank/binary                                      4.1 GB/s (6.59x)            27.2 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  49.3 GB/s (1.00x)           31.8 GB/s (1.55x)
memmem/code/rust-library-common-paren                       4.5 GB/s (1.30x)            5.9 GB/s (1.00x)
memmem/code/rust-library-common-let                         11.8 GB/s (1.21x)           14.3 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        46.7 GB/s (1.00x)           38.0 GB/s (1.23x)
memmem/pathological/md5-huge-last-hash                      46.5 GB/s (1.00x)           35.7 GB/s (1.30x)
memmem/pathological/rare-repeated-huge-match                458.6 MB/s (1.23x)          563.7 MB/s (1.00x)
memmem/pathological/rare-repeated-small-match               492.1 MB/s (1.27x)          623.9 MB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           4.3 GB/s (10.28x)           44.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      21.6 GB/s (1.00x)           2.4 GB/s (8.92x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  1232.5 MB/s (36.13x)        43.5 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         5.6 GB/s (1.88x)            10.6 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         7.4 GB/s (1.00x)            3.5 GB/s (2.13x)
memmem/subtitles/common/huge-zh-that                        19.7 GB/s (1.25x)           24.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      9.0 GB/s (1.58x)            14.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  51.3 GB/s (1.00x)           31.9 GB/s (1.61x)
memmem/subtitles/never/huge-en-all-common-bytes             47.4 GB/s (1.00x)           38.0 GB/s (1.25x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.37x)         1405.4 MB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  51.2 GB/s (1.00x)           42.3 GB/s (1.21x)
memmem/subtitles/never/teeny-ru-john-watson                 1213.8 MB/s (1.22x)         1483.5 MB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  49.8 GB/s (1.00x)           31.1 GB/s (1.60x)
memmem/subtitles/never/teeny-zh-john-watson                 799.0 MB/s (1.32x)          1055.9 MB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           40.9 GB/s (1.25x)
memmem/subtitles/rare/huge-en-medium-needle                 47.7 GB/s (1.00x)           39.7 GB/s (1.20x)
memmem/subtitles/rare/huge-ru-sherlock                      51.2 GB/s (1.00x)           42.5 GB/s (1.21x)
memmem/subtitles/rare/teeny-ru-sherlock                     1335.1 MB/s (1.00x)         785.4 MB/s (1.70x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           34.1 GB/s (1.46x)
memmem/subtitles/rare/teeny-zh-sherlock                     1137.1 MB/s (1.00x)         657.0 MB/s (1.73x)

Notice that in benchmarks with a large haystack and relatively low match frequency, memchr's oneshot approach is still faster than StringZilla. That's because throughput dominates in these benchmarks. But in cases where searcher construction plays a larger role, StringZilla has the edge.

I have not studied the source code of StringZilla in detail, but memchr's higher searcher construction costs are neither for lack of trying on my part nor without reason. I'm sure there's room for improvement of course, but the fact that StringZilla only guarantees O(h * n) worst case time complexity could be a root cause here. Since memchr guarantees a better worst case bound of O(h + n), it has to do more work up-front in at least some cases in order to uphold that bound. There is basically more infrastructure in place to account for it. For example, memchr has a complete Two-Way implementation inside of it.

So, in summary, I think the StringZilla materials:

Overstate the performance benefits in comparison with memchr.
Omit very crucial details of the benefits of memchr (instead only listing the benefits of StringZilla without listing the costs).
Omit crucial details for contextualizing their benchmark results, and consequently, the results appear more general than they really are IMO.

BurntSushi · 2024-08-24T14:53:53Z

BurntSushi
Aug 24, 2024
Maintainer Author

I answered the question in the OP.
