Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beat ripgrep on all benchmarks, permanently and for all time #102

Open
gvansickle opened this issue Oct 2, 2016 · 21 comments
Open

Beat ripgrep on all benchmarks, permanently and for all time #102

gvansickle opened this issue Oct 2, 2016 · 21 comments

Comments

@gvansickle
Copy link
Owner

http://blog.burntsushi.net/ripgrep/

While we already win a few, and hold our own on others for the most part, @BurntSushi has found a few bugs and does actually beat us in some cases.

OH NOW IT IS ON! ;-)

@refi64
Copy link

refi64 commented Oct 2, 2016

+1 for the "This Time It's Personal" milestone. ;)

@BurntSushi
Copy link

BurntSushi commented Oct 2, 2016

Haha nice, best of luck to ya!

I'll try to highlight some of the most important points from my writeup as they related to ucg:

  1. ucg is really fast. It is roughly competitive with ripgrep on simple literals. I suspect this is in large part due to PCRE2's JIT, which I know to be quite fast. (I benchmark Rust's regex library against PCRE2's JIT, and there's no obvious winner, although there exist some benchmarks where one is faster than the other.)
  2. Benchmarking ucg against rg is a little tricky, since ucg is a whitelist-only tool. This means you'll need to run rg with the -u -tall flags to approximate what ucg does by default. (The whitelists aren't exactly the same, so you might end up with different results occasionally, but it's close enough.) Additionally, you'll need to make sure rg counts lines, since I don't think line counting can be turned off in ucg.
  3. If you compile ucg with -march=native, then you should do the same with ripgrep. You'll need a Rust nightly compiler, and then you can run RUSTFLAGS="-C target-cpu=native" cargo build --release --features simd-accel. (Note that the benchmarks reported in my writeup were done using the ripgrep 0.1.2 binary on Github, which was compiled with -C target-feature=+ssse3. This notably misses a POPCNT optimization that target-cpu=native will enable, which matters on benchmarks that count lines.)
  4. In order to be comparable on a few benchmarks, you'll probably need to enable Unicode support in PCRE2. But this will mean that you can't use PCRE2_NEVER_UTF | PCRE2_NEVER_UCP. You also probably can't use PCRE2_NO_UTF_CHECK either. According to the PCRE2 docs:

When you call pcre2_match(), as well as testing for invalid options, a number of other sanity checks are performed on the arguments. For example, if the subject pointer is NULL, an immediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined.

It will be interesting to see how much of a performance hit this entails!

Also, I'll be adding a parallel directory iterator (hopefully soon), so I've still got a few more tricks left up my sleeve. :-)

@gvansickle
Copy link
Owner Author

Hey @BurntSushi ,

Thanks again for the benchmarks and your excellent writeup thereof.

1 . ucg is really fast.

Aw shucks, I hate to blow my own horn, so it's especially nice when someone does it for me ;-). Seriously, like you say, much/most of the credit goes to PCRE/PCRE2's JIT. In my own testing as well, there's a big hit if JIT isn't enabled.

2 . Benchmarking ucg against rg is a little tricky, [...]

Definitely; benchmarking all these tools (ucg, rg, ag, ack, GNU grep, BSD grep, pcre[2]grep) against each other in any kind of meaningful manner is a real challenge, due to the many differences you've cited better than I can reproduce here. That's been the number one time-consumer for me recently, as I've tried to get comparisons across not just different use cases (~regex+corpus), but across dimensions such as:

  • OS (kernel/userland/versions, where even ostensibly similar configurations require adaptation)
  • Hardware
    • VM/real
    • Number of cores/cpus
    • ISA extensions
  • Etc.

3 . If you compile ucg with -march=native, then you should do the same with ripgrep.

Indeed. However, I usually don't use a --march= line (though it is supported as of #95 via some herculean autotools trickery). This is mainly because when I go to build binary RPMs or DEBs for distribution, the resulting binaries need to be least-common-denominator, i.e. --march=x86_64. So beyond SSE2, everything else has to be sorted out at load- or run-time. Also, my profiling to date shows that the two major time sinks are PCRE{2}'s regex matching (which, since I'm at the mercy of each distro's build of the *.so, I can't do much about) and directory tree traversal (which again quickly becomes library-limited and/or I/O-bound), so so far I haven't seen much need for tuning the full app for a specific architecture, just the proverbial 20% that takes up 80% of the cycles.

[...] -C target-feature=+ssse3. This notably misses a POPCNT optimization that target-cpu=native will enable, which matters on benchmarks that count lines.

Indeed, though you may be interested in seeing (and I think you already have) how I'm handling this same issue in FileScanner_sse4_2.cpp::popcount16(). Kernighan's algorithm does a pretty respectable O(n) job in the number of set bits, and on x86 compiles down to... let me check... about 4 instructions including the loop's JNE. POPCNT still wins, but this seems to be a pretty close second.

Also, if you haven't seen it already, you may want to read my treatise on the POPCNT subject entitled "Portability Pointers: The POPCNT Instruction: Not Portable Unless It Is, And Then Only Sometimes". Basically, virtualization (VirtualBox in my tale, but also valgrind and certainly others) can play havoc with what CPUID claims is supported, and with what really is supported, and the two may disagree. Hopefully Rust handles this for you, but that one was a real head-scratcher for me for a while.

4 . In order to be comparable on a few benchmarks, you'll probably need to enable Unicode support in PCRE2.

Yeah, I've been meaning to see what happens if I turn on UTF-8. Probably time to start some experimentation.

Good to hear from you, and thanks again,

Gary R. Van Sickle

@genivia-inc
Copy link

genivia-inc commented Sep 30, 2019

Love this conversation! Great grep implementations to use and compare!

This got me motivated to spent a couple of weekends coding in C++11 to build a new portable tool GitHub ugrep. It has full support for Unicode. Its faster than ripgrep and GNU grep, at least for practical use cases with large files (13GB text file, 100MB Wikipedia and 35MB source code) and with lots of small files (Qt source). There is still room for improvement by tuning the algos and by using SIMD/AVX which isn't yet the case in this early version of ugrep v1.5. With Unicode support and many other new easy-to-use features it is a good start IMHO.

Love to hear from you what you think isn't so great and should be improved.

@BurntSushi
Copy link

@genivia-inc Neat tool. Looks like you put a lot of work into it. Unfortunately, I have some concerns with the way you're comparing it to other tools.

Firstly, I don't see any way to reproduce your benchmarks. Maybe I skimmed too quickly, but I don't see anything that discusses what actual commands were run, what corpora were searched and what patterns were used.

Secondly, you're benchmarking a pretty old version of ripgrep. ripgrep 0.10.0 was released over a year ago, but ugrep 1.5.0 was released 3 hours ago. You'd ideally run your benchmarks against ripgrep master, which has an unreleased performance bug fix: BurntSushi/ripgrep#1335

Thirdly, it is not easy to build your tool:

$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports the include directive... yes (GNU style)
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for REFLEX_DBGOUT_ in -lreflex... no
ugrep requires RE/flex 1.4.3 or greater: https://github.com/Genivia/RE-flex

$ make
make: *** No targets specified and no makefile found.  Stop.

Please consider making it easier to compile your tool. If your install instructions require me to run sudo (without using my system's package manager), then I'm not going to bother.

Fourthly, you make weird claims like the fact that Boost.Regex is faster than PCRE, which is just... not even remotely true in my experience. To me, this suggests you either have a bug in your measurement process or your benchmarks are deeply biased.

@genivia-inc
Copy link

genivia-inc commented Sep 30, 2019

As you know, you do not need sudo. Download RE/flex and build the library locally with ./build.sh.

The A's to your Q's:

  1. The README is updated to include the corpora I used for testing, as well as the non-sudo build steps.
  2. I'll take a look at ripgrep 1.5. I've developed a new method for searching that is implemented in my RE/flex DFA matcher, without using Aho Corasick etc which I found too slow in almost all cases. This new method is academic work that is forthcoming for publication. That's how it is so fast!
  3. As I mentioned, installing RE/flex is pretty easy. RE/flex is updated when I find a way to make ugrep faster.
  4. You can download the tests (see GitHub RE/flex that show that normally PCRE (w/o JIT) is slower than Boost.Regex and see for yourself. It shows that the test is not buggy, which is kind of bizarre to suggest.

As for your "weird claims" comments...? This is thoroughly tested and there are thousands of users of RE/flex and folks have replicated these results based on the files that are available.

@BurntSushi
Copy link

BurntSushi commented Sep 30, 2019

which is kind of bizarre to suggest

Not really. I have my own set of benchmarks that suggest otherwise: https://github.com/rust-lang/regex/tree/master/bench I don't usually advertise those as they don't come with analysis, but you can compare the numbers the last time I ran my benchmarks against boost vs PCRE2. It's really not even close on most benchmarks. Others have drawn similar conclusions.

Anyway, the thing left for me to do is to dig into your benchmark. Not sure when I'll have time to do that. But I don't see a lot of details on how your benchmark is run in your RE/flex README... You just vaguely describe the setup, but I don't know how to reproduce them.

The README is updated to include the corpora I used for testing, as well as the non-sudo build steps.

Thanks. Yeah, a lot of your benchmarks are just about searching a lot of literals at once. ripgrep 0.10.0 was missing an important optimization there. It now just uses Aho-Corasick (or something better if it's a small number of literals) and beats ugrep on my machine:

$ time rg-0.10.0 --no-config -onF -f words1+1000 enwik8 > /tmp/output

real    2.381
user    2.344
sys     0.030
maxmem  116 MB
faults  0

$ time rg --no-config -onF -f words1+1000 enwik8 > /tmp/output

real    0.552
user    0.541
sys     0.010
maxmem  105 MB
faults  0

$ time ~/clones/ugrep/ugrep -onF -f words1+1000 enwik8 > /tmp/output

real    1.025
user    0.997
sys     0.027
maxmem  101 MB
faults  0

I haven't tried all of them, but ripgrep and ugrep appear to be comparable on T-7 for example:

$ time rg --no-config -onF -f words8+1000 enwik8 > /tmp/output

real    0.340
user    0.326
sys     0.013
maxmem  106 MB
faults  0

$ time rg-0.10.0 --no-config -onF -f words8+1000 enwik8 > /tmp/output

real    2.462
user    2.433
sys     0.027
maxmem  124 MB
faults  0

$ time ~/clones/ugrep/ugrep -onF -f words8+1000 enwik8 > /tmp/output

real    0.361
user    0.344
sys     0.017
maxmem  101 MB
faults  0

Your T-2 benchmark is interesting. Honestly, the file is so small that there is likely a lot of noise in the results. But I'm seeing a bigger difference on my system:

$ time rg --no-config -on 'serialize_[a-zA-Z0-9_]+Type' big.cpp > /tmp/output

real    0.035
user    0.024
sys     0.010
maxmem  37 MB
faults  0

$ time rg-0.10.0 --no-config -on 'serialize_[a-zA-Z0-9_]+Type' big.cpp > /tmp/output

real    0.039
user    0.031
sys     0.007
maxmem  40 MB
faults  0

$ time ~/clones/ugrep/ugrep -onE 'serialize_[a-zA-Z0-9_]+Type' big.cpp > /tmp/output

real    0.083
user    0.076
sys     0.007
maxmem  36 MB
faults  0

Such a small file is hard to benchmark reliably. If we expand the file 50 times:

$ for ((i=0;i<50;i++)); do cat big.cpp; done > verybig.cpp

then the difference is much more noticeable:

$ time rg --no-config -on 'serialize_[a-zA-Z0-9_]+Type' verybig.cpp > /tmp/output

real    0.799
user    0.705
sys     0.093
maxmem  1728 MB
faults  0

$ time rg-0.10.0 --no-config -on 'serialize_[a-zA-Z0-9_]+Type' verybig.cpp > /tmp/output

real    0.840
user    0.741
sys     0.096
maxmem  1728 MB
faults  0

$ time ~/clones/ugrep/ugrep -onE 'serialize_[a-zA-Z0-9_]+Type' verybig.cpp > /tmp/output

real    2.941
user    2.648
sys     0.290
maxmem  12 MB
faults  0

I also can't reproduce your T-8 benchmark while running on the v5.9.2 checkout of https://github.com/qt/qt5:

$ time rg --no-config -o '#[[:space:]]*include[[:space:]]+"[^"]+"' -g '*.h' -g '*.hpp' -g '*.cpp' > /tmp/output

real    0.370
user    2.489
sys     1.371
maxmem  53 MB
faults  0

$ time rg-0.10.0 --no-config -o '#[[:space:]]*include[[:space:]]+"[^"]+"' -g '*.h' -g '*.hpp' -g '*.cpp' > /tmp/output

real    0.353
user    2.354
sys     1.276
maxmem  58 MB
faults  0

$ time ~/clones/ugrep/ugrep -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -Oh,hpp,cpp > /tmp/output

real    1.385
user    1.796
sys     1.980
maxmem  13 MB
faults  0

For your T-9 benchmark, it seems like -J1 doesn't force ugrep to a single thread? If I force the issue (using taskset on Linux), then I can observe a decrease in performance:

$ time rg -o '#[[:space:]]*include[[:space:]]+"[^"]+"' -g '*.h' -g '*.hpp' -g '*.cpp' -j1 > /tmp/output

real    1.972
user    1.416
sys     0.541
maxmem  12 MB
faults  0

$ time rg-0.10.0 --no-config -o '#[[:space:]]*include[[:space:]]+"[^"]+"' -g '*.h' -g '*.hpp' -g '*.cpp' -j1 > /tmp/output

real    1.834
user    1.206
sys     0.611
maxmem  12 MB
faults  0

$ time ~/clones/ugrep/ugrep -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -Oh,hpp,cpp -J1 > /tmp/output

real    1.295
user    0.666
sys     0.619
maxmem  12 MB
faults  0

$ time taskset -c 0 ~/clones/ugrep/ugrep -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -Oh,hpp,cpp > /tmp/output

real    1.924
user    0.813
sys     1.046
maxmem  12 MB
faults  0

As for T-1, there's definitely a performance bug in ripgrep there related to the -w flag. Nice find there.

Now that I've looked through all of your benchmarks, they appear to suffer from some pretty severe bias. In particular, it looks like all of them have very high match counts. While this isn't an uncommon use case, searching for things with very few results is itself a pretty common thing to do. It would be wise to expand your benchmarks to include cases like that. Otherwise, a significant portion of your benchmark set is just going to be testing match and printing overhead. (Which, again, definitely should be measured. But not to the exclusion of everything else.)

Thanks for your "weird claims" comments...? This is thoroughly tested and there are thousands of users of RE/flex and folks have replicated these results based on the files that are available.

Sorry, but I don't know that and I don't know who your users are. I've been working on regex engines for years and I've never heard of RE/flex.

without using Aho Corasick etc which I found too slow in almost all cases

You might already know this, but ripgrep doesn't use Aho-Corasick when searching a smallish (~100 or less) set of patterns. It uses a special SIMD accelerated algorithm that was developed in Hyperscan: https://github.com/BurntSushi/aho-corasick/tree/master/src/packed/teddy

My machine is an Intel i7-6900K 3.20GHz running Linux 5.3.0.

@genivia-inc
Copy link

genivia-inc commented Oct 1, 2019

This is great and confirms that ripgrep can be beaten (sorry), because I have a lot of room left to continue improving ugrep after 1.5.0 for short word searches, e.g. by using SSE/AVX/NEON that is not yet used by the ugrep version you're testing. These planned SIMD improvements of ugrep will positively affect tests T-2, T-3, T-8, T-9 which you elaborate on in your response and I knew were not yet as fast without SIMD, but faster than ripgrep 0.10.0 (which also lacked such optimizations it seems).

The latest ugrep 2.0 version is even faster than before (but no SIMD yet) and clearly shows it beats performance of other grep for many practical use cases.

For the early ugrep version 1.5.0 that you compared, it is only best for the kind of searches that don't solely rely on single short word searches as SIMD is not yet used to optimize single word matches. It has excellent (better than ripgrep) parallel efficiency for recursive search.

I've worked over 25 years in the research areas of compilers and high performance. For example, the GNU compiler uses my research work (since GCC 4.0) on cross-iteration dependence testing for SIMD auto-vectorization (chains of recurrences aka scalar evolutions). ugrep doesn't leverage SIMD/AVX yet and I have a plan to go about doing just that.

For your T-9 benchmark, it seems like -J1 doesn't force ugrep to a single thread?

This is a strange result with respect to threaded search: if -J1 did not change your timings, this must mean that threading is not enabled and performance may suffer as a result. There must be something wrong with your build steps, otherwise I can't explain this. Check what the recursive search query says with --stats, which reports the usage stats to stdout, including the number of worker threads. The performance is certainly increased on all platforms I've tested (a range of Mac OS X, Linux, Windows). I've used clang 9, GCC 7/8, Visual Studio 2015/2017.

Your question about RE/flex: RE/flex is not a regex library, but a new lexer analyzer generator. Scanning is the forte of RE/flex, not searching. I added ugrep a few months ago as a relatively simple example and have been thinking since to improve it to make it "industrial strength" (whatever that means). That early version of ugrep with the older RE/flex engine was not optimized for searching yet. It is interesting (and fun coding) to optimize the engine for search as well, with a few tweaks and new methods/algorithms added over the last few weeks. However, It is still not fully optimized as of 1.5.0 as there is no SIMD/AVX usage among other things. You see, the main difference with scanning (aka continuous matching, tokenizing, lexical analysis) is that you'll get a match for every character in the file, i.e. there are no gaps to search ahead. Searching on the other hand greatly benefits from an optimized memchr, SIMD/AVX, Boyer-Moore, Aho-Corasick etc.

which is kind of bizarre to suggest

Not really. I have my own set of benchmarks that suggest otherwise

With respect to Boost.Regex versus PCRE: there is a noticeable performance difference when regex engines are used for scanning, if you use the regex API smartly (e.g. don't extract matches as strings that are heap-allocated every time). The source code files for performance testing are available for download and linked in the CodeProject article. Quite a few people have looked into this too. And yes, Boost.Regex is faster that PCRE for this type of task, if you do it right. And some regex engines are unexpectedly terrible at this type of task. I've used the most efficient ways I could think of to compare and test the regex APIs, e.g. using iterators with Boost.Regex. Everything is precompiled to test compiled NFA/DFA hybrids fairly.

Overall, I need a tool such as ugrep that is compatible and with results that are consistent with GNU/BSD grep. Judging from forums and blogs, if the options are not compatible to GNU/BSD grep, a new grep tool is pretty much dead on arrival. My main focus is performance, compatibility, and safety (priorities are not necessarily in that order).

@BurntSushi
Copy link

because I have room left to continue improving

Well... I mean I do too... Best we can do is benchmark what we've got right now.

it is best for the kind of searches that don't solely rely on single short word searches

These are, by far, the most common kind of searches that ripgrep services. So I would be very careful about trying to claim that ugrep is faster in many practical use cases. A single search term is the single most practical example of using ripgrep in its target audience.

It also has excellent (better) parallel efficiency for recursive search.

I haven't really seen any evidence to support this.

There must be something wrong with your build steps

I followed the build steps you gave me exactly. Running with --stats:

$ ~/clones/ugrep/ugrep -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -Oh,hpp,cpp --stats
[... snip ...]
Searched 70277 files with 8 threads in 24799 directories: found 51765 matching files

$ time ~/clones/ugrep/ugrep -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -Oh,hpp,cpp > /tmp/output

real    1.322
user    1.769
sys     1.870
maxmem  13 MB
faults  0

You see, the main difference with scanning (aka continuous matching, tokenizing, lexical analysis) is that you'll get a match for every character in the file, i.e. there are no gaps to search ahead. Searching on the other hand greatly benefits from an optimized memchr, SIMD/AVX, Boyer-Moore, Aho-Corasick etc.

Ah I see, I did not realize this was the main goal of ugrep. Sounds like they are probably designed for two very different use cases then. Might be good to mention that in the README.

With respect to Boost.Regex versus PCRE: there is a noticeable performance difference when regex engines are used for scanning, if you use the regex API smartly (e.g. don't extract matches as strings that are heap-allocated every time). The source code files for performance testing are available for download and linked in the CodeProject article. Quite a few people have looked into this too.

These are crucial details. Looking back at your README, I can kind of see how it reads that way now. But it would probably help to add this important context to your benchmark results. It would ideally be nice if you made it easy for others to reproduce your benchmarks in a scriptable fashion. Like I did with benchsuite. And notice how the output documents everything pretty precisely, and provides the raw data in addition to confirming that match counts are correct.

Judging from forums and blogs, if the options are not compatible to GNU/BSD grep, a new grep tool is pretty much dead on arrival. My main focus is performance, compatibility, and safety (priorities are not necessarily in that order).

ripgrep is mostly compatible with GNU/BSD grep. Certainly, it is not my primary goal, but I don't go out of my way to break compatibility unless there is compelling reason to do so. There is much more in common between ripgrep and grep than there are differences. Either way, clearly, ripgrep is not "dead on arrival." ;-)

My main focus is really on the user experience and performance. Of course, I care about safety too, which is at least partially why I built ripgrep in a language that is safe by default.

@genivia-inc
Copy link

It also has excellent (better) parallel efficiency for recursive search.

I haven't really seen any evidence to support this.

Elementary, my friend. Parallel computing 101. Sorry for thinking a bit fast as it's all in the README if you extrapolate the numbers. This is based on ripgrep 0.10.0 in the README results, but a quick check shows similar numbers for ripgrep 1.5.0.

Recall that for P processors (P = number of cores or threads) we have:

$ S_P = \frac{t_s}{t_P} $ (speedup)
$ E_P = \frac{S_P}{P} $ (parallel efficiency)

Based on T-8 and T-9, the following is observed:

  • ripgrep thread count P >= 6, because of ~ 627% utilization reported by tcsh time. Let's be very fair to ripgrep and say that P = 6 even though it is likely P > 6 to minimize t_P as demonstrable on my machine with a common architecture (i7 clang 9, Mac OS X 10.12.6). Sequential time with -j1 is 0.36 and the parallel time is 0.12. Plugging in P = 6 we get:
    $ S_P = 0.36/0.12 = 3 $ (T-9 over T-8)
    $ E_P = 3/6 = 1/2 = 50% $
    Meaning the 6 threads are running 50% of their time and otherwise waiting (doing nothing).

  • ugrep thread count P >= 2, because of ~ 276% utilization. For sake of appeasement, let's be less fair to ugrep and say that P = 3 are needed to minimize t_P. With this we get:
    $ S_P = 0.20/0.10 = 2 $ (T-9 over T-8)
    $ E_P = 2/3 = 67% $
    Meaning the 3 threads are running 67% of their time and otherwise waiting (doing nothing).

As you know, increasing P when the parallel execution time t_P does not decrease is not useful and as a consequence the parallel efficiency goes down. Increasing P to >= 6 for ripgrep still decreases parallel execution time to t_P = 0.12 compared to t_s = 0.36, so picking P = 6 is fair. For ugrep, t_s = 0.20 and t_P = 0.10 for P = 3.

Hence, parallel efficiency 67% > 50% for a common parallel search task (T-8). Ergo, ugrep has better parallel efficiency even when we plug in numbers that are biased in favor of ripgrep.

Furthermore, parallel work states how well an algorithm is parallelized (lower is better as it requires fewer CPU cycles as a sum total over P). With P * E_P = 3 for ripgrep this is higher than 2 for ugrep. That's the intuition in the back of my head. This got me a bit excited to spend a few more days trying some new things with ugrep, and make RE/flex a better engine for search AND a fast scanner that it already is.

There is nothing peculiar here about the setup I've test that is perfectly replicable. Future testing will show if this swings one way or another. But for now, for our early ugrep stable release, these numbers look pretty good IMHO.

It's just the SIMD/AVX stuff that is needed to search short words faster, which is not that exciting to work on. But, yes, it is important. I did not say that it wasn't.

@BurntSushi
Copy link

perfectly replicable

I haven't been able to reproduce it, which is what I meant by evidence. You can spare the condescending lectures in your future responses, thanks.

@rbnor
Copy link

rbnor commented Oct 14, 2019

Filed an issue on it, ugrep doesnt spawn more threads at all it seems, sucks as i was really eager to test this now! Hoping for a quick fix.

Nice write ups, i love to read it all as i learn a lot, but its a bit sad to talk about parallel computing 101 when the released code doesn't work. I suggest at bit more of a respectful tone, this is all about improving and combining power to really make the most of open source. But "it works here" so its all good right? Now that's coding 101.

So far none of you guys have been able to beat find piped to xargs and zgrep on a lot of compressed files in a lot of directories, so i suggest less theory and more coding as well as real world benchmarking. So instead of focusing on beating each other on some of these tests, try to beat unix on my complex real world test. Thats a challenge, a test none of you designed. Best of luck.

@BurntSushi
Copy link

more coding as well as real world benchmarking

I've done plenty of both. That alone should be pretty clear. No need to talk down.

@rbnor
Copy link

rbnor commented Oct 14, 2019

Im was pretty sure the this time its personal milestone/tag was a joke now im not. Jk, but loosen up, i wasnt talking anyone down, or at least didnt mean to. It should be pretty clear that im hear to learn and evolve and not talk you down. And yes that is pretty obvious that all of you have done plenty of both. Yet the discussion ends up being a lecture in parallel programing 101 Instead of more actual tests on the cases none of you guys win at yet. This is about improving after all, not fighting.

I was refering to the condecending tone of geniva inc and in a few other comments here, and i suggested coding and benchmarking instead of endless theoretical writeups when the code in question doesnt even work at the moment.

And in the end, why and the theory doesnt matter that much if the results doesnt show improvements or can be reproduced. It matters more to me that a program is faster, rather than the for that it should be, or why. Although i love to learn about that as well.

@BurntSushi
Copy link

The "why" and "theory" do matter. To build tools like this, it's usually necessary to dig into that stuff. And discussing why something should be faster is absolutely critical, because it's basically akin to asking questions and either validating them or invalidating them, which is an extremely educational process.

@rbnor
Copy link

rbnor commented Oct 14, 2019

Of course it does! That’s not what I meant exactly. And yes, it is. Discussion is what drives this forward, never said anything about that, baseless claims are not. To say that why something should be faster matters more than what actually is the case is just not right. My point was that claims that cannot be backed up is reduced from why something should be faster, to why someone thinks something should be faster and thats pretty useless, because the theory is then wrong or the reasoning off. Yes you learn and evolve thats all nice but claims such as my parallelism is better than yours and i have room for improvement is baseless claims until proven in a reproduceable manner.

@genivia-inc
Copy link

No disrespect intended. I taught 20+ years of computer science to thousands of students and have a habit of explaining, in particular on matters when theory and practice diverge and when practice confirms theory. CS is truly a science of tradeoffs. A zero sum game of algorithms/programs that typically trades increased parallelism and speed for increased energy use and/or increased memory use. There is rarely an absolute win-win as these days may be gone. And no, to me grep-like tools are not close to being that important, even more so because I fully expect these will suffer one undeniable fate: the current state-of-the-art algorithms and design choices only work under the unrealistic assumption that software and hardware do not evolve. There are too many variables that depend on external factors that you cannot control. Take GNU grep's changes over the years, for example. Clever string search and matching algorithms are no longer as effective as used to be 5 to 10 years ago. In 5 to 10 years to come, SW and HW will continue to evolve. Worse, contemporary speedup and performance studies can only offer a snapshot of (actionable) information that might be useful to figure out what currently works (not so) well, what is comparable, and what approach may work better perhaps. But performance comparisons are fraught with issues, not to mention a fate of irrelevance. I mean, how many research publications that are 5 years old or older with results that are solely based on empirical performance results are still relevant today? I'm just repeating what is generally discussed in our circles.

Filed an issue on it, ugrep doesnt spawn more threads at all it seems, sucks as i was really eager to test this now! Hoping for a quick fix.

I've tested on Mac (i7 4 core, SSD), CentOS (56 core, NFS), Raspbian (a slow RPi 3B, class 10 flash), and Windows (i5, SSD) but don't see an issue with threads not being spawned when more than one file is searched, nor did any of my colleagues. I can ask them to chime in if need be. In all for cases when more than one file is searched there is a speedup over a single thread with -J1. As such, I am a bit skeptical at this point to hear that "it does not work" without details to replicate the problem. As always, an issue cannot be fixed if it cannot be reliably reproduced, which means it may not be an issue in the first place.

The ugrep C++11 code is pretty transparent on this matter. The code checks for hardware threads with std::thread::hardware_concurrency (with a few tweaks, as it is reasonable to assume that HTT won't be as efficient given the IO bound nature of file search). If std::thread::hardware_concurrency returns zero, 2 worker threads are still spawn and that still gives a noticeable speedup on the platforms we've tested.

I've investigated the use of pipes to buffer and merge the output, but that slowed things down for common cases, at least in initial tests to investigate alternative approaches. In some cases you may get an improvement. However, this is a clear case of speed versus memory use as a tradeoff, and as memory use grows, so does a negative impact of memory hierarchy on performance, eventually.

Finally, ugrep is a small spinoff project that continues to evolve, and perhaps more importantly, it helps us with RE/flex development cycles. An updated version is in the works.

@rbnor
Copy link

rbnor commented Oct 14, 2019

I filed an issue to be able to provide all the needed info there. While its a somewhat IObound process it clearly wasnt limited by it as other ways benefited from parallelism. should be possible to do what find piped to xargs to zgrep does, just faster, which is interesting because all zgrep should do is zcat which invokes gzip -cd, and then pipe to grep. Testing with parallel vs xargs shows that the improvement lies with using xargs for parallelism, which is interesting. This approach doesnt use a lot of ram at all, and really shouldnt as that would be less effective the way i see it.

Are you sure you guys are compiling from master and not some not released code? I had to add a linking flag in the arguments for the compiler within the configure file(made a pull request to fix that), which makes me think that you guys perhaps have different code, or use another way of building it.

On ubuntu 16.04 this is what i get:

Genivia/ugrep#4

@genivia-inc
Copy link

Cool. Thanks for the details. Will take a look soon.

@genivia-inc
Copy link

Your tests ran into an incomplete implementation of decompression and buffering in ugrep, which slowed things down a lot as it decompressed the input byte-by-byte!

Also, your use of xargs -P8 to run parallel searches seems problematic as I get vfork errors running ripgrep 0.10.0 giving incorrect results. The same happens with other greps. Not sure if this is an OS resource limitation of xargs, which could be the case.

Here are some of the changes I've made in ugrep v1.5.4:

  • added task parallelism to decompress files in parallel to searching, initial tests show that this makes ugrep -z (decompress) as fast or faster than piping the output of gunzip to ugrep.
  • improved asynchronous IO implementation for faster parallel search results.
  • improved input buffering speed when reading standard input instead of files.

Because all lines of the input files match, your suggested tests are a significant outlier compared to the usual queries that return more sparse results.

I reran the tests you conducted, now showing parallel speedups and also show how much faster ugrep v1.5.4 is:

time ugrep -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.29 real         0.49 user         0.04 sys
 1000000

With one thread (J1) takes 0.49s instead of 0.29s showing a good speedup for two concurrent searches:

time ugrep -J1 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.49 real         0.48 user         0.02 sys
 1000000

Using xargs -P2 is about the same as running ugrep with two threads (2 for 2 files plus 2 for decompressing the files using task parallelism):

time ./findxargs.sh | wc -l
        0.31 real         0.50 user         0.05 sys
 1000000

where findxargs.sh contains the line find 2019-01-1 -name '2019-01-1.conn.[12].log.gz' -print0 | xargs -P2 -0 ugrep -z -F 10.0.0.1

Note: ugrep options -o, -c, -l, -q, and -N are much faster than without these for v1.5.4, because line-by-line reading and matching is not yet fully optimized in v1.5.4 (work in progress), for example with -o the time drops to only 0.16s:

time ugrep -zo -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.16 real         0.26 user         0.03 sys
 1000000

and with -c:

time ugrep -zc -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.07 real         0.15 user         0.01 sys
       2

Now the ugrep -zo time of 0.16s is faster than ripgrep's -zo time of 0.20s while both use threads:

time ripgrep -zo -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.20 real         0.27 user         0.09 sys
 1000000

The next version of ugrep has further speed improvements and additional features. Coming soon!!

@rbnor
Copy link

rbnor commented Nov 7, 2019

ugrep 1.56 is by far the fastest grep for searching compressed logs now. Impressive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants