Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace uses of std::shared_ptr with shared ptr without atomic refcounts #1601

Closed
wants to merge 2 commits into from

Conversation

bbannier
Copy link
Contributor

Closes #1600.

@bbannier bbannier self-assigned this Nov 27, 2023
@awelzel
Copy link
Contributor

awelzel commented Nov 27, 2023

The microbenchmark from #1600 doesn't show the single vs multi-threading difference anymore. This branch appears to run a tiny bit slower than main when single-threaded though. I haven't looked why, maybe std::shared_ptr is overall faster than the boost version even when atomic instructions are in-use?

$ hyperfine -w 1 '/opt/zeek-dev-prod/bin/spicy-driver -f test-data.txt nested-prev.hlto' 'spicy-driver -f test-data.txt nested.hlto'
Benchmark 1: /opt/zeek-dev-prod/bin/spicy-driver -f test-data.txt nested-prev.hlto
  Time (mean ± σ):     12.137 s ±  0.209 s    [User: 12.112 s, System: 0.010 s]
  Range (min … max):   11.996 s … 12.704 s    10 runs
 
Benchmark 2: spicy-driver -f test-data.txt nested.hlto
  Time (mean ± σ):     12.360 s ±  0.158 s    [User: 12.347 s, System: 0.011 s]
  Range (min … max):   12.134 s … 12.529 s    10 runs
 
Summary
  '/opt/zeek-dev-prod/bin/spicy-driver -f test-data.txt nested-prev.hlto' ran
    1.02 ± 0.02 times faster than 'spicy-driver -f test-data.txt nested.hlto'

/opt/zeek-dev-prod/bin/spicy-driver being current main.

We heavily use `std::shared_ptr` in the runtime library in types which
we expect to be cheap to copy, e.g., in views or safe iterators. By
design `std::shared_ptr` has to use atomic ints for its strong and weak
counts. Since parsing is single threaded the use of atomics can
introduce overhead without any need.

This patch replaces uses of `std::shared_ptr` in the runtime library
with a shared_ptr without atomic refcounts. Since we still need weak
pointers this has to be a slightly more complete replacement.
This uses our shared pointer with non-atomic refcounts from the runtime
library elsewhere in Spicy, noticeably in the compiler. This might
provide slight performance improvements on platforms where
`std::shared_ptr`'s atomic refcounts made it slightly more expensive to
copy.
@bbannier
Copy link
Contributor Author

I benchmarked this PR against a huge real world Zeek analyzer we are developing at Corelight. In summary, it seems doubtful that this PR provides a clear benefit when running real analyzers in Zeek.

This branch vs. main

Before (main):

$ hyperfine -w 1 'zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts'
Benchmark 1: zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts
  Time (mean ± σ):     19.428 s ±  0.419 s    [User: 20.509 s, System: 1.326 s]
  Range (min … max):   18.884 s … 20.448 s    10 runs

After (this PR):

$ hyperfine -w 1 'zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts'
Benchmark 1: zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts
  Time (mean ± σ):     19.223 s ±  0.371 s    [User: 20.314 s, System: 1.310 s]
  Range (min … max):   18.774 s … 19.844 s    10 runs

I want to see that the mean runtime of this particular benchmark seems to improve, but the difference here was not significant.

This branch on top of #1590

Among other things in #1590 we also removed some unneeded copying, so we might expect to see different behavior of this PR since e.g., overall copying of Bytes instances might be reduced.

Before (#1590):

$ hyperfine -w 1 'zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts'
Benchmark 1: zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts
  Time (mean ± σ):     17.308 s ±  0.299 s    [User: 18.322 s, System: 1.331 s]
  Range (min … max):   16.860 s … 17.813 s    10 runs

After (this PR on top of #1590):

$ hyperfine -w 1 'zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts'
Benchmark 1: zeek -Cr large.pcap ../build/spicy-modules/*.hlto ../analyzer/scripts
  Time (mean ± σ):     17.578 s ±  0.335 s    [User: 18.607 s, System: 1.297 s]
  Range (min … max):   17.120 s … 18.134 s    10 runs

The trend of the means seems to be reversed here, i.e., the new "improved" smart pointer implementation seems to introduce new overhead, but the result again is not significant.

@bbannier
Copy link
Contributor Author

We won't move forward with this ATM.

@bbannier bbannier closed this Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Runtime differences single- vs multi-threaded host applications
2 participants