Runtime differences single- vs multi-threaded host applications #1600

Closed · awelzel opened this issue Nov 24, 2023 · 3 comments

awelzel (Contributor) commented Nov 24, 2023

This is again a micro-benchmark, but I think an interesting observation. Relates to zeek/zeek#3379.

When running spicy-driver for micro-benchmarking, glibc runs in single-threaded mode and avoids atomic instructions for std::shared_ptr reference counting. Within multi-threaded applications like Zeek, std::shared_ptr usage is more expensive. After patching spicy-driver.cc to start a very short-lived thread, which permanently switches glibc into multi-threaded mode, the attached micro-benchmark runs ~6% slower on my system, apparently due to nothing more than the atomic instructions now used for std::shared_ptr.
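To make the effect easy to reproduce outside of Spicy, here is a minimal standalone sketch (mine, not part of this issue; it assumes Linux with glibc >= 2.32 for `__libc_single_threaded` and libstdc++) that times std::shared_ptr copies before and after a short-lived thread flips glibc into multi-threaded mode:

```cpp
// Minimal sketch: time shared_ptr copies in single- vs multi-threaded mode.
// Build with: g++ -O2 -pthread demo.cc
#include <sys/single_threaded.h> // __libc_single_threaded (glibc >= 2.32)

#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

static double time_copies(const std::shared_ptr<int>& p) {
    auto start = std::chrono::steady_clock::now();
    for ( int i = 0; i < 100'000'000; ++i ) {
        auto q = p; // copy bumps the use count: plain add vs. atomic RMW
        (void)q;
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    auto p = std::make_shared<int>(42);
    std::printf("single-threaded=%d: %.2fs\n", int(__libc_single_threaded), time_copies(p));

    // A short-lived thread is enough: glibc never switches back.
    std::thread([] {}).join();

    std::printf("single-threaded=%d: %.2fs\n", int(__libc_single_threaded), time_copies(p));
}
```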

Recording the spicy-driver run with perf record --call-graph dwarf, __gnu_cxx::__exchange_and_add is reported as the hottest function with ~5.8% of samples. In a zeek -r test with the QUIC analyzer, it shows up with ~3% of samples.
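For context on why that symbol only shows up in multi-threaded mode: libstdc++ routes std::shared_ptr's use-count updates through a runtime dispatch, roughly like the following (a paraphrased sketch of `<ext/atomicity.h>`, not verbatim source; the details vary by GCC/glibc version):

```cpp
// Paraphrased sketch of libstdc++'s ref-count dispatch; not verbatim source.
#include <sys/single_threaded.h> // glibc >= 2.32

using _Atomic_word = int;

static bool __is_single_threaded() { return __libc_single_threaded; }

static _Atomic_word __exchange_and_add_dispatch(_Atomic_word* mem, int val) {
    if ( __is_single_threaded() ) {
        _Atomic_word old = *mem; // plain, non-atomic read-modify-write
        *mem = old + val;
        return old;
    }
    // Multi-threaded: lock-prefixed atomic RMW. This is the
    // __gnu_cxx::__exchange_and_add that dominates the profile above.
    return __atomic_fetch_add(mem, val, __ATOMIC_ACQ_REL);
}
```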

I'm not sure there's anything to fix here unless removing std::shared_ptr is on the table, so I'm opening this mostly as an FYI.

(Testing was done with code from #1590)

```console
$ hyperfine -w 1 'cat test-data.txt | spicy-driver nested.hlto' 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
Benchmark 1: cat test-data.txt | spicy-driver nested.hlto
  Time (mean ± σ):     11.364 s ±  0.067 s    [User: 11.343 s, System: 0.038 s]
  Range (min … max):   11.255 s … 11.463 s    10 runs

Benchmark 2: cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto
  Time (mean ± σ):     12.011 s ±  0.140 s    [User: 11.982 s, System: 0.036 s]
  Range (min … max):   11.885 s … 12.357 s    10 runs

Summary
  'cat test-data.txt | spicy-driver nested.hlto' ran
    1.06 ± 0.01 times faster than 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
```
The grammar used for the benchmark:

```spicy
module Test;

type J = unit {
  x: bytes &size=1;
};

type K = unit {
  j: J;
};

type L = unit {
  k1: K;
  k2: K;
  k3: K;
  k4: K;
};

type M = unit {
  l1: L;
  l2: L;
  l3: L;
  l4: L;
};

public type N = unit {
  : M[] &eod;
};
```

Patch to spicy-driver.cc (the <sys/single_threaded.h> include is added here so __libc_single_threaded is declared; the rest is verbatim):

```diff
--- a/spicy/toolchain/bin/spicy-driver.cc
+++ b/spicy/toolchain/bin/spicy-driver.cc
@@ -2,8 +2,12 @@
 
 #include <getopt.h>
 
+#include <sys/single_threaded.h> // declares __libc_single_threaded (glibc >= 2.32)
+#include <cstdio>
+#include <cstdlib>
 #include <fstream>
 #include <iostream>
+#include <thread>
 
 #include <hilti/rt/libhilti.h>
 
@@ -289,6 +293,18 @@ int main(int argc, char** argv) {
                 if ( ! parser )
                     driver.fatalError(parser.error());
 
+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
+               if ( std::getenv("SPICY_THREAD") != nullptr ) {
+                       std::fprintf(stderr, "starting thread\n");
+                       std::thread t([]() {
+                               std::fprintf(stderr, "running in thread\n");
+                       });
+                       t.join();
+               }
+
+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
                 if ( auto x = driver.processInput(**parser, in, driver.opt_increment); ! x )
                     driver.fatalError(x.error());
             }
```

bbannier (Contributor) commented

Thanks for this issue @awelzel; this is a useful micro-benchmarking result. We use std::shared_ptr extensively in the runtime library, e.g., as the control blocks in views and in our safe iterators, which we copy heavily; we definitely want to avoid unneeded overhead there.
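To illustrate the pattern (a hypothetical sketch, not the actual hilti::rt types): a safe iterator that keeps its underlying data alive through a std::shared_ptr pays one use-count bump per copy, and parsing copies such iterators constantly:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of the pattern, not the actual hilti::rt code.
struct Chunk {
    std::vector<std::byte> data;
};

class SafeIterator {
public:
    explicit SafeIterator(std::shared_ptr<Chunk> chunk, size_t offset = 0)
        : _chunk(std::move(chunk)), _offset(offset) {}

    // Every copy bumps the control block's use count; with libstdc++ that's
    // a plain increment in single-threaded mode, a lock-prefixed atomic
    // otherwise. Copied once per field/lookahead, this adds up quickly.
    SafeIterator(const SafeIterator&) = default;

    std::byte operator*() const { return _chunk->data.at(_offset); }
    SafeIterator& operator++() { ++_offset; return *this; }

private:
    std::shared_ptr<Chunk> _chunk; // keeps the data alive, hence "safe"
    size_t _offset;
};
```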

I ran your reproducer with Clang and libc++ and observed no significant overhead there (presumably because libc++ lacks this single-threaded fast path, so both runs pay the atomic cost), but we do not want the overhead with GCC and libstdc++ either.

bbannier (Contributor) commented

Unassigning myself since at the moment there is no clear path forward that doesn't involve a lot of work, such as implementing a standard-library-quality smart pointer.
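For a sense of the direction being alluded to (my sketch only, nothing from the Spicy codebase): an intrusive pointer with a deliberately non-atomic count would avoid the lock-prefixed instructions, but it is only safe if the pointed-to objects never cross threads; establishing and enforcing that invariant is where the "lot of work" lies.

```cpp
#include <cstdint>
#include <utility>

// Sketch of one possible replacement: an intrusive, non-atomic refcount.
// Only safe if a given object is never shared across threads.
class RefCounted {
public:
    void ref() noexcept { ++_count; }
    void deref() noexcept {
        if ( --_count == 0 )
            delete this;
    }

protected:
    virtual ~RefCounted() = default;

private:
    uint64_t _count = 0; // plain integer: no atomic instructions, ever
};

template<typename T>
class IntrusivePtr {
public:
    explicit IntrusivePtr(T* p = nullptr) : _p(p) {
        if ( _p )
            _p->ref();
    }
    IntrusivePtr(const IntrusivePtr& o) : _p(o._p) {
        if ( _p )
            _p->ref();
    }
    IntrusivePtr(IntrusivePtr&& o) noexcept : _p(std::exchange(o._p, nullptr)) {}
    ~IntrusivePtr() {
        if ( _p )
            _p->deref();
    }
    IntrusivePtr& operator=(IntrusivePtr o) noexcept {
        std::swap(_p, o._p); // copy-and-swap handles self-assignment
        return *this;
    }

    T* operator->() const noexcept { return _p; }
    T& operator*() const noexcept { return *_p; }

private:
    T* _p;
};
```

boost::intrusive_ptr and Zeek's own IntrusivePtr follow the same general shape.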

bbannier removed their assignment Jan 22, 2024
rsmmr (Member) commented Jun 7, 2024

I don't see us changing the smart pointer any time soon, so I'm closing this for lack of a way forward.

rsmmr closed this as completed Jun 7, 2024