Runtime differences single- vs multi-threaded host applications #1600

Closed · awelzel opened this issue Nov 24, 2023 · 3 comments

awelzel (Contributor) commented Nov 24, 2023

This is again a micro-benchmark, but I think an interesting observation. Relates to zeek/zeek#3379.

When running spicy-driver for micro-benchmarking, glibc runs in single-threaded mode and avoids atomic instructions for std::shared_ptr reference counting. Within multi-threaded applications like Zeek, std::shared_ptr usage is more expensive. After patching spicy-driver.cc to start a very short-lived thread, which permanently switches glibc into multi-threaded mode, the attached micro-benchmark runs ~6% slower on my system, apparently due to nothing more than the atomic instructions now used for std::shared_ptr.
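To make the effect easy to reproduce outside of Spicy, here is a minimal standalone sketch (mine, not part of this issue; it assumes Linux with glibc >= 2.32 for `__libc_single_threaded` and libstdc++) that times std::shared_ptr copies before and after a short-lived thread flips glibc into multi-threaded mode:

```cpp
// Minimal sketch: time shared_ptr copies in single- vs multi-threaded mode.
// Build with: g++ -O2 -pthread demo.cc
#include <sys/single_threaded.h> // __libc_single_threaded (glibc >= 2.32)

#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

static double time_copies(const std::shared_ptr<int>& p) {
    auto start = std::chrono::steady_clock::now();
    for ( int i = 0; i < 100'000'000; ++i ) {
        auto q = p; // copy bumps the use count: plain add vs. atomic RMW
        (void)q;
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    auto p = std::make_shared<int>(42);
    std::printf("single-threaded=%d: %.2fs\n", int(__libc_single_threaded), time_copies(p));

    // A short-lived thread is enough: glibc never switches back.
    std::thread([] {}).join();

    std::printf("single-threaded=%d: %.2fs\n", int(__libc_single_threaded), time_copies(p));
}
```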

Recording the spicy-driver run with perf record --call-graph dwarf, __gnu_cxx::__exchange_and_add is reported as the hottest function with ~5.8% of samples. In a zeek -r test with the QUIC analyzer, it shows up with ~3% of samples.
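For context on why that symbol only shows up in multi-threaded mode: libstdc++ routes std::shared_ptr's use-count updates through a runtime dispatch, roughly like the following (a paraphrased sketch of `<ext/atomicity.h>`, not verbatim source; the details vary by GCC/glibc version):

```cpp
// Paraphrased sketch of libstdc++'s ref-count dispatch; not verbatim source.
#include <sys/single_threaded.h> // glibc >= 2.32

using _Atomic_word = int;

static bool __is_single_threaded() { return __libc_single_threaded; }

static _Atomic_word __exchange_and_add_dispatch(_Atomic_word* mem, int val) {
    if ( __is_single_threaded() ) {
        _Atomic_word old = *mem; // plain, non-atomic read-modify-write
        *mem = old + val;
        return old;
    }
    // Multi-threaded: lock-prefixed atomic RMW. This is the
    // __gnu_cxx::__exchange_and_add that dominates the profile above.
    return __atomic_fetch_add(mem, val, __ATOMIC_ACQ_REL);
}
```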

I'm not sure there's anything to fix here unless removing std::shared_ptr is on the table, so I'm opening this mostly as an FYI.

(Testing was done with code from #1590)

```console
$ hyperfine -w 1 'cat test-data.txt | spicy-driver nested.hlto' 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
Benchmark 1: cat test-data.txt | spicy-driver nested.hlto
  Time (mean ± σ):     11.364 s ±  0.067 s    [User: 11.343 s, System: 0.038 s]
  Range (min … max):   11.255 s … 11.463 s    10 runs

Benchmark 2: cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto
  Time (mean ± σ):     12.011 s ±  0.140 s    [User: 11.982 s, System: 0.036 s]
  Range (min … max):   11.885 s … 12.357 s    10 runs

Summary
  'cat test-data.txt | spicy-driver nested.hlto' ran
    1.06 ± 0.01 times faster than 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
```
The grammar used for the benchmark:

```spicy
module Test;

type J = unit {
  x: bytes &size=1;
};

type K = unit {
  j: J;
};

type L = unit {
  k1: K;
  k2: K;
  k3: K;
  k4: K;
};

type M = unit {
  l1: L;
  l2: L;
  l3: L;
  l4: L;
};

public type N = unit {
  : M[] &eod;
};
```

Patch to spicy-driver.cc (the <sys/single_threaded.h> include is added here so __libc_single_threaded is declared; the rest is verbatim):

```diff
--- a/spicy/toolchain/bin/spicy-driver.cc
+++ b/spicy/toolchain/bin/spicy-driver.cc
@@ -2,8 +2,12 @@
 
 #include <getopt.h>
 
+#include <sys/single_threaded.h> // declares __libc_single_threaded (glibc >= 2.32)
+#include <cstdio>
+#include <cstdlib>
 #include <fstream>
 #include <iostream>
+#include <thread>
 
 #include <hilti/rt/libhilti.h>
 
@@ -289,6 +293,18 @@ int main(int argc, char** argv) {
                 if ( ! parser )
                     driver.fatalError(parser.error());
 
+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
+               if ( std::getenv("SPICY_THREAD") != nullptr ) {
+                       std::fprintf(stderr, "starting thread\n");
+                       std::thread t([]() {
+                               std::fprintf(stderr, "running in thread\n");
+                       });
+                       t.join();
+               }
+
+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
                 if ( auto x = driver.processInput(**parser, in, driver.opt_increment); ! x )
                     driver.fatalError(x.error());
             }
```

bbannier (Contributor) commented

Thanks for this issue @awelzel; this is a useful micro-benchmarking result. We use std::shared_ptr extensively in the runtime library, e.g., as the control blocks in views and in our safe iterators, which we copy heavily; we definitely want to avoid unneeded overhead there.
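To illustrate the pattern (a hypothetical sketch, not the actual hilti::rt types): a safe iterator that keeps its underlying data alive through a std::shared_ptr pays one use-count bump per copy, and parsing copies such iterators constantly:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of the pattern, not the actual hilti::rt code.
struct Chunk {
    std::vector<std::byte> data;
};

class SafeIterator {
public:
    explicit SafeIterator(std::shared_ptr<Chunk> chunk, size_t offset = 0)
        : _chunk(std::move(chunk)), _offset(offset) {}

    // Every copy bumps the control block's use count; with libstdc++ that's
    // a plain increment in single-threaded mode, a lock-prefixed atomic
    // otherwise. Copied once per field/lookahead, this adds up quickly.
    SafeIterator(const SafeIterator&) = default;

    std::byte operator*() const { return _chunk->data.at(_offset); }
    SafeIterator& operator++() { ++_offset; return *this; }

private:
    std::shared_ptr<Chunk> _chunk; // keeps the data alive, hence "safe"
    size_t _offset;
};
```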

I ran your reproducer with Clang and libc++ and observed no significant overhead there (presumably because libc++ lacks this single-threaded fast path, so both runs pay the atomic cost), but we do not want the overhead with GCC and libstdc++ either.

bbannier (Contributor) commented

Unassigning myself since at the moment there is no clear path forward that doesn't involve a lot of work, such as implementing a standard-library-quality smart pointer.
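For a sense of the direction being alluded to (my sketch only, nothing from the Spicy codebase): an intrusive pointer with a deliberately non-atomic count would avoid the lock-prefixed instructions, but it is only safe if the pointed-to objects never cross threads; establishing and enforcing that invariant is where the "lot of work" lies.

```cpp
#include <cstdint>
#include <utility>

// Sketch of one possible replacement: an intrusive, non-atomic refcount.
// Only safe if a given object is never shared across threads.
class RefCounted {
public:
    void ref() noexcept { ++_count; }
    void deref() noexcept {
        if ( --_count == 0 )
            delete this;
    }

protected:
    virtual ~RefCounted() = default;

private:
    uint64_t _count = 0; // plain integer: no atomic instructions, ever
};

template<typename T>
class IntrusivePtr {
public:
    explicit IntrusivePtr(T* p = nullptr) : _p(p) {
        if ( _p )
            _p->ref();
    }
    IntrusivePtr(const IntrusivePtr& o) : _p(o._p) {
        if ( _p )
            _p->ref();
    }
    IntrusivePtr(IntrusivePtr&& o) noexcept : _p(std::exchange(o._p, nullptr)) {}
    ~IntrusivePtr() {
        if ( _p )
            _p->deref();
    }
    IntrusivePtr& operator=(IntrusivePtr o) noexcept {
        std::swap(_p, o._p); // copy-and-swap handles self-assignment
        return *this;
    }

    T* operator->() const noexcept { return _p; }
    T& operator*() const noexcept { return *_p; }

private:
    T* _p;
};
```

boost::intrusive_ptr and Zeek's own IntrusivePtr follow the same general shape.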

bbannier removed their assignment Jan 22, 2024
rsmmr (Member) commented Jun 7, 2024

I don't see us changing the smart pointer any time soon, so I'm closing this for lack of a way forward.

rsmmr closed this as completed Jun 7, 2024