
ProcessBackgroundActions and max cache sizes tuning questions #260

Open
lano1106 opened this issue Aug 22, 2024 · 15 comments

Comments


lano1106 commented Aug 22, 2024

Hi,

  1. Must the thread calling tcmalloc::MallocExtension::ProcessBackgroundActions()
     be detached, or can it be joinable?

If joinable is OK, can you do the following at program teardown?

  tcmalloc::MallocExtension::SetBackgroundProcessActionsEnabled(false);
  bkgThread.join();

From a quick scan of MallocExtension_Internal_ProcessBackgroundActions(), I have concluded that it does more than just release memory...

Is this correct? I ask because I plan to keep BackgroundReleaseRate at its zero default value, to avoid generating madvise() system calls...

  2. tcmalloc::MallocExtension::SetBackgroundProcessSleepInterval()

The default value is 1 second.

What is a reasonable range for that parameter?

1 second - 1 hour?

What are the tradeoffs of a long interval vs. a short one?

I had 30 seconds in mind for my app.

Do the background housekeeping tasks induce lock contention for the threads using the allocator?

  3. Which stats should I look at to determine whether increasing the max cache sizes could have a positive effect?

The overflows/underflows ratio?
Anything else?


lano1106 commented Aug 25, 2024

I have been running the gperftools benchmark to compare glibc, gperftools and this project...
https://github.com/gperftools/gperftools/blob/master/benchmark/malloc_bench.cc

unless tweaking the params I have requested assistance with changes anything, there seems to be an obvious winner...

There must be something your project does better in some way, and I need your help to appreciate what makes your lib shine, because so far I am unconvinced...

[image: benchmark results]

@ckennelly (Collaborator) commented:

Assuming your timings are in nanoseconds, these all seem high for our fastpath on modern hardware in a microbenchmark. The main issue that likely springs to mind is glibc managing rseq registration and preventing TCMalloc's fastpath from being active.

All but two of our defaults in the GitHub repo reflect our production configuration defaults, as noted here.


lano1106 commented Aug 25, 2024

Hi Chris,

Yes, the times are in nanoseconds.

I have noted that your prod config differs. The attempt to mimic what you do is the motivation behind opening this issue.

I am aware of the possible rseq conflict with glibc, and I did launch the tcmalloc bench with:

GLIBC_TUNABLES=glibc.pthread.rseq=0 ./malloc_bench

but I can be even more explicit by testing tcmalloc::MallocExtension::PerCpuCachesActive() from main() before executing the tests...

I'll do that and report back:

int main(void)
{
  printf("Trying to randomize freelists..."); fflush(stdout);
  randomize_size_classes();
  printf("done. PerCpuCachesActive:%s\n",
         tcmalloc::MallocExtension::PerCpuCachesActive()?"true":"false");

Trying to randomize freelists...done. PerCpuCachesActive:true

The provided tcmalloc execution-time results are with the rseq per-CPU cache enabled...

on a sapphirerapids CPU running kernel 6.10.4... binaries were compiled with:
bazel build --features=-default_compile_flags --copt "-march=sapphirerapids" --copt "-O3" --copt "-flto" --copt "-std=c++26" --copt "-fno-rtti" //benchmark:malloc_bench

benchmark/BUILD:

cc_binary(
    name = "malloc_bench",
    srcs = [
        "malloc_bench.cc", "run_benchmark.cc", "run_benchmark.h"
    ],
    deps = [
        "@com_google_tcmalloc//tcmalloc:tcmalloc",
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

cc_binary(
    name = "pt2malloc_bench",
    srcs = [
        "pt2malloc_bench.cc", "run_benchmark.cc", "run_benchmark.h"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)


lano1106 commented Aug 25, 2024

I'll redo the test right after a reboot with nothing else running... just in case...
but to be fair, pt2malloc and gperftools were run in the exact same setup.

So I was comparing apples with apples in the first place...

I am just going to give tcmalloc every chance to shine...

For the record, the bench program does not appear to be multithreaded... My system has isolated NOHZ_FULL nodes... unless a thread/process is pinned to a particular core, all processes run on the same default CPU... but they can be context switched...

Update:
I might have a small linking glitch... I am investigating this possibility...

Update 2:
by running perf on the gperftools malloc_bench, I have discovered that it was using libc malloc despite linking with libtcmalloc_minimal.a (only the second time I rebuilt the project; the numbers provided earlier are valid).

This was caused by the --as-needed linker switch... I'll perform the same test with the google tcmalloc malloc_bench binary, but I think that one was not impacted, since I am not using the --as-needed switch in my bazel BUILD file; the switch was coming from a global setting on my system stored in /etc.

With -O3, -std=c++26 and LTO, my latest gperftools run is even faster than the first one!!!

Benchmark: bench_fastpath_throughput                        5.852132 nsec
Benchmark: bench_fastpath_throughput                        5.898217 nsec
Benchmark: bench_fastpath_throughput                        5.813867 nsec


lano1106 commented Aug 26, 2024

it seems like google tcmalloc is slowed down by:

+   23.01%    23.00%  malloc_bench  malloc_bench          [.] tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]
+    8.29%     8.29%  malloc_bench  malloc_bench          [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)

Those slowdowns do not appear to be present in the original gperftools version...

About the PageMap code... I am surprised that it shows up in the perf report at all... just maybe, the template plumbing prevents the compiler from inlining this code... I use gcc 14.2.1.

Since you have Chandler Carruth on your team, I assume that you use clang... It would be very interesting if you built malloc_bench and reported the numbers you get.

@lano1106 commented:

I have reproduced the benchmarks initially performed by Sanjay at
https://gperftools.github.io/gperftools/tcmalloc.html

Back in ~2006, when gperftools was published, its tcmalloc dominated glibc on every parameter. A lot of things have changed since; glibc has improved a lot.
[images: mops_vs_size, mops_vs_threads]

The takeaways I see when looking at the results are:

  1. glibc is the best allocator as soon as there are 4 or more concurrent accesses
  2. gperftools is the best option for very-low-contention scenarios and for bigger allocations
  3. google tcmalloc is unfortunately the slowest allocator in all categories...


dvyukov commented Aug 28, 2024

it seems like google tcmalloc is slowed down by:

  • 23.01% 23.00% malloc_bench malloc_bench [.]
    tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]
  • 8.29% 8.29% malloc_bench malloc_bench [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)

If you see these functions, you probably have a debug build.
IsNormalMemory is really a trivial one-line helper that's always inlined in a release build.

For a tight new/delete benchmark, you should see only 2 functions from tcmalloc: new and delete. These should not have a stack frame either.


lano1106 commented Aug 28, 2024

I provided the bazel build cmd line + the BUILD file used 2-3 entries ago...

Something I have added to the cmd line is --strip never, so that perf can show me the function names in its reports...

Besides that, I am not seeing how I could be using a debug build...

The compile options are: -O3 -g

About the stack frame: bazel adds a bunch of extra flags by default, including FORTIFY_SOURCE and -fno-omit-frame-pointer.

It is to avoid all those force-fed defaults that --features=-default_compile_flags is used...

You can quickly review what I do and tell me if there is something missing, but I don't think I have done anything that would give me a debug build...

ptmalloc/BUILD:

cc_binary(
    name = "t-test1",
    srcs = [
        "t-test1.c", "lran2.h", "t-test.h", "thread-st.h"
    ],
    deps = [
        "@com_google_tcmalloc//tcmalloc:tcmalloc",
    ],
    defines = [
        "USE_PTHREADS"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

cc_binary(
    name = "ptmalloc_test1",
    srcs = [
        "t-test1.c", "lran2.h", "t-test.h", "thread-st.h"
    ],
    defines = [
        "USE_PTHREADS"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

@ckennelly (Collaborator) commented:

Are you running the background thread? Are the timings after things have warmed up?

@lano1106 commented:

No background thread... see my questions in the first post... I am seeking help to correctly set up the background thread...

Sanjay's testing params are 1000000 iterations... plenty of time to warm up... besides, the conditions are exactly the same for every allocator...

I am not sure what we are doing here... I have the impression that vague, not-well-understood explanations are being thrown at the test/build params as plausibly responsible for the bad result, instead of searching for the root of the problem.

If there are mandatory settings needed for good performance on basic benchmarks, should they not be the default values, so that tcmalloc is at least on par with its predecessor?

You can reproduce what I have done... All I did is run your allocator with the gperftools testing programs... Give it a shot, and if you are able to get different and much better results than I did, please document how you achieved that...

ps.: t-test1.c was deleted from the project, but you can retrieve it by digging through the commits with git...


dvyukov commented Aug 28, 2024

It matters how the tcmalloc library itself is compiled. If you add -O3 to a cc_binary, it does not affect dependent libraries.
Also, -O3 should go into copts, not linkopts.
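As a sketch of this point (the targets are from the BUILD files shown above): a --copt on the bazel command line reaches every compile action in the build, including the tcmalloc dependency, whereas copts inside a cc_binary rule affects only that target's own sources:

```
# Reaches tcmalloc and every other dependency:
bazel build --copt "-O3" --copt "-march=sapphirerapids" //benchmark:malloc_bench

# Affects only malloc_bench's own sources (malloc_bench.cc, run_benchmark.cc),
# NOT @com_google_tcmalloc//tcmalloc:tcmalloc:
# cc_binary(name = "malloc_bench", copts = ["-O3"], ...)
```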


lano1106 commented Aug 28, 2024

-O3 is included for both the compiler and the linker. Passing -O3 to the linker is valid; from the ld man page:

       -O level
           If  level  is  a numeric values greater than zero ld optimizes the output.  This might take significantly longer and therefore probably
           should only be enabled for the final binary.  At the moment this option only affects ELF shared library generation.  Future releases of
           the linker may make more use of this option.  Also currently there is no difference in the linker's behaviour  for  different  non-zero
           values of this option.  Again this may change with future releases.

It is also added to the bazel build cmd line... Optimization, AFAIK, should be applied everywhere, on everything:

bazel build --features=-default_compile_flags --strip never --copt "-march=sapphirerapids" --copt "-O3" --copt "-g" --copt "-flto" --copt "-std=c++26" --copt "-fno-rtti" //benchmark:malloc_bench

@lano1106 commented:

it seems like google tcmalloc is slowed down by:

  • 23.01% 23.00% malloc_bench malloc_bench [.]
    tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]
  • 8.29% 8.29% malloc_bench malloc_bench [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)

If you see these functions, you probably do debug build. IsNormalMemory is really a trivial one line helper that's always inlined in release build.

For tight new/delete benchmark you should see only 2 functions from tcmalloc: new and delete. These should not have a stack frame as well.

This is to be confirmed, but I am pretty sure that profilers are able to tell which inlined function a sample falls in when it lands in the corresponding code section, as long as debug symbols are available...


lano1106 commented Aug 29, 2024

I have included jemalloc into the benchmark...

Something suspicious is that all allocators appear to suffer from lock contention except glibc... what could be the common denominator?

I have no clue how the thread_local keyword is implemented, but it could be this...

[image: mops_vs_threads_jemalloc]

update:
I have just checked glibc/malloc/malloc.c; glibc also uses TLS, so that is not it... maybe some false sharing when updating some global atomic stats variables...


lano1106 commented Sep 1, 2024

there was a glitch in my measurement methodology. Here are the updated graphs:
[images: mops_vs_size_v2, mops_vs_threads_jemalloc_v2]
