ProcessBackgroundActions and max cache sizes tuning questions #260
I have been running the gperftools benchmark to compare glibc, gperftools and this project... unless tweaking the params that I have requested assistance with changes anything, it seems like there is an obvious winner... there must be something that your project does better in some way, and I need your help to appreciate what makes your lib shine, because so far I am unconvinced...
Assuming your timings are in nanoseconds, these all seem high for our fastpath on modern hardware in a microbenchmark. The main issue that springs to mind is glibc registering rseq itself, which conflicts with our per-CPU caches. All but two of our defaults in the GitHub repo reflect our production configuration defaults, as noted here.
Hi Chris, yes the times are in nanoseconds. I have noted that your prod config differs; the attempt to mimic what you do is the motivation behind opening this issue. As you know, I am perfectly aware of the possible rseq conflict with glibc. I did launch the tcmalloc bench with:
but I can be even more explicit by testing tcmalloc::MallocExtension::PerCpuCachesActive() from main() before executing the tests... I'll do that and report back:
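Something along these lines (my own quick sketch, not part of malloc_bench itself):

```cpp
// Sanity-check sketch (mine): refuse to run the benchmark if the rseq/per-CPU
// caches did not come up, so the numbers cannot silently fall back to the
// legacy per-thread cache mode.
#include <cstdio>
#include <cstdlib>

#include "tcmalloc/malloc_extension.h"

int main() {
  if (!tcmalloc::MallocExtension::PerCpuCachesActive()) {
    std::fprintf(stderr, "per-CPU caches are NOT active, aborting benchmark\n");
    return EXIT_FAILURE;
  }
  // ... run the gperftools benchmark loops from here ...
  return EXIT_SUCCESS;
}
```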
The provided tcmalloc execution time results are with the rseq per-CPU cache enabled... on a Sapphire Rapids CPU running kernel 6.10.4... binaries have been compiled with: benchmark/BUILD:
I'll redo the test just after a reboot with nothing else running... just in case... So I was comparing apples with apples in the first place... I am just going to stack all the odds in tcmalloc's favor... For the record, the bench program does not appear to be multithreaded... My system has isolated NOHZ_FULL nodes... unless a thread/process is pinned on a particular core, all processes run on the same default CPU, but they can be context switched...

Update:

Update 2: This was caused by the --as-needed linker switch... I'll perform the same test with the google tcmalloc malloc bench binary, but I think that one was not impacted since I am not using the --as-needed switch in my bazel BUILD file... this switch was coming from a global setting on my system stored in /etc. With -O3, std=c++26 and LTO, my last gperftools run is even faster than the first one!
it seems like google tcmalloc is slowed down by:
Those slowdowns do not appear to be present in the original gperftools version... About the PageMap code: I am surprised that it shows up in the perf report at all... maybe the template plumbing prevents the compiler from inlining this code... I use gcc 14.2.1. Since you have Chandler Carruth on your team, I assume that you use clang... It would be very interesting if you built malloc_bench yourselves and reported the numbers you get.
I have reproduced the benchmarks initially performed by Sanjay back in ~2006 when gperftools was published; at that time its tcmalloc was dominating glibc with every parameter set. A lot of things have changed since then; glibc has improved a lot. The takeaways that I see when looking at the results are:
If you see these functions, you probably have a debug build. For a tight new/delete benchmark you should see only two functions from tcmalloc: operator new and operator delete. These should not have a stack frame either.
I provided the bazel build cmd line + the BUILD file used 2-3 entries ago... Something that I have added on the cmd line is: ... beside that, I am not seeing how I could be using a debug build... the compile options are: -O3 -g. About the stack frame: bazel adds a bunch of flags by default, including: ... this is to not have all that force-fed stuff, so that you can quickly review what I do and tell me if there is something missing; but I don't think that I have done anything that would give me a debug build... ptmalloc/BUILD:
Are you running the background thread? Are the timings after things have warmed up?
No background thread... see my questions in the first post... I am seeking help to correctly set up the background thread... Sanjay's testing params are 1000000 iterations... plenty of time to warm up... besides, the conditions are exactly the same for every allocator... I am not sure what we are doing here... I have the impression that some shady, not-well-understood explanations are being thrown at the test/build params as being plausibly responsible for the bad result, instead of searching for the root of the problem. If there are mandatory settings needed for good performance on basic benchmarks, should they not be the default values, so that tcmalloc is at least on par with its predecessor? You can reproduce what I have done... all I did is run your allocator with the gperftools testing programs... Give it a shot, and if you are able to get different and much better results than I did, please document how you achieved that... ps.: t-test1.c got deleted from the project, but you can retrieve it by digging through the commits with git...
It matters how the tcmalloc library is compiled. If you add -O3 for a cc_binary, it does not affect dependent libraries.
-O3 is included for both the compiler and the linker. Passing -O3 to the linker is valid; from the ld man page:
It is also added to the bazel build cmd line... Optimization, AFAIK, should be applied everywhere, on everything:
This is to be confirmed, but I am pretty sure that profilers are able to tell which inlined function a sample falls in when it lands in the corresponding code section, as long as the debug symbols are available...
I have included jemalloc in the benchmark... something suspicious is that all allocators appear to suffer from lock contention except glibc... what could be the common denominator? I have no clue how the thread_local keyword is implemented, but this could be the culprit... update:
Hi,
Must the thread running ProcessBackgroundActions() be detached, or joinable?
If joinable is OK, can you do the following at program teardown:
By quickly scanning MallocExtension_Internal_ProcessBackgroundActions(), I have concluded that it does more than just release memory...
Is this correct?
I ask because I am planning to keep BackgroundReleaseRate at its zero default value, to avoid generating madvise() system calls...
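For reference, a minimal sketch of the setup I currently have in mind (detached background thread, release rate left at zero); the names come from malloc_extension.h, please correct me if I am misreading them:

```cpp
#include <thread>

#include "tcmalloc/malloc_extension.h"

int main() {
  // Keep the release rate at 0 bytes/second so the background thread does not
  // issue madvise() calls on my behalf (this should already be the OSS default).
  tcmalloc::MallocExtension::SetBackgroundReleaseRate(
      tcmalloc::MallocExtension::BytesPerSecond{0});

  // As far as I can tell, ProcessBackgroundActions() loops until background
  // actions are disabled, so a detached thread looks like the simplest option;
  // hence my detached-vs-joinable question above.
  std::thread bg(tcmalloc::MallocExtension::ProcessBackgroundActions);
  bg.detach();

  // ... application work ...
  return 0;
}
```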
The interval between background runs has a default value of 1 second.
What is a reasonable range for that parameter?
1 second to 1 hour?
What are the tradeoffs of a long interval vs. a short interval?
I had 30 seconds in mind for my app; a sketch of how I would set that is below.
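Assuming the knob I am looking at is MallocExtension::SetBackgroundProcessSleepInterval() (my reading of malloc_extension.h, please correct me if the interval is controlled elsewhere), the plan would simply be:

```cpp
#include "absl/time/time.h"
#include "tcmalloc/malloc_extension.h"

void ConfigureTcmallocBackground() {
  // Assumption on my side: this is the interval ProcessBackgroundActions()
  // sleeps between housekeeping passes (default 1 second).
  tcmalloc::MallocExtension::SetBackgroundProcessSleepInterval(
      absl::Seconds(30));
}
```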
Do the background housekeeping tasks induce lock contention for the threads using the allocator?
The overflows / underflows ratio?
Anything else? (A sketch of how I was planning to read those counters is below.)
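In case it helps frame the question, this is how I was planning to watch things while tuning the max cache sizes: just periodically dump the human-readable stats, assuming the per-CPU cache overflow/underflow counters show up in that report.

```cpp
#include <iostream>
#include <string>

#include "tcmalloc/malloc_extension.h"

// Periodic dump I intend to run while tuning the max cache sizes; I am
// assuming the per-CPU cache overflow/underflow counters appear in this
// human-readable report.
void DumpTcmallocStats() {
  std::string stats = tcmalloc::MallocExtension::GetStats();
  std::cout << stats << std::endl;
}
```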