Replies: 9 comments 10 replies
-
Hi @interwq. DuckDB does not use mmap, so would it make sense to enable retain? I'm not sure I understand exactly what the setting does. We've recently had an issue with jemalloc on ARM64 because … In general, we like jemalloc, as it gives us less memory fragmentation and better allocation performance than glibc's allocator, but we struggle with portability. I've tried to make the code as portable as possible, with my limited understanding of the library. I haven't been able to get it to work for Windows (I've fiddled with the library quite a bit to make it compile as C++11, and couldn't get something with …). I've looked into Mozilla's mozjemalloc, as it has made …
-
Yes, it's just zstd-compressed Parquet files with a ~100-500 byte VARCHAR column. Total DB size was about 1.7 GB (compressed). Done on x86-64 Linux, kernel 5.19.0.
Highly recommend enabling the background thread. It's a key performance mechanism for jemalloc. https://dl.acm.org/doi/10.1145/2742580.2742807 talks about the history and rationale behind the background thread. Especially with tweaks like the improvement above, it makes a huge difference in malloc performance.
-
@lnkuiper thanks for sharing the insights! re: … this feature was in fact motivated by the … Like Ben mentioned, background_thread is recommended since it helps in pretty much all cases. In addition to the efficiency gains, it also limits worst-case behavior, e.g. it guarantees progress on dirty page purging even if the application threads all go to sleep. You can also control the number of threads used with the max_background_threads option if that's a concern. We'd be happy to suggest some other tuning options; given that you have a specific workload, it would also help if you could share the malloc_stats output from a typical run (e.g. using either opt.stats_interval or opt.stats_print).
-
@lnkuiper re: background thread, for context, that option can be enabled / disabled on the fly via … The malloc conf you shared looks very reasonable to me. The other option I want to mention is the heap profiling feature, which is a sample-based approach to memory usage profiling, in case you haven't tried it. About the portability issues: would you consider running your CIs on jemalloc's dev branch (https://github.com/jemalloc/jemalloc/tree/dev)? That way, when a regression is introduced, we can catch it earlier. How difficult / feasible is it for you guys to upgrade jemalloc? Probably no need to do it very frequently, but once every couple of years could be worth it. Let us know if there's anything we can do to make it easier for you too.
-
Hi @interwq, I've reworked (and updated) our jemalloc extension in #11891. Thanks again for reaching out to us. Our new defs file can be found here. I've also modified the config string:
I've attached two stats outputs, for two different queries of a benchmark. This is with background threads disabled. I can also re-run with background threads enabled if that gives us more information. Is this useful information? Do you know if there's anything to gain for DuckDB by changing the config?
-
This is looking much better:
It looks like we're no longer spending nearly as much time fadvising stuff away. There's still a bit of system CPU time, maybe related to having a lot of threads fault in memory for individual zstd buffers, but the majority of the time is in userspace, which is good. You might wish to optimize duckdb::Utf8Proc::Analyze; it came up as a top function both in the regex example above and in a simpler substring search query. https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ has a good background on the potential for optimization, and https://github.com/simdutf/simdutf appears to be a productionized implementation.
-
@lnkuiper Thanks for sharing the stats. They look reasonable overall. A few thoughts:
-
I discussed this a bit with @interwq. It looks like there was a missing "overcommit" option that caused excessive mmapping. I was able to enable it with …
I also noticed that there was still enough libc malloc usage to get a speedup by LD_PRELOAD'ing jemalloc. Just as a suggestion here: I think it generally makes sense to simply have the duckdb binary statically link jemalloc in the default configuration (no prefixes, taking over libc malloc). When duckdb is used as a shared library, I'd generally recommend simply using malloc and asking the user to choose a good implementation. While it's possible to embed jemalloc just for duckdb allocations, I think it's generally going to be easier to profile, tune, and debug when there's one allocator in a process.
-
Just wanted to give you a big thanks again @interwq and @bmaurer. You've been a great help in configuring jemalloc properly for DuckDB, and we've seen performance improvements across the board in many workloads, which means a lot to us. Thanks! ❤️
-
Hello, I'm a jemalloc dev. We noticed that the jemalloc retain option is disabled in duckdb:
duckdb/extension/jemalloc/jemalloc/include/jemalloc/internal/jemalloc_internal_defs.h, line 264 (commit f0c47c1)
Is there a particular reason, like 32-bit compatibility or concerns about virtual memory size? I'm wondering if there's anything we can improve on the jemalloc side as well.