Replies: 9 comments 10 replies
-
Hi @interwq. DuckDB does not use mmap, so would it make sense to enable retain? I'm not sure I understand exactly what the setting does. We've recently had an issue with jemalloc on ARM64 because … In general, we like jemalloc, as it gives us less memory fragmentation and better allocation performance than glibc's allocator, but we struggle with portability. I've tried to make the code as portable as possible, with my limited understanding of the library. I haven't been able to get it to work for Windows (I've fiddled with the library quite a bit to make it compile as C++11, and couldn't get something with …). I've looked into Mozilla's mozjemalloc, as it has made …
-
Yes, it's just zstd-compressed Parquet files with a ~100-500 byte VARCHAR column. Total DB size was about 1.7 GB (compressed). Done on x86-64 Linux, kernel 5.19.0.
Highly recommend enabling the background thread. It's a key performance mechanism for jemalloc. https://dl.acm.org/doi/10.1145/2742580.2742807 talks about the history and rationale behind the background thread. Especially with tweaks like the improvement above, it makes a huge difference in malloc performance.
-
@lnkuiper thanks for sharing the insights! re: … this feature was in fact motivated by the … Like Ben mentioned, background_thread is recommended since it helps in pretty much all cases. In addition to the efficiency gains, it also limits worst-case behavior, e.g. it guarantees progress on dirty page purging even if the application threads all go to sleep. You can also control the number of threads used with the max_background_threads option if that's a concern. We'd be happy to suggest some other tuning options; given that you have a specific workload, it would also help if you could share the malloc_stats output from a typical run (e.g. using either opt.stats_interval or opt.stats_print).
-
@lnkuiper re: background thread, for context, that option can be enabled / disabled on the fly via … The malloc conf you shared looks very reasonable to me. The other option I want to mention is the heap profiling feature, which is a sample-based approach to memory usage profiling, in case you haven't tried it. About the portability issues: would you consider running your CIs on jemalloc's dev branch (https://github.com/jemalloc/jemalloc/tree/dev)? That way, when a regression is introduced, we can catch it earlier. How difficult / feasible is it for you guys to upgrade jemalloc? Probably no need to do it very frequently, but once every couple of years could be worth it. Let us know if there's anything we can do to make it easier for you too.
-
Hi @interwq, I've reworked (and updated) our jemalloc extension in #11891. Thanks again for reaching out to us. Our new defs file can be found here. I've also modified the config string:
I've attached two stats outputs, for two different queries of a benchmark. This is with background threads disabled. I can also re-run with background threads enabled if that gives us more information. Is this useful information? Do you know if there's anything to gain for DuckDB by changing the config?
-
This is looking much better:
It looks like we're no longer spending nearly as much time fadvising stuff away. There's still a bit of system CPU time, maybe related to having a lot of threads fault in memory for individual zstd buffers, but the majority of the time is in userspace, which is good. You might wish to optimize duckdb::Utf8Proc::Analyze; it came up as a top function both in the regex example above and in a simpler substring search query. https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ has a good background on the potential for optimization, and https://github.com/simdutf/simdutf appears to be a productionized implementation.
-
@lnkuiper Thanks for sharing the stats. They look reasonable overall. A few thoughts:
-
I discussed this a bit with @interwq. It looks like there was a missing "overcommit" option that caused excessive mmapping. I was able to enable it with …
I also noticed that there was still enough libc malloc usage to get a speedup by LD_PRELOAD'ing jemalloc. Just as a suggestion here: I think it generally makes sense to simply have the duckdb binary statically link jemalloc in the default configuration (no prefixes, taking over libc malloc). When duckdb is used as a shared library, I'd generally recommend simply using malloc and asking the user to choose a good implementation. While it's possible to embed jemalloc just for duckdb allocations, I think it's generally going to be easier to profile, tune, and debug when there's one allocator in a process.
-
Just wanted to give you a big thanks again @interwq and @bmaurer. You've been a great help in configuring jemalloc properly for DuckDB, and we've seen performance improvements across the board in many workloads, which means a lot to us. Thanks! ❤️
-
Hello, I'm a jemalloc dev. We noticed that the jemalloc retain option is disabled in duckdb:
duckdb/extension/jemalloc/jemalloc/include/jemalloc/internal/jemalloc_internal_defs.h, line 264 (commit f0c47c1)
Is there a particular reason, like 32-bit compatibility or concerns about virtual memory size? I'm wondering if there's anything we can improve on the jemalloc side as well.