Description
Report needed documentation
Users frequently choose a memory resource without a lot of guidance or insight on performance. RMM doesn't do enough to help users identify the "best" options. Some other issues like #1694, #2015, #2033 also illuminate this need.
We should encourage the use of the async memory resource (rmm.mr.CudaAsyncMemoryResource) by default (assuming no prior knowledge of the application's usage patterns). In general, the async MR will be one of the best choices for several reasons.
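For example, a minimal sketch of opting into the async MR as the default device resource (the exact setup will depend on the application):

```python
import rmm

# Route all RMM allocations on the current device through the
# driver-managed pool behind cudaMallocAsync.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Allocations made via RMM now use cudaMallocAsync/cudaFreeAsync.
buf = rmm.DeviceBuffer(size=1024)
```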
The async memory resource is backed by a driver-managed pool allocator (behind cudaMallocAsync, there is a default pool). This means that the driver is able to suballocate efficiently (one of the main performance benefits of using PoolMemoryResource) but it doesn't suffer from the same limitations as RMM's PoolMemoryResource. For example, the driver-managed pool can use virtual addressing to remap physical addresses and avoid problems with fragmentation.
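As a sketch, the driver pool can also be tuned at construction time (assuming your RMM version exposes the `initial_pool_size` and `release_threshold` arguments on `CudaAsyncMemoryResource`):

```python
import rmm

# Hypothetical tuning: seed the driver-managed pool with 1 GiB and keep up
# to 8 GiB cached across synchronization points instead of releasing it
# back to the system. Argument names are assumptions about the current
# CudaAsyncMemoryResource signature; check your RMM version's docs.
mr = rmm.mr.CudaAsyncMemoryResource(
    initial_pool_size=2**30,
    release_threshold=8 * 2**30,
)
rmm.mr.set_current_device_resource(mr)
```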
Another key benefit is that the driver-managed pool can be shared by multiple applications, including those not using RMM. To share an RMM pool memory resource, every application has to support RMM and be configured to use it. This is really important for use cases involving, for example, cuDF and PyTorch. PyTorch can be configured to use RMM but not all developers know to do this. That can result in developers partitioning GPU memory space between libraries (e.g. half for cuDF, half for PyTorch) rather than sharing it.
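For reference, this is roughly what that configuration looks like (a sketch; assumes a PyTorch version with pluggable-allocator support and RMM's torch allocator module):

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Make the async MR the default for RMM users such as cuDF.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Point PyTorch's CUDA allocator at RMM so both libraries draw from the
# same resource instead of partitioning the GPU between them.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```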
The RMM documentation may be leading people in the wrong direction for choosing defaults. The user guide discusses pool memory resources here, but doesn't give the advice to use CudaAsyncMemoryResource wherever possible.
Lines 140 to 141 in 163c84a:

```python
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
```
As a result, users may think that an RMM pool is needed to achieve good performance, when the driver-managed pool has similar performance and can be an even better choice for the reasons stated above.
Finally, the importance of choosing an async resource is growing as we see more multithreaded, multi-stream applications using RMM. See conversation below with @JigaoLuo for the motivating example for this issue, which observed "pipeline bubbles" until adopting the async MR.
tl;dr:
I just replaced CudaMemoryResource with CudaAsyncMemoryResource as the upstream in PoolMemoryResource, and I'm seeing less bubble time showing up as gaps in nsys. Thanks for the suggestion!
I also strongly believe an educational blog post or an update to the README explaining the different memory resources and typical use cases would be incredibly helpful for users navigating these choices.
Conversation
JLuo
Hi again, I (also) wanted to share a general KVIKIO performance observation, as well as a question at the end.
In my setup with pipelining, I perform multithreaded KVIKIO GDS reads with MB-level chunks (acting as a producer), and then consume those chunks for computation (acting as a consumer). I've noticed that midway through the read process, each KVIKIO read starts taking longer, eventually becoming unable to saturate SSD bandwidth.
I tried tuning KVIKIO (adding more threads and increasing I/O size) but it didn’t help.
My working assumption, which I've partially verified, is that the slowdown stems from memory pressure & contention in the RMM pool when free memory is not enough. As the producer performs I/O and the consumer holds intermediate results, memory availability drops, leading to increased latency on both ends.
As a question for tooling & profiling, I'd appreciate any suggestions on how to fully verify this. Since everything runs asynchronously, it's been hard to confirm.
(Another possibility for this I/O slowness—though still speculative—is contention on the memory copy engine, similar to cuDF issue rapidsai/cudf#15620.)
Bradley Dice
memory pressure & contention in the RMM pool when free memory is not enough
Can you expand on this idea a bit? If there's insufficient free memory, new allocations exceeding that would just fail. What kind of contention are you thinking might exist?
Can you also clarify what memory resource you're using? The CUDA async MR with a driver-managed pool? RMM's pool MR (with what base resource, CUDA [synchronous] MR or CUDA async MR?)
JLuo
Thanks. My code is in Python, and I’m using PoolMemoryResource with the default base.
I’ll take a closer look at the base types and continue investigating the performance issue. I haven’t done full profiling yet today.
Bradley Dice
Try just the async MR. rmm.mr.CudaAsyncMemoryResource: https://docs.rapids.ai/api/rmm/nightly/python_api/#rmm.mr.CudaAsyncMemoryResource
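Something like this (a rough sketch, not your exact code):

```python
import rmm

# Before: an RMM pool suballocating from plain cudaMalloc
# rmm.mr.set_current_device_resource(
#     rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
# )

# After: let the CUDA driver's pool (cudaMallocAsync) do the suballocation
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```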
JLuo
The performance issue seems to have resolved, though I still don’t know the root cause. Apologies for the noise—and thanks again for your support!
Bradley Dice
Okay! Good to hear. I am still interested in the performance you see with the async MR vs. the pool MR (with the default base). We might explore getting rid of RMM's built-in pool, because the CUDA driver's pool (working behind the scenes in the async MR) has several advantages: virtual addressing, better compatibility when multiple applications are using GPU memory, etc.
We think “choose async MR by default” is the direction we want to go for teaching users.
JLuo
Thanks, I’ll give it a try. Are there any blogs or benchmarks you’d recommend? I’m trying to understand which one might best fit my use case.
Bradley Dice
😄 I’m working on writing some blogs/docs if there seems to be consensus on good performance with async — and working with the driver team to address any issues that users observe.
We want async to be good for a wide range of use cases, and that’s generally what we observe across a range of benchmarks. The main choices for “default” I see in practice are async for most applications, and a managed memory pool with prefetch-on-allocate if you need to be able to handle larger-than-memory problems.
With multithreaded, multi-stream applications becoming more common, async (as opposed to sync) is really important.
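For the larger-than-memory case, a rough sketch of a managed pool with prefetch-on-allocate (assumes your RMM version provides rmm.mr.PrefetchResourceAdaptor):

```python
import rmm

# Managed (unified) memory allows oversubscribing device memory; the pool
# suballocates from it, and the prefetch adaptor migrates each new
# allocation to the GPU up front to avoid page-fault storms later.
# PrefetchResourceAdaptor availability is an assumption; check your RMM docs.
managed_pool = rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(rmm.mr.PrefetchResourceAdaptor(managed_pool))
```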
JLuo
Thanks, I’ll definitely read it once it’s out. Yes, I am using multithreading with PTDS.
Quick question about the “driver”—if you mean the CUDA driver, is there a specific version requirement? Due to setup and dependency constraints, I can only use up to version 12.8.
Bradley Dice
Generally no, I am not aware of performance being very sensitive to the choice of driver version.
There are some new features in the CUDA 13 driver (580+) but I don’t think that matters yet. We are going to be adding some CUDA 13 driver features to RMM that might improve managed memory performance but they haven’t been implemented yet. CUDA 12 users will not be impacted.
JLuo
Thanks.
I just replaced CudaMemoryResource with CudaAsyncMemoryResource as the upstream in PoolMemoryResource, and I'm seeing less bubble time showing up as gaps in nsys. Thanks for the suggestion!
I also strongly believe an educational blog post or an update to the README explaining the different memory resources and typical use cases would be incredibly helpful for users navigating these choices.