Description
Report needed documentation
Users frequently choose a memory resource without a lot of guidance or insight on performance. RMM doesn't do enough to help users identify the "best" options. Some other issues like #1694, #2015, #2033 also illuminate this need.
We should encourage the use of the async memory resource (rmm.mr.CudaAsyncMemoryResource) by default (assuming no prior knowledge of the application's usage patterns). In general, the async MR will be one of the best choices for several reasons.
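For example, a minimal sketch of opting into the async MR as the default device resource (the exact setup will depend on the application):

```python
import rmm

# Route all RMM allocations on the current device through the
# driver-managed pool behind cudaMallocAsync.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Allocations made via RMM now use cudaMallocAsync/cudaFreeAsync.
buf = rmm.DeviceBuffer(size=1024)
```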
The async memory resource is backed by a driver-managed pool allocator (behind cudaMallocAsync, there is a default pool). This means that the driver is able to suballocate efficiently (one of the main performance benefits of using PoolMemoryResource) but it doesn't suffer from the same limitations as RMM's PoolMemoryResource. For example, the driver-managed pool can use virtual addressing to remap physical addresses and avoid problems with fragmentation.
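As a sketch, the driver pool can also be tuned at construction time (assuming your RMM version exposes the `initial_pool_size` and `release_threshold` arguments on `CudaAsyncMemoryResource`):

```python
import rmm

# Hypothetical tuning: seed the driver-managed pool with 1 GiB and keep up
# to 8 GiB cached across synchronization points instead of releasing it
# back to the system. Argument names are assumptions about the current
# CudaAsyncMemoryResource signature; check your RMM version's docs.
mr = rmm.mr.CudaAsyncMemoryResource(
    initial_pool_size=2**30,
    release_threshold=8 * 2**30,
)
rmm.mr.set_current_device_resource(mr)
```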
Another key benefit is that the driver-managed pool can be shared by multiple applications, including those not using RMM. To share an RMM pool memory resource, every application has to support RMM and be configured to use it. This is really important for use cases involving, for example, cuDF and PyTorch. PyTorch can be configured to use RMM but not all developers know to do this. That can result in developers partitioning GPU memory space between libraries (e.g. half for cuDF, half for PyTorch) rather than sharing it.
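For reference, this is roughly what that configuration looks like (a sketch; assumes a PyTorch version with pluggable-allocator support and RMM's torch allocator module):

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Make the async MR the default for RMM users such as cuDF.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Point PyTorch's CUDA allocator at RMM so both libraries draw from the
# same resource instead of partitioning the GPU between them.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```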
The RMM documentation may be leading people in the wrong direction for choosing defaults. The user guide discusses pool memory resources here, but doesn't give the advice to use CudaAsyncMemoryResource wherever possible.
Lines 140 to 141 in 163c84a:

```python
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
```
As a result, users may think that an RMM pool is needed to achieve good performance, when the driver-managed pool has similar performance and can be an even better choice for the reasons stated above.
Finally, the importance of choosing an async resource is growing as we see more multithreaded, multi-stream applications using RMM. See conversation below with @JigaoLuo for the motivating example for this issue, which observed "pipeline bubbles" until adopting the async MR.
tl;dr:
I just replaced CudaMemoryResource with CudaAsyncMemoryResource as the upstream in PoolMemoryResource, and I'm seeing less bubble time showing up as gaps in nsys. Thanks for the suggestion!
I also strongly believe an educational blog post or an update to the README explaining the different memory resources and typical use cases would be incredibly helpful for users navigating these choices.
Conversation
JLuo
Hi again, I (also) wanted to share a general KVIKIO performance observation, as well as a question at the end.
In my setup with pipelining, I perform multithreaded KVIKIO GDS reads with MB-level chunks (acting as a producer), and then consume those chunks for computation (acting as a consumer). I've noticed that midway through the read process, each KVIKIO read starts taking longer, eventually becoming unable to saturate SSD bandwidth.
I tried tuning KVIKIO (adding more threads and increasing I/O size) but it didn’t help.
My working assumption, which I've partially verified, is that the slowdown stems from memory pressure & contention in the RMM pool when free memory is not enough. As the producer performs I/O and the consumer holds intermediate results, memory availability drops, leading to increased latency on both ends.
As a question for tooling & profiling, I'd appreciate any suggestions on how to fully verify this. Since everything runs asynchronously, it's been hard to confirm.
(Another possibility for this I/O slowness—though still speculative—is contention on the memory copy engine, similar to cuDF issue rapidsai/cudf#15620.)
Bradley Dice
memory pressure & contention in the RMM pool when free memory is not enough
Can you expand on this idea a bit? If there's insufficient free memory, new allocations exceeding that would just fail. What kind of contention are you thinking might exist?
Can you also clarify what memory resource you're using? The CUDA async MR with a driver-managed pool? RMM's pool MR (with what base resource, CUDA [synchronous] MR or CUDA async MR?)
JLuo
Thanks. My code is in Python, and I’m using PoolMemoryResource with the default base.
I’ll take a closer look at the base types and continue investigating the performance issue. I haven’t done full profiling yet today.
Bradley Dice
Try just the async MR. rmm.mr.CudaAsyncMemoryResource: https://docs.rapids.ai/api/rmm/nightly/python_api/#rmm.mr.CudaAsyncMemoryResource
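Something like this (a rough sketch, not your exact code):

```python
import rmm

# Before: an RMM pool suballocating from plain cudaMalloc
# rmm.mr.set_current_device_resource(
#     rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
# )

# After: let the CUDA driver's pool (cudaMallocAsync) do the suballocation
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```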
JLuo
The performance issue seems to have resolved, though I still don’t know the root cause. Apologies for the noise—and thanks again for your support!
Bradley Dice
Okay! Good to hear. I am still interested in the performance you see with the async MR vs. the pool MR (with the default base). We might explore getting rid of RMM's built-in pool, because the CUDA driver's pool (working behind the scenes in the async MR) has several advantages: virtual addressing, better compatibility when multiple applications are using GPU memory, etc.
We think “choose async MR by default” is the direction we want to go for teaching users.
JLuo
Thanks, I’ll give it a try. Are there any blogs or benchmarks you’d recommend? I’m trying to understand which one might best fit my use case.
Bradley Dice
😄 I’m working on writing some blogs/docs if there seems to be consensus on good performance with async — and working with the driver team to address any issues that users observe.
We want async to be good for a wide range of use cases, and that’s generally what we observe across a range of benchmarks. The main choices for “default” I see in practice are async for most applications, and a managed memory pool with prefetch-on-allocate if you need to be able to handle larger-than-memory problems.
With multithreaded, multi-stream applications becoming more common, async (as opposed to sync) is really important.
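For the larger-than-memory case, a rough sketch of a managed pool with prefetch-on-allocate (assumes your RMM version provides rmm.mr.PrefetchResourceAdaptor):

```python
import rmm

# Managed (unified) memory allows oversubscribing device memory; the pool
# suballocates from it, and the prefetch adaptor migrates each new
# allocation to the GPU up front to avoid page-fault storms later.
# PrefetchResourceAdaptor availability is an assumption; check your RMM docs.
managed_pool = rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(rmm.mr.PrefetchResourceAdaptor(managed_pool))
```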
JLuo
Thanks, I’ll definitely read it once it’s out. Yes, I am using multithreading with PTDS.
Quick question about the “driver”—if you mean the CUDA driver, is there a specific version requirement? Due to setup and dependency constraints, I can only use up to version 12.8.
Bradley Dice
Generally no, I am not aware of performance being very sensitive to the choice of driver version.
There are some new features in the CUDA 13 driver (580+) but I don’t think that matters yet. We are going to be adding some CUDA 13 driver features to RMM that might improve managed memory performance but they haven’t been implemented yet. CUDA 12 users will not be impacted.
JLuo
Thanks.
I just replaced CudaMemoryResource with CudaAsyncMemoryResource as the upstream in PoolMemoryResource, and I'm seeing less bubble time showing up as gaps in nsys. Thanks for the suggestion!
I also strongly believe an educational blog post or an update to the README explaining the different memory resources and typical use cases would be incredibly helpful for users navigating these choices.