Conversation

@JigaoLuo JigaoLuo commented Jul 23, 2025

Description

This is an initial draft for issues #1959 and #1955 that adds a host memory resource for a pinned bounce buffer to rmm::device_buffer.

We discussed this in the previous PR draft #1985, and even earlier in cuDF (rapidsai/cudf#19119). This version moves the implementation into rmm::device_buffer.

I’m aware this draft is still far from ready to merge, but I believe it’s a good starting point, and I’d really appreciate your feedback as I continue refining it. I hope it sparks discussion (again 😃). Happy to revisit it if changes are needed; just aiming to get the conversation started.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JigaoLuo JigaoLuo requested a review from a team as a code owner July 23, 2025 07:06
@JigaoLuo JigaoLuo requested review from vyasr and bdice July 23, 2025 07:06
copy-pr-bot bot commented Jul 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@JigaoLuo JigaoLuo marked this pull request as draft July 23, 2025 07:07
_device{other._device}
_device{other._device},
_host_mr{other._host_mr},
_host_bounce_buffer{other._host_bounce_buffer}
Contributor Author

@JigaoLuo JigaoLuo Jul 23, 2025

To Line 90: I’ll start with a simple discussion point. I believe the bounce buffer is stateless: it doesn’t store anything and serves only as a temporary transfer buffer. If you agree, I’d prefer that it not be copied or moved, to reflect its transient nature.

It doesn't store anything, but it is an allocation that's a part of the object's state.
When we copy an object, I assume we want the new one to have the equivalent bounce buffer. So, we don't need to copy, but still need to allocate a new one.
When we move, reusing the old allocation should be the cheapest option, even though we don't care about the content.
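A minimal C++ sketch of that policy, using `std::unique_ptr` as a stand-in for the pinned allocation (the `bounce_buffer` type here is hypothetical, not RMM's actual implementation):

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for a pinned host allocation, illustrating the
// copy/move policy discussed above; not RMM's actual types.
class bounce_buffer {
 public:
  explicit bounce_buffer(std::size_t size)
    : size_{size}, data_{std::make_unique<char[]>(size)} {}

  // Copy: allocate a fresh buffer of equal size, but do NOT copy the
  // contents -- the buffer is only a transient staging area.
  bounce_buffer(bounce_buffer const& other)
    : size_{other.size_}, data_{std::make_unique<char[]>(other.size_)} {}

  // Move: steal the existing allocation, the cheapest option since we
  // don't care about the content.
  bounce_buffer(bounce_buffer&& other) noexcept
    : size_{std::exchange(other.size_, 0)}, data_{std::move(other.data_)} {}

  std::size_t size() const noexcept { return size_; }
  char* data() noexcept { return data_.get(); }

 private:
  std::size_t size_;
  std::unique_ptr<char[]> data_;
};
```

In a real device_buffer the allocation would of course come from the host MR rather than `new[]`; the sketch only captures the "copy allocates, move steals" semantics.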

_mr = other._mr;
_device = other._device;
_host_mr = other._host_mr;
_host_bounce_buffer = other._host_bounce_buffer;
Contributor Author

@JigaoLuo JigaoLuo Jul 23, 2025

To Line 114: Same reasoning as my earlier point about statelessness.

Contributor Author

Feel free to skip reviewing the unit tests for now.

@harrism
Member

harrism commented Jul 23, 2025

I think that it's important to document the increased memory usage that happens if the user passes a host_mr. It effectively doubles -- half on device_mr and half on host_mr. And I think the need for this is variable. For example, if this buffer backs a device_uvector, and you use it for an output and then do a reduction on it in place, you may only want to read a single element back to the host. So it's overkill to allocate host memory for the whole thing.

So I'm wondering if the host allocation should be lazy -- only allocate what is needed for a copy when it is needed? Performance will rely on a fast host_mr in that case.

This may need more experimentation.
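A hedged sketch of what lazy, copy-scoped allocation could look like (plain C++ with illustrative names; a real version would allocate from the user's host_mr):

```cpp
#include <cstddef>
#include <memory>

// Illustrative sketch of the lazy-allocation idea above: allocate the host
// staging buffer only when a copy is requested, sized to that copy rather
// than to the whole device allocation. Names are not RMM API.
class lazy_bounce_buffer {
 public:
  // Called at copy time: (re)allocate only if the current buffer is too
  // small for this copy. As noted above, performance then relies on a
  // fast host memory resource.
  char* ensure(std::size_t bytes) {
    if (bytes > capacity_) {
      data_     = std::make_unique<char[]>(bytes);
      capacity_ = bytes;
    }
    return data_.get();
  }

  std::size_t capacity() const noexcept { return capacity_; }

 private:
  std::size_t capacity_{0};  // stays 0 until a copy actually happens
  std::unique_ptr<char[]> data_;
};
```

With this shape, reading a single element back after an in-place reduction would allocate only a few bytes instead of mirroring the whole device buffer.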

@JigaoLuo
Contributor Author

JigaoLuo commented Jul 27, 2025

Hi @harrism, thanks for your thoughtful comments, and apologies for the delayed reply; I’ve been tied up with teaching duties over the past few days. You raised a few key points:

  • Point 1. Host memory allocation size, even with flexible sizing in advance
  • Point 2. Lazy allocation scoped to copy operations, with later deallocation

Let me share some initial thoughts:

On point 0 (as the starting point)

Regarding performance improvements, I’ve been working on a PR draft with benchmarking in cuDF: rapidsai/cudf#18968. As shown in my plot there, removing pageable-host memory copies appears to be the most reasonable path toward more scalable performance. For future readers, I also covered why I am doing this in rapidsai/cudf#18967.

On point 1

In the current implementation, host memory is optional, leaving it to the RMM user to decide whether to trade more pinned memory for performance, especially for device_buffers that communicate with the host.

I fully agree we should run experiments to understand how much additional memory is used. For that, we’d need a suitable workload in which some device_buffers interact with the host and others don’t, since not every device_buffer needs host communication. I’m not entirely sure what benchmark would best reflect this; perhaps a Parquet reading benchmark again?

As for flexible sizing, I agree in principle. However, from a design perspective, coupling buffer logic with future reduction logic may not be ideal. I believe we should aim for a more abstract separation between buffer (memory management) and computation logic (e.g., reduction).

On point 2

Lazy allocation scoped to copy operations sounds like a solid idea and should be feasible from an engineering standpoint. I’ll try drafting something. In my parallelism case with multiple CUDA streams, I don’t expect a major bandwidth difference—though latency benchmarks might show some variation.


Overall, I’d be happy to collaborate on a full benchmark. That said, I think we need a more structured discussion to align on a plan, as there are quite a few directions to explore. Once we reach consensus, I’d be confident moving forward with experiments to demonstrate that this approach can deliver meaningful speedups—like the 20% improvement we saw in the Parquet benchmark in rapidsai/cudf#18968 —and eventually a merge of this PR draft.

Comment on lines +175 to +180
if (result == cudaSuccess && attributes.type == cudaMemoryTypeHost) {
RMM_CUDA_TRY(cudaMemcpyAsync(
_host_bounce_buffer.value(), source, bytes, cudaMemcpyHostToHost, stream().value()));
RMM_CUDA_TRY(cudaMemcpyAsync(
_data, _host_bounce_buffer.value(), bytes, cudaMemcpyHostToDevice, stream().value()));
return;
Contributor

If the user's source pointer is in page-locked host memory this is a pessimisation, I think.

This branch only executes when the user passes a host MR, with the expectation that it would be some kind of pinned memory allocator. When the host MR is not set (the default case), the copy is done the same way it was before this PR.

@wence-
Contributor

wence- commented Aug 1, 2025

I tried reading through the past discussions.

My gut feeling is that I don't think RMM should be this opinionated about how copies are performed between host and device memory. For example, an application might know that its data access pattern only requires a single pinned bounce buffer and want to manage things manually, whereas here, as soon as you pass a host MR, you're committing to using it as a bounce buffer.

Relatedly I think, like @harrism, that eagerly releasing/reallocating a bounce buffer on every resize is probably a pessimisation. You might be updating the device data and not intend to copy it back to host. I think if we do want a model like this, we should only ensure the bounce buffer is appropriately sized when a copy is requested.

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

WDYT?

@vuule

vuule commented Aug 1, 2025

Relatedly I think, like @harrism, that eagerly releasing/reallocating a bounce buffer on every resize is probably a pessimisation. You might be updating the device data and not intend to copy it back to host. I think if we do want a model like this, we should only ensure the bounce buffer is appropriately sized when a copy is requested.

That is a good point. @bdice and I discussed an option of having a single-element bounce buffer in device_uvector to limit the overhead you described.

IMO this kind of optimization seems to fit in device_scalar, whose functionality revolves around passing a single value between host and device.

@bdice
Contributor

bdice commented Aug 1, 2025

We’ve had some trouble deciding on the appropriate scope for this internally. Perhaps we should do the least intrusive thing and only change device_scalar as @vuule is suggesting. I recognize this is a reversal of what we previously considered in #1985 and I apologize that we’ve had trouble coming to a design consensus here.

I am getting things set up to run the multithreaded Parquet benchmarks from cuDF so that I can study the design and safety/usability/performance tradeoffs a little more.

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

@wence- Can you expand on this? I am not sure what this API looks like.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 2, 2025

To everyone at NVIDIA, thank you for the insightful discussion and for sharing your thoughts here. I understand there have been even more detailed and thoughtful design conversations internally (it's unfortunate I couldn't be part of those).
I support your conclusions and will continue following this discussion for now.

@bdice
Contributor

bdice commented Aug 2, 2025

@JigaoLuo The conversation has mostly been about what we really want, and about trying to avoid asking you to go back and forth on designs (should this go in device_scalar? device_buffer? device_uvector?), since we aren't entirely sure what we want to do here.

Some of the main points we are wrestling with are:

  • When should the bounce buffer be created? (eagerly? on first touch?)
  • How large should the bounce buffer be? (the size of the full buffer/uvector, or just large enough for one element to handle set_element_async? We don't know of many immediate use cases in the Parquet reader for a large pinned buffer, because usually we are only getting/setting one element, like the first or last.)
  • What stream safety considerations are needed?
    • If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place. This could lead to a sequence of events like:
        1. set value to 0 on stream [bounce buffer value is set, and an async copy is initiated]
        2. launch kernel on stream using value
        3. set value to 1 on stream [bounce buffer value is set, and an async copy is initiated]
      where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.
  • Should this kind of buffering be done with a subclass of device_{buffer/scalar/uvector}? It changes the design pretty significantly.
    • If we implement this in device_buffer, the container is "raw bytes", so we don't have a concept like sizeof(T) to help us choose the pinned buffer size if we decide we don't want it to match the entire size of the device memory.

For now I think we are hoping to get rapidsai/cudf#19119 moved forward and implement something in libcudf, while continuing to evaluate what design would be most appropriate for RMM.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 2, 2025

Thanks @bdice, and all of you involved.
That makes a lot of sense to me. I’ll postpone this PR draft implementation for now. Hopefully, I’ll return to this after wrapping up my story issue in libcudf.

@wence-
Contributor

wence- commented Aug 4, 2025

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

@wence- Can you expand on this? I am not sure what this API looks like.

At the moment, if you want to use a pinned bounce buffer to stage copies you need to manage that yourself:

auto host_data = ...;
auto bounce_buffer = pinned(...);
cudaMemcpyAsync(bounce_buffer, host_data, size, cudaMemcpyHostToHost, stream);
rmm::device_buffer buf(bounce_buffer, size, stream, mr);

We also offer device_buffer::copy_async to copy out of a device buffer into a given user-provided pointer.

We could imagine providing copy_through_async(void *tgt, void *bounce, void const * src, ...) but maybe it is not worth it.
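As a rough sketch, such a utility might look like the following (`copy_through_async` is the hypothetical name floated above; the signature and behavior are assumptions, not an existing RMM API):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper sketched from the suggestion above: stage a copy
// from pageable host memory to the device through a caller-provided
// (presumably pinned) bounce buffer. Not an existing RMM API.
inline cudaError_t copy_through_async(void* tgt,        // device destination
                                      void* bounce,     // pinned staging buffer
                                      void const* src,  // pageable host source
                                      std::size_t bytes,
                                      cudaStream_t stream)
{
  // Host-to-host copy into the pinned buffer; issued through
  // cudaMemcpyAsync so it is ordered on the stream.
  auto status = cudaMemcpyAsync(bounce, src, bytes, cudaMemcpyHostToHost, stream);
  if (status != cudaSuccess) { return status; }
  // Then a genuinely asynchronous host-to-device copy out of pinned memory.
  return cudaMemcpyAsync(tgt, bounce, bytes, cudaMemcpyHostToDevice, stream);
}
```

The caller would remain responsible for keeping `bounce` alive and unmodified until the stream has executed both copies.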

@wence-
Contributor

wence- commented Aug 4, 2025

If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place -- this could lead to a sequence of events like:

  1. set `value` to `0` on `stream` [bounce buffer value is set, and an async copy is initiated]
  2. launch kernel on `stream` using `value`
  3. set `value` to `1` on `stream` [bounce buffer value is set, and an async copy is initiated]

where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.

If all of the transactions on the pinned bounce buffer are carried out with cudaMemcpyHostToHost then I believe that the host-side copies are stream ordered so this is not an issue.
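A sketch of that fix for the sequence above, with the host-side write itself issued as a stream-ordered copy (all names here are illustrative, not the actual device_scalar implementation):

```cuda
#include <cuda_runtime.h>

// Illustrative only: if the write into the pinned bounce buffer is itself a
// stream-ordered cudaMemcpyAsync (HostToHost), a second set cannot clobber
// the bounce buffer before the first set's copy and the intervening kernel
// have executed on the stream.
void set_element_async(int* device_ptr,        // element on the device
                       int* pinned_bounce,     // pinned bounce buffer
                       int const* host_value,  // value to write, on the host
                       cudaStream_t stream)
{
  cudaMemcpyAsync(pinned_bounce, host_value, sizeof(int),
                  cudaMemcpyHostToHost, stream);
  cudaMemcpyAsync(device_ptr, pinned_bounce, sizeof(int),
                  cudaMemcpyHostToDevice, stream);
}
```

One assumption worth noting: `host_value` must now also stay valid until the stream executes the host-to-host copy, since that copy is no longer synchronous with the caller.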

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 4, 2025

If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place -- this could lead to a sequence of events like:

  1. set `value` to `0` on `stream` [bounce buffer value is set, and an async copy is initiated]
  2. launch kernel on `stream` using `value`
  3. set `value` to `1` on `stream` [bounce buffer value is set, and an async copy is initiated]

where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.

If all of the transactions on the pinned bounce buffer are carried out with cudaMemcpyHostToHost then I believe that the host-side copies are stream ordered so this is not an issue.

Thanks—that aligns perfectly with what we discussed in the comment inside my cuDF draft: rapidsai/cudf#18968 (comment)

@bdice
Contributor

bdice commented Aug 4, 2025

@wence- @JigaoLuo Thanks, using cudaMemcpyAsync for stream-ordered host copies is a good solution for my concern about double-writes not being stream ordered. I talked about this with @vuule and that solution hadn't occurred to me.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 4, 2025

That’s another moment I wish I could’ve joined the discussion—I see several points and concerns, especially since RMM is such a fundamental library and deserves careful consideration.

I’d propose to first merge my draft rapidsai/cudf#18968 into cuDF and postpone this draft in RMM. The work in cuDF could serve as a pre-stage for refining RMM, especially since we also need support for pinned bounce buffers and host-to-host memcpy in cuDF. I hope we can discuss these aspects during the cuDF merge and use that as a springboard to plan what’s next for RMM. (I’m currently juggling a paper deadline, so I apologize in advance if there’s any delay.)

I’ve tried to address most of the key issues, including host-to-host memcpy (which I believe is the only viable solution). My draft rapidsai/cudf#18968 shows a 20% performance improvement and eliminates all async CUDA memcpys involving pageable memory. I’ve also shared everything I know and considered in issue rapidsai/cudf#18967, and I put real effort into the writing. I hope you found it helpful, and I’d be glad if you’d like to adapt the text.
