Conversation

@JigaoLuo JigaoLuo commented Jul 23, 2025

Description

This is an initial draft for issues #1959 and #1955 that adds a host memory resource for a pinned bounce buffer to rmm::device_buffer.

We discussed this in the previous PR draft #1985, and even earlier in cuDF (rapidsai/cudf#19119). This version moves the implementation into rmm::device_buffer.

I’m aware this draft is still far from ready to merge, but I believe it’s a good starting point, and I’d really appreciate your feedback as I continue refining it. I hope it sparks discussion (again 😃). Happy to revisit it if changes are needed; just aiming to get the conversation started.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JigaoLuo JigaoLuo requested a review from a team as a code owner July 23, 2025 07:06
@JigaoLuo JigaoLuo requested review from vyasr and bdice July 23, 2025 07:06
copy-pr-bot bot commented Jul 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@JigaoLuo JigaoLuo marked this pull request as draft July 23, 2025 07:07
_device{other._device}
_device{other._device},
_host_mr{other._host_mr},
_host_bounce_buffer{other._host_bounce_buffer}
Contributor Author

@JigaoLuo JigaoLuo Jul 23, 2025

To Line 90: I’ll start with a simple discussion point. I believe the bounce buffer is stateless: it doesn’t store anything and serves only as a temporary transfer buffer. If you agree, I’d prefer that it not be copied or moved, to reflect its transient nature.

It doesn't store anything, but it is an allocation that's a part of the object's state.
When we copy an object, I assume we want the new one to have the equivalent bounce buffer. So, we don't need to copy, but still need to allocate a new one.
When we move, reusing the old allocation should be the cheapest option, even though we don't care about the content.
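A minimal C++ sketch of that policy, using `std::unique_ptr` as a stand-in for the pinned allocation (the `bounce_buffer` type here is hypothetical, not RMM's actual implementation):

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for a pinned host allocation, illustrating the
// copy/move policy discussed above; not RMM's actual types.
class bounce_buffer {
 public:
  explicit bounce_buffer(std::size_t size)
    : size_{size}, data_{std::make_unique<char[]>(size)} {}

  // Copy: allocate a fresh buffer of equal size, but do NOT copy the
  // contents -- the buffer is only a transient staging area.
  bounce_buffer(bounce_buffer const& other)
    : size_{other.size_}, data_{std::make_unique<char[]>(other.size_)} {}

  // Move: steal the existing allocation, the cheapest option since we
  // don't care about the content.
  bounce_buffer(bounce_buffer&& other) noexcept
    : size_{std::exchange(other.size_, 0)}, data_{std::move(other.data_)} {}

  std::size_t size() const noexcept { return size_; }
  char* data() noexcept { return data_.get(); }

 private:
  std::size_t size_;
  std::unique_ptr<char[]> data_;
};
```

In a real device_buffer the allocation would of course come from the host MR rather than `new[]`; the sketch only captures the "copy allocates, move steals" semantics.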

_mr = other._mr;
_device = other._device;
_host_mr = other._host_mr;
_host_bounce_buffer = other._host_bounce_buffer;
Contributor Author

@JigaoLuo JigaoLuo Jul 23, 2025

To Line 114: Same reasoning as my earlier point about statelessness.

Contributor Author

Feel free to skip reviewing the unit tests for now.

@harrism
Member

harrism commented Jul 23, 2025

I think that it's important to document the increased memory usage that happens if the user passes a host_mr. It effectively doubles -- half on device_mr and half on host_mr. And I think the need for this is variable. For example, if this buffer backs a device_uvector, and you use it for an output and then do a reduction on it in place, you may only want to read a single element back to the host. So it's overkill to allocate host memory for the whole thing.

So I'm wondering if the host allocation should be lazy -- only allocate what is needed for a copy when it is needed? Performance will rely on a fast host_mr in that case.

This may need more experimentation.
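A hedged sketch of what lazy, copy-scoped allocation could look like (plain C++ with illustrative names; a real version would allocate from the user's host_mr):

```cpp
#include <cstddef>
#include <memory>

// Illustrative sketch of the lazy-allocation idea above: allocate the host
// staging buffer only when a copy is requested, sized to that copy rather
// than to the whole device allocation. Names are not RMM API.
class lazy_bounce_buffer {
 public:
  // Called at copy time: (re)allocate only if the current buffer is too
  // small for this copy. As noted above, performance then relies on a
  // fast host memory resource.
  char* ensure(std::size_t bytes) {
    if (bytes > capacity_) {
      data_     = std::make_unique<char[]>(bytes);
      capacity_ = bytes;
    }
    return data_.get();
  }

  std::size_t capacity() const noexcept { return capacity_; }

 private:
  std::size_t capacity_{0};  // stays 0 until a copy actually happens
  std::unique_ptr<char[]> data_;
};
```

With this shape, reading a single element back after an in-place reduction would allocate only a few bytes instead of mirroring the whole device buffer.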

@JigaoLuo
Contributor Author

JigaoLuo commented Jul 27, 2025

Hi @harrism, thanks for your thoughtful comments, and apologies for the delayed reply; I’ve been tied up with teaching duties over the past few days. You raised a few key points:

  • Point 1. Host memory allocation size, even with flexible sizing in advance
  • Point 2. Lazy allocation scoped to copy operations, with later deallocation

Let me share some initial thoughts:

On point 0 (as the starting point)

Regarding performance improvements, I’ve been working on a PR draft with benchmarking in cuDF: rapidsai/cudf#18968. As shown in my plot there, removing pageable-host memory copies appears to be the most reasonable path toward more scalable performance. For future readers, I also covered why I am doing this in rapidsai/cudf#18967.

On point 1

In the current implementation, host memory is optional, leaving it to the RMM user to decide whether to trade more pinned memory for performance, especially for device_buffers that communicate with the host.

I fully agree we should run experiments to understand how much additional memory is used. For that, we’d need a suitable workload in which some device_buffers interact with the host and others don’t, since not every device_buffer needs host communication. I’m not entirely sure what benchmark would best reflect this; perhaps a Parquet reading benchmark again?

As for flexible sizing, I agree in principle. However, from a design perspective, coupling buffer logic with future reduction logic may not be ideal. I believe we should aim for a more abstract separation between buffer (memory management) and computation logic (e.g., reduction).

On point 2

Lazy allocation scoped to copy operations sounds like a solid idea and should be feasible from an engineering standpoint. I’ll try drafting something. In my parallelism case with multiple CUDA streams, I don’t expect a major bandwidth difference—though latency benchmarks might show some variation.


Overall, I’d be happy to collaborate on a full benchmark. That said, I think we need a more structured discussion to align on a plan, as there are quite a few directions to explore. Once we reach consensus, I’d be confident moving forward with experiments to demonstrate that this approach can deliver meaningful speedups—like the 20% improvement we saw in the Parquet benchmark in rapidsai/cudf#18968 —and eventually a merge of this PR draft.

Comment on lines +175 to +180
if (result == cudaSuccess && attributes.type == cudaMemoryTypeHost) {
RMM_CUDA_TRY(cudaMemcpyAsync(
_host_bounce_buffer.value(), source, bytes, cudaMemcpyHostToHost, stream().value()));
RMM_CUDA_TRY(cudaMemcpyAsync(
_data, _host_bounce_buffer.value(), bytes, cudaMemcpyHostToDevice, stream().value()));
return;
Contributor

If the user's source pointer is in page-locked host memory this is a pessimisation, I think.

This branch only executes when the user passes a host MR, with the expectation that it would be some kind of pinned memory allocator. When the host MR is not set (the default case), the copy is done the same way it was before this PR.

@wence-
Contributor

wence- commented Aug 1, 2025

I tried reading through the past discussions.

My gut feeling is that I don't think RMM should be this opinionated about how copies are performed between host and device memory. For example, an application might know that its data access pattern only requires a single pinned bounce buffer and want to manage things manually, whereas here, as soon as you pass a host MR, you're committing to using it as a bounce buffer.

Relatedly I think, like @harrism, that eagerly releasing/reallocating a bounce buffer on every resize is probably a pessimisation. You might be updating the device data and not intend to copy it back to host. I think if we do want a model like this, we should only ensure the bounce buffer is appropriately sized when a copy is requested.

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

WDYT?

@vuule

vuule commented Aug 1, 2025

Relatedly I think, like @harrism, that eagerly releasing/reallocating a bounce buffer on every resize is probably a pessimisation. You might be updating the device data and not intend to copy it back to host. I think if we do want a model like this, we should only ensure the bounce buffer is appropriately sized when a copy is requested.

That is a good point. @bdice and I discussed an option of having a single-element bounce buffer in device_uvector to limit the overhead you described.

IMO this kind of optimization seems to fit in device_scalar, whose functionality revolves around passing a single value between host and device.

@bdice
Contributor

bdice commented Aug 1, 2025

We’ve had some trouble deciding on the appropriate scope for this internally. Perhaps we should do the least intrusive thing and only change device_scalar as @vuule is suggesting. I recognize this is a reversal of what we previously considered in #1985 and I apologize that we’ve had trouble coming to a design consensus here.

I am getting things set up to run the multithreaded Parquet benchmarks from cuDF so that I can study the design and safety/usability/performance tradeoffs a little more.

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

@wence- Can you expand on this? I am not sure what this API looks like.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 2, 2025

To everyone at NVIDIA, thank you for the insightful discussion and for sharing your thoughts here. I understand there have been even more detailed and thoughtful design conversations internally (it's unfortunate I couldn't be part of those).
I support your conclusions and will continue following this discussion for now.

@bdice
Contributor

bdice commented Aug 2, 2025

@JigaoLuo The conversation has mostly been about what we really want, and about trying to avoid asking you to go back and forth on designs (should this go in device_scalar? device_buffer? device_uvector?), since we aren't entirely sure what we want to do here.

Some of the main points we are wrestling with are:

  • When should the bounce buffer be created? (eagerly? on first touch?)
  • How large should the bounce buffer be? (the size of the full buffer/uvector, or just large enough for one element to handle set_element_async? We don't know of many immediate use cases in the Parquet reader for a large pinned buffer, because usually we are only getting/setting one element, like the first or last.)
  • What stream safety considerations are needed?
    • If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place. This could lead to a sequence of events like:
        1. set value to 0 on stream [bounce buffer value is set, and an async copy is initiated]
        2. launch kernel on stream using value
        3. set value to 1 on stream [bounce buffer value is set, and an async copy is initiated]
      where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.
  • Should this kind of buffering be done with a subclass of device_{buffer/scalar/uvector}? It changes the design pretty significantly.
    • If we implement this in device_buffer, the container is "raw bytes", so we don't have a concept like sizeof(T) to help us choose the pinned buffer size if we decide we don't want it to match the entire size of the device memory.

For now I think we are hoping to get rapidsai/cudf#19119 moved forward and implement something in libcudf, while continuing to evaluate what design would be most appropriate for RMM.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 2, 2025

Thanks @bdice, and all of you involved.
That makes a lot of sense to me. I’ll postpone this PR draft implementation for now. Hopefully, I’ll return to this after wrapping up my story issue in libcudf.

@wence-
Contributor

wence- commented Aug 4, 2025

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even having copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data.

@wence- Can you expand on this? I am not sure what this API looks like.

At the moment, if you want to use a pinned bounce buffer to stage copies you need to manage that yourself:

auto host_data = ...;
auto bounce_buffer = pinned(...);
cudaMemcpyAsync(bounce_buffer, host_data, size, cudaMemcpyHostToHost, stream);
rmm::device_buffer buf(bounce_buffer, size, stream, mr);

We also offer device_buffer::copy_async to copy out of a device buffer into a given user-provided pointer.

We could imagine providing copy_through_async(void *tgt, void *bounce, void const * src, ...) but maybe it is not worth it.
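As a rough sketch, such a utility might look like the following (`copy_through_async` is the hypothetical name floated above; the signature and behavior are assumptions, not an existing RMM API):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper sketched from the suggestion above: stage a copy
// from pageable host memory to the device through a caller-provided
// (presumably pinned) bounce buffer. Not an existing RMM API.
inline cudaError_t copy_through_async(void* tgt,        // device destination
                                      void* bounce,     // pinned staging buffer
                                      void const* src,  // pageable host source
                                      std::size_t bytes,
                                      cudaStream_t stream)
{
  // Host-to-host copy into the pinned buffer; issued through
  // cudaMemcpyAsync so it is ordered on the stream.
  auto status = cudaMemcpyAsync(bounce, src, bytes, cudaMemcpyHostToHost, stream);
  if (status != cudaSuccess) { return status; }
  // Then a genuinely asynchronous host-to-device copy out of pinned memory.
  return cudaMemcpyAsync(tgt, bounce, bytes, cudaMemcpyHostToDevice, stream);
}
```

The caller would remain responsible for keeping `bounce` alive and unmodified until the stream has executed both copies.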

@wence-
Contributor

wence- commented Aug 4, 2025

If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place -- this could lead to a sequence of events like:

  1. set `value` to `0` on `stream` [bounce buffer value is set, and an async copy is initiated]
  2. launch kernel on `stream` using `value`
  3. set `value` to `1` on `stream` [bounce buffer value is set, and an async copy is initiated]

where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.

If all of the transactions on the pinned bounce buffer are carried out with cudaMemcpyHostToHost then I believe that the host-side copies are stream ordered so this is not an issue.
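A sketch of that fix for the sequence above, with the host-side write itself issued as a stream-ordered copy (all names here are illustrative, not the actual device_scalar implementation):

```cuda
#include <cuda_runtime.h>

// Illustrative only: if the write into the pinned bounce buffer is itself a
// stream-ordered cudaMemcpyAsync (HostToHost), a second set cannot clobber
// the bounce buffer before the first set's copy and the intervening kernel
// have executed on the stream.
void set_element_async(int* device_ptr,        // element on the device
                       int* pinned_bounce,     // pinned bounce buffer
                       int const* host_value,  // value to write, on the host
                       cudaStream_t stream)
{
  cudaMemcpyAsync(pinned_bounce, host_value, sizeof(int),
                  cudaMemcpyHostToHost, stream);
  cudaMemcpyAsync(device_ptr, pinned_bounce, sizeof(int),
                  cudaMemcpyHostToDevice, stream);
}
```

One assumption worth noting: `host_value` must now also stay valid until the stream executes the host-to-host copy, since that copy is no longer synchronous with the caller.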

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 4, 2025

If a user calls set_element_async twice in a row, it seems possible that the bounce buffer could be overwritten before the stream-ordered copy takes place -- this could lead to a sequence of events like:

  1. set `value` to `0` on `stream` [bounce buffer value is set, and an async copy is initiated]
  2. launch kernel on `stream` using `value`
  3. set `value` to `1` on `stream` [bounce buffer value is set, and an async copy is initiated]

where step (3) overwrites the host-side bounce buffer value that will be copied from step (1) before the kernel in step (2) has executed on that stream.

If all of the transactions on the pinned bounce buffer are carried out with cudaMemcpyHostToHost then I believe that the host-side copies are stream ordered so this is not an issue.

Thanks—that aligns perfectly with what we discussed in the comment inside my cuDF draft: rapidsai/cudf#18968 (comment)

@bdice
Contributor

bdice commented Aug 4, 2025

@wence- @JigaoLuo Thanks, using cudaMemcpyAsync for stream-ordered host copies is a good solution for my concern about double-writes not being stream ordered. I talked about this with @vuule and that solution hadn't occurred to me.

@JigaoLuo
Contributor Author

JigaoLuo commented Aug 4, 2025

That’s another moment I wish I could’ve joined the discussion—I see several points and concerns, especially since RMM is such a fundamental library and deserves careful consideration.

I’d propose to first merge my draft rapidsai/cudf#18968 into cuDF and postpone this draft in RMM. The work in cuDF could serve as a pre-stage for refining RMM, especially since we also need support for pinned bounce buffers and host-to-host memcpy in cuDF. I hope we can discuss these aspects during the cuDF merge and use that as a springboard to plan what’s next for RMM. (I’m currently juggling a paper deadline, so I apologize in advance if there’s any delay.)

I’ve tried to address most of the key issues, including host-to-host memcpy (which I believe is the only viable solution). My draft rapidsai/cudf#18968 shows a 20% performance improvement and eliminates all async CUDA memcpys involving pageable memory. I’ve also shared everything I know and considered in issue rapidsai/cudf#18967, and I put real effort into the writing. I hope you found it helpful, and I’d be glad if you’d like to adapt the text.
