[ 🚧 Draft] : Adding host-mr for pinned bounce buffer to rmm::device_buffer
#1996
base: branch-25.08
Conversation
_device{other._device}
_device{other._device},
_host_mr{other._host_mr},
_host_bounce_buffer{other._host_bounce_buffer}
To Line 90: I’ll start with a simple discussion point: I believe the bounce buffer is stateless—it doesn’t store anything and serves only as a temporary transfer buffer. If you agree, I’d prefer not to copy or move it, to reflect its transient nature.
It doesn't store anything, but it is an allocation that's a part of the object's state.
When we copy an object, I assume we want the new one to have the equivalent bounce buffer. So, we don't need to copy, but still need to allocate a new one.
When we move, reusing the old allocation should be the cheapest option, even though we don't care about the content.
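The semantics described above can be sketched in a standalone way. This is a hypothetical illustration, not RMM's actual implementation: `buffer_with_bounce` is an invented name, and a plain heap allocation stands in for the pinned allocation that the real code would obtain from its host memory resource.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical sketch of the proposed copy/move semantics. The bounce
// buffer's *contents* are transient, but the allocation itself is part of
// the object's state.
struct buffer_with_bounce {
  std::size_t size{0};
  std::unique_ptr<std::byte[]> bounce;  // stand-in for the pinned bounce buffer

  explicit buffer_with_bounce(std::size_t n)
    : size{n}, bounce{std::make_unique<std::byte[]>(n)} {}

  // Copy: allocate a fresh, equally sized bounce buffer instead of
  // copying its bytes -- the new object gets an equivalent allocation.
  buffer_with_bounce(buffer_with_bounce const& other)
    : size{other.size}, bounce{std::make_unique<std::byte[]>(other.size)} {}

  // Move: reuse the existing allocation, the cheapest option; the stale
  // contents don't matter.
  buffer_with_bounce(buffer_with_bounce&& other) noexcept
    : size{std::exchange(other.size, 0)}, bounce{std::move(other.bounce)} {}
};
```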
_mr = other._mr;
_device = other._device;
_host_mr = other._host_mr;
_host_bounce_buffer = other._host_bounce_buffer;
To Line 114: Same reasoning as my earlier point about statelessness.
Feel free to skip reviewing the unit tests for now.
I think it's important to document the increased memory usage that happens if the user passes a host_mr. It effectively doubles -- half on device_mr and half on host_mr. And I think the need for this is variable. For example, if this buffer backs a

So I'm wondering if the host allocation should be lazy -- only allocate what is needed for a copy when it is needed? Performance will rely on a fast host_mr in that case. This may need more experimentation.
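The lazy scheme floated here could be sketched roughly as follows. All names are hypothetical, and a plain heap allocation stands in for what would really come from the host_mr; the point is only that nothing is allocated until a copy actually needs staging.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical sketch of lazy bounce-buffer allocation: the buffer starts
// empty and is only grown on demand, at the start of a copy that needs it.
class lazy_bounce_buffer {
  std::size_t capacity_{0};
  std::unique_ptr<std::byte[]> data_;  // stand-in for a host_mr allocation

 public:
  // Called when a copy needs staging; a no-op when already large enough.
  std::byte* ensure(std::size_t bytes) {
    if (bytes > capacity_) {
      data_     = std::make_unique<std::byte[]>(bytes);
      capacity_ = bytes;
    }
    return data_.get();
  }

  std::size_t capacity() const { return capacity_; }
};
```

As the comment notes, with this scheme the first copy pays the allocation cost, so overall performance would hinge on the host_mr being a fast (pooled) allocator.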
Hi @harrism , Thanks for your thoughtful comments, and apologies for the delayed reply—I’ve been tied up with teaching duties over the past few days. I see a few key points you raised:
Let me share some initial thoughts:

**On point 0 (as the starting point)**

Regarding performance improvements, I’ve been working on a PR draft with benchmarking in cuDF: rapidsai/cudf#18968. As shown in my plot, removing pageable-host memory copies appears to be the most reasonable path toward better scalable performance. I also covered why I am doing this in an issue for future readers: rapidsai/cudf#18967

**On point 1**

In the current implementation, host memory is optional, leaving it to the RMM user to decide whether to trade off more pinned memory for performance, especially for

I fully agree we should run experiments to understand how much additional memory is used. For that, we’d need a suitable workload where some

As for flexible sizing, I agree in principle. However, from a design perspective, coupling buffer logic with future reduction logic may not be ideal. I believe we should aim for a more abstract separation between buffer (memory management) and computation logic (e.g., reduction).

**On point 2**

Lazy allocation scoped to copy operations sounds like a solid idea and should be feasible from an engineering standpoint. I’ll try drafting something. In my parallelism case with multiple CUDA streams, I don’t expect a major bandwidth difference, though latency benchmarks might show some variation.

Overall, I’d be happy to collaborate on a full benchmark. That said, I think we need a more structured discussion to align on a plan, as there are quite a few directions to explore. Once we reach consensus, I’d be confident moving forward with experiments to demonstrate that this approach can deliver meaningful speedups, like the 20% improvement we saw in the Parquet benchmark in rapidsai/cudf#18968, and eventually a merge of this PR draft.
if (result == cudaSuccess && attributes.type == cudaMemoryTypeHost) {
  RMM_CUDA_TRY(cudaMemcpyAsync(
    _host_bounce_buffer.value(), source, bytes, cudaMemcpyHostToHost, stream().value()));
  RMM_CUDA_TRY(cudaMemcpyAsync(
    _data, _host_bounce_buffer.value(), bytes, cudaMemcpyHostToDevice, stream().value()));
  return;
If the user's source pointer is in page-locked host memory this is a pessimisation, I think.
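One way to avoid that pessimisation would be to skip staging when the source is already page-locked. In this runnable sketch the pinned-ness query is reduced to a bool parameter and plain `memcpy` stands in for the `cudaMemcpyAsync` calls, so it compiles without a GPU; real code might derive the flag from `cudaPointerGetAttributes`. All names here are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical dispatch: a pinned source takes a single direct copy to the
// device, while a pageable source is staged through the pinned bounce buffer.
// memcpy stands in for stream-ordered cudaMemcpyAsync so this runs host-only.
void copy_to_device(void* dst, void const* src, std::size_t bytes,
                    void* bounce, bool src_is_pinned) {
  if (src_is_pinned) {
    // Already page-locked: one async H2D copy is the fast path.
    std::memcpy(dst, src, bytes);  // would be cudaMemcpyAsync(HostToDevice)
    return;
  }
  // Pageable source: stage through the pinned bounce buffer first.
  std::memcpy(bounce, src, bytes);  // would be cudaMemcpyAsync(HostToHost)
  std::memcpy(dst, bounce, bytes);  // would be cudaMemcpyAsync(HostToDevice)
}
```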
This branch only executes when the user passes a host mr, with the expectation that it would be some kind of pinned memory allocator. When the host mr is not set (the default case), the copy is done the same way it was before this PR.
I tried reading through the past discussions. My gut feeling is that RMM should not be this opinionated about how copies are performed between host and device memory. For example, an application might know that its data access pattern only requires a single pinned bounce buffer and want to manage things manually. Here, as soon as you pass a host MR you're committing to using it as a bounce buffer.

Relatedly, I think, like @harrism, that eagerly releasing/reallocating a bounce buffer on every resize is probably a pessimisation. You might be updating the device data and not intend to copy it back to host. If we do want a model like this, I think we should only ensure the bounce buffer is appropriately sized when a copy is requested.

That said, it might be useful if RMM were to provide utilities that help users with this kind of staging. Maybe even copy overloads for device buffers that accept a user-provided bounce buffer as well as the source/target host data. WDYT?
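The utility shape suggested here might look something like the following. This is an entirely hypothetical API, not an existing RMM function: the caller owns and sizes the bounce buffer, so the library stays unopinionated about staging policy, and plain `memcpy` stands in for the stream-ordered CUDA copies so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical free function: copy pageable host data to device through a
// caller-owned pinned bounce buffer. The caller decides when and how the
// bounce buffer is allocated and reused; this only does the two-hop copy.
void staged_copy_to_device(void* device_dst, void const* host_src,
                           std::size_t bytes, void* bounce,
                           std::size_t bounce_capacity) {
  if (bytes > bounce_capacity) {
    throw std::invalid_argument{"bounce buffer too small for staged copy"};
  }
  std::memcpy(bounce, host_src, bytes);    // H2H into the pinned staging area
  std::memcpy(device_dst, bounce, bytes);  // would be an async H2D on a stream
}
```

An application with a known access pattern could then reuse one bounce buffer across many copies, rather than every `device_buffer` owning its own.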
That is a good point. @bdice and I discussed an option of having a single-element bounce buffer in

IMO this kind of optimization seems to fit in
We’ve had some trouble deciding on the appropriate scope for this internally. Perhaps we should do the least intrusive thing and only change device_scalar as @vuule is suggesting. I recognize this is a reversal of what we previously considered in #1985 and I apologize that we’ve had trouble coming to a design consensus here. I am getting things set up to run the multithreaded Parquet benchmarks from cuDF so that I can study the design and safety/usability/performance tradeoffs a little more.
@wence- Can you expand on this? I am not sure what this API looks like.
To everyone at NVIDIA, thank you for the insightful discussion and for sharing your thoughts here. I understand there have been even more detailed and thoughtful design conversations internally (it's unfortunate I couldn't be part of those).
@JigaoLuo The conversation has mostly been around what we really want, and trying to keep from asking you to go back and forth on designs (should this go in

Some of the main points we are wrestling with are:
For now I think we are hoping to get rapidsai/cudf#19119 moved forward and implement something in libcudf, while continuing to evaluate what design would be most appropriate for RMM.
Thanks @bdice, and all of you involved.
At the moment, if you want to use a pinned bounce buffer to stage copies you need to manage that yourself:
We also offer

We could imagine providing
If all of the transactions on the pinned bounce buffer are carried out with
Thanks, that aligns perfectly with what we discussed in the comment inside my cuDF draft: rapidsai/cudf#18968 (comment)
That’s another moment I wish I could’ve joined the discussion. I see several points and concerns, especially since RMM is such a fundamental library and deserves careful consideration.

I’d propose to first merge my draft rapidsai/cudf#18968 into cuDF, and postpone this draft in RMM. The work in cuDF could serve as a pre-stage for refining RMM, especially since we also need support for pinned bounce buffers and host-to-host memcpy in cuDF. I hope we can discuss these aspects during the cuDF merge and use that as a springboard to plan what’s next for RMM. (I’m currently juggling a paper deadline, so I apologize in advance if there’s any delay.)

I’ve tried to address most of the key issues, including host-to-host memcpy (which I believe is the only viable solution). My draft rapidsai/cudf#18968 shows a 20% performance improvement and eliminates all CUDA async memcpy with pageable memory. I’ve shared everything I know and considered in the issue rapidsai/cudf#18967 as well, and I put real effort into the writing. I hope you found it helpful, and I’d be glad if you’d like to adapt the text.
Description
This is an initial draft for Issues #1959 and #1955 that adds a host memory resource for a pinned bounce buffer into `rmm::device_buffer`. We’ve had discussions in the previous PR draft #1985 and even earlier in cuDF rapidsai/cudf#19119, and this version now moves the change into `rmm::device_buffer`.

I’m aware this draft is still far from being ready to merge. But I believe it’s a good starting point, and I’d really appreciate your feedback as I continue refining it. I hope it sparks discussion (again 😃), and I'm happy to revisit it if changes are needed. Just aiming to get the conversation started.
Checklist