Description
Hello RMM team,
I'm currently using the RMM device memory pool `rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>` as described in the README. My program operates with multiple streams and with PTDS (per-thread default stream) enabled.
When profiling, I've noticed that the mutex operations in `do_allocate` and `do_deallocate` become a major bottleneck, for example when using 32 streams.
Bottleneck Example
Let me illustrate this with a simple profiling result. I'm using an RMM device buffer `rmm::device_buffer` as a temporary buffer for calling `cub::reduce_by_key`. The execution time of the CUB kernel (~40μs) is only a tiny fraction of the total time. However, `do_allocate` and `do_deallocate` each take over 1 ms, because concurrent threads hold the mutex during allocation and deallocation. This prevents the CPU from launching kernels quickly enough, and as a result the GPU isn't fully utilized.
The first `pthread_mutex_lock` comes from `do_allocate`:
The last `pthread_mutex_lock` comes from `do_deallocate`:
In RMM's `stream_ordered_memory_resource` (rmm/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp, lines 205 to 259 at da3f558):

```cpp
void* do_allocate(std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[A][stream %s][%zuB]", rmm::detail::format_stream(stream), size);
  if (size <= 0) { return nullptr; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  RMM_EXPECTS(size <= this->underlying().get_maximum_allocation_size(),
              std::string("Maximum allocation size exceeded (failed to allocate ") +
                rmm::detail::format_bytes(size) + ")",
              rmm::out_of_memory);
  auto const block = this->underlying().get_block(size, stream_event);
  RMM_LOG_TRACE("[A][stream %s][%zuB][%p]",
                rmm::detail::format_stream(stream_event.stream),
                size,
                block.pointer());
  log_summary_trace();
  return block.pointer();
}

/**
 * @brief Deallocate memory pointed to by `p`.
 *
 * @param p Pointer to be deallocated
 * @param size The size in bytes of the allocation to deallocate
 * @param stream The stream in which to order this deallocation
 */
void do_deallocate(void* ptr, std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[D][stream %s][%zuB][%p]", rmm::detail::format_stream(stream), size, ptr);
  if (size <= 0 || ptr == nullptr) { return; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  auto const block = this->underlying().free_block(ptr, size);
  // TODO: cudaEventRecord has significant overhead on deallocations. For the non-PTDS case
  // we may be able to delay recording the event in some situations. But using events rather than
  // streams allows stealing from deleted streams.
  RMM_ASSERT_CUDA_SUCCESS(cudaEventRecord(stream_event.event, stream.value()));
  stream_free_blocks_[stream_event].insert(block);
  log_summary_trace();
}
```
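To illustrate what I mean by shrinking the critical section: the alignment step at the top of `do_allocate` touches no shared state, so in principle it could run before the lock is taken. Here is a toy sketch of mine (not a proposed RMM patch; `align_up` and `ToyPool` only mimic the real code):

```cpp
// Toy sketch of a narrowed critical section: lock-free work (alignment)
// happens before the mutex, which guards only the shared bookkeeping.
#include <cstddef>
#include <mutex>

constexpr std::size_t kAlignment = 256;  // stand-in for CUDA_ALLOCATION_ALIGNMENT

std::size_t align_up(std::size_t value, std::size_t alignment) {
  return (value + alignment - 1) / alignment * alignment;
}

class ToyPool {
 public:
  std::size_t allocate(std::size_t size) {
    // Thread-private work stays outside the lock.
    size = align_up(size, kAlignment);
    std::lock_guard<std::mutex> lock(mtx_);  // lock only for shared state
    allocated_ += size;
    return size;
  }
  std::size_t allocated() const { return allocated_; }

 private:
  std::mutex mtx_;
  std::size_t allocated_{0};
};
```

Of course the bulk of the real critical section (`get_block`, the free-block maps, `cudaEventRecord`) genuinely touches shared state, which is why I haven't found an easy reordering.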
I've examined the RMM code and believe that the critical sections in `do_allocate` and `do_deallocate` are quite large. Given that my program uses PTDS, I think some parts of these functions could be moved outside the critical section. However, implementing such a fix isn't straightforward, and I haven't found an obvious solution yet :(
My Questions
My main question is how to work around the bottlenecks caused by `do_allocate` and `do_deallocate`. Should I consider switching to another memory resource inside RMM? Or is this overhead an inevitable cost of a pooled memory system?
Ultimately, I want to optimize memory allocation on the CPU (mutex) side so that kernel launches are no longer limited by these bottlenecks.
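One direction I've been considering, sketched here in plain C++ without CUDA (the `SharedPool` and `ThreadCache` names are made up for illustration, this is not RMM's API): a small per-thread cache in front of the shared pool, so that most allocations never touch the global mutex and only occasional bulk refills do.

```cpp
// Toy per-thread caching sketch: the fast path is lock-free, and only
// bulk refills go through the shared, mutex-protected pool.
#include <cstddef>
#include <mutex>

struct SharedPool {                         // stand-in for the global pool
  std::mutex mtx;
  std::size_t handed_out{0};
  std::size_t allocate(std::size_t n) {
    std::lock_guard<std::mutex> lock(mtx);  // the contended path
    handed_out += n;
    return n;
  }
};

struct ThreadCache {                        // one instance per thread
  SharedPool* pool;
  std::size_t cached{0};
  std::size_t allocate(std::size_t n) {
    if (cached >= n) {                      // fast path: no mutex at all
      cached -= n;
      return n;
    }
    std::size_t chunk = pool->allocate(n * 8);  // refill in bulk: one lock
    cached = chunk - n;
    return n;
  }
};
```

With a `thread_local ThreadCache`, only about one in eight same-sized allocations would hit the shared lock in this sketch. I don't know whether this fits RMM's stream-ordered semantics, though, since blocks there are tracked per stream event.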