[QST] How to reduce the overhead & bottlenecks of do_allocate and do_deallocate with multistreams? #1887

@JigaoLuo

Description

Hello RMM team,

I'm currently using the RMM device memory pool rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> as described in the README. My program runs with multiple streams and with per-thread default streams (PTDS) enabled.
When profiling, I've noticed that the mutex operations in do_allocate and do_deallocate become a major bottleneck, for example when using 32 streams.

Bottleneck Example

Let me illustrate this with a simple profiling result. I'm using an RMM device buffer rmm::device_buffer as a temporary buffer when calling cub::reduce_by_key. The execution time of the CUB kernel (~40μs) is only a tiny fraction of the total time. However, do_allocate and do_deallocate each take over 1ms, because concurrent threads contend for the mutex during allocation and deallocation. This prevents the CPU from launching kernels quickly enough, so the GPU isn't fully utilized.

[profiler timeline screenshot]

The first pthread_mutex_lock comes from do_allocate:
[screenshot]

The last pthread_mutex_lock comes from do_deallocate:
[screenshot]

From RMM's stream_ordered_memory_resource:

```cpp
void* do_allocate(std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[A][stream %s][%zuB]", rmm::detail::format_stream(stream), size);
  if (size <= 0) { return nullptr; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  RMM_EXPECTS(size <= this->underlying().get_maximum_allocation_size(),
              std::string("Maximum allocation size exceeded (failed to allocate ") +
                rmm::detail::format_bytes(size) + ")",
              rmm::out_of_memory);
  auto const block = this->underlying().get_block(size, stream_event);
  RMM_LOG_TRACE("[A][stream %s][%zuB][%p]",
                rmm::detail::format_stream(stream_event.stream),
                size,
                block.pointer());
  log_summary_trace();
  return block.pointer();
}

/**
 * @brief Deallocate memory pointed to by `p`.
 *
 * @param p Pointer to be deallocated
 * @param size The size in bytes of the allocation to deallocate
 * @param stream The stream in which to order this deallocation
 */
void do_deallocate(void* ptr, std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[D][stream %s][%zuB][%p]", rmm::detail::format_stream(stream), size, ptr);
  if (size <= 0 || ptr == nullptr) { return; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  auto const block = this->underlying().free_block(ptr, size);
  // TODO: cudaEventRecord has significant overhead on deallocations. For the non-PTDS case
  // we may be able to delay recording the event in some situations. But using events rather
  // than streams allows stealing from deleted streams.
  RMM_ASSERT_CUDA_SUCCESS(cudaEventRecord(stream_event.event, stream.value()));
  stream_free_blocks_[stream_event].insert(block);
  log_summary_trace();
}
```

I've examined the RMM code and believe the critical sections in do_allocate and do_deallocate are quite large. Given that my program uses PTDS, I think some of that work could be moved outside the critical section. However, implementing such a fix isn't straightforward, and I haven't found an obvious solution yet :(

My Questions

My main question is how to work around the bottlenecks caused by do_allocate and do_deallocate. Should I consider switching to another memory resource inside RMM? Or is this overhead an inevitable cost of a pooled memory system?
Ultimately, I'd like to remove the CPU-side mutex contention from the allocation path so that it no longer limits how fast the CPU can launch kernels.
