[QST] How to reduce the overhead & bottlenecks of do_allocate and do_deallocate with multistreams? #1887

@JigaoLuo

Description

Hello RMM team,

I'm currently using the RMM device memory pool rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> as described in the README. My program runs with multiple streams and with per-thread default streams (PTDS) enabled.
When profiling, I've noticed that the mutex operations in do_allocate and do_deallocate become a major bottleneck, for example when using 32 streams.

Bottleneck Example

Let me illustrate this with a simple profiling result. I'm using an RMM device buffer rmm::device_buffer as a temporary buffer when calling cub::reduce_by_key. The execution time of the CUB kernel (~40μs) is only a tiny fraction of the total time. However, do_allocate and do_deallocate each take over 1ms, because concurrent threads contend for the mutex during allocation and deallocation. This prevents the CPU from launching kernels quickly enough, so the GPU isn't fully utilized.

[profiler timeline screenshot]

The first pthread_mutex_lock comes from do_allocate:
[screenshot]

The last pthread_mutex_lock comes from do_deallocate:
[screenshot]

From RMM's stream_ordered_memory_resource:

```cpp
void* do_allocate(std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[A][stream %s][%zuB]", rmm::detail::format_stream(stream), size);
  if (size <= 0) { return nullptr; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  RMM_EXPECTS(size <= this->underlying().get_maximum_allocation_size(),
              std::string("Maximum allocation size exceeded (failed to allocate ") +
                rmm::detail::format_bytes(size) + ")",
              rmm::out_of_memory);
  auto const block = this->underlying().get_block(size, stream_event);
  RMM_LOG_TRACE("[A][stream %s][%zuB][%p]",
                rmm::detail::format_stream(stream_event.stream),
                size,
                block.pointer());
  log_summary_trace();
  return block.pointer();
}

/**
 * @brief Deallocate memory pointed to by `p`.
 *
 * @param p Pointer to be deallocated
 * @param size The size in bytes of the allocation to deallocate
 * @param stream The stream in which to order this deallocation
 */
void do_deallocate(void* ptr, std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[D][stream %s][%zuB][%p]", rmm::detail::format_stream(stream), size, ptr);
  if (size <= 0 || ptr == nullptr) { return; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  size = rmm::align_up(size, rmm::CUDA_ALLOCATION_ALIGNMENT);
  auto const block = this->underlying().free_block(ptr, size);
  // TODO: cudaEventRecord has significant overhead on deallocations. For the non-PTDS case
  // we may be able to delay recording the event in some situations. But using events rather
  // than streams allows stealing from deleted streams.
  RMM_ASSERT_CUDA_SUCCESS(cudaEventRecord(stream_event.event, stream.value()));
  stream_free_blocks_[stream_event].insert(block);
  log_summary_trace();
}
```

I've examined the RMM code and believe the critical sections in do_allocate and do_deallocate are quite large. Given that my program uses PTDS, I think some of that work could be moved outside the critical section. However, implementing such a fix isn't straightforward, and I haven't found an obvious solution yet :(

My Questions

My main question is how to work around the bottlenecks caused by do_allocate and do_deallocate. Should I consider switching to another memory resource inside RMM? Or is this overhead an inevitable cost of a pooled memory system?
Ultimately, I'd like to remove the CPU-side mutex contention from the allocation path so that it no longer limits how fast the CPU can launch kernels.
