Is your feature request related to a problem? Please describe.
{
  // mr here is rmm::mr::pool_memory_resource<rmm::mr::pinned_memory_resource>
  rmm::device_uvector<int> tmps(1024, stream, mr);
  thrust::for_each(rmm::exec_policy_nosync(stream), tmps.begin(), tmps.end(),
                   [] __device__(auto val) { ... });
}
Here, users may expect tmps to follow stream semantics, i.e. to stay valid until the work already enqueued on stream has completed. Instead, it is deallocated immediately when the variable goes out of scope, which can happen before the work launched by thrust::for_each has actually finished.
This behavior can be surprising, and the fix requires an explicit stream synchronization, which may add unnecessary performance overhead.
Describe the solution you'd like
rmm::mr::pool_memory_resource should follow stream semantics even when the upstream memory allocator is the (host) pinned_memory_resource.
Describe alternatives you've considered
Explicitly calling cudaStreamSynchronize() on the stream before tmps goes out of scope.
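For reference, the workaround might look roughly like the sketch below. It is an illustration, not a recommended fix: the function name scale_temporaries, the template parameter MemoryResource, and the lambda body are hypothetical, and it assumes that mr is whatever resource handle your RMM version accepts in the rmm::device_uvector constructor. It uses rmm::cuda_stream_view::synchronize(), which wraps cudaStreamSynchronize, and needs to be compiled with nvcc's --extended-lambda option.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/for_each.h>

// Hypothetical helper. `MemoryResource` stands in for the resource handle
// type accepted by the rmm::device_uvector constructor in your RMM version
// (an assumption; the issue only says `mr` is a pool over pinned memory).
template <typename MemoryResource>
void scale_temporaries(rmm::cuda_stream_view stream, MemoryResource mr)
{
  // Allocate 1024 uninitialized ints from `mr`, ordered on `stream`.
  rmm::device_uvector<int> tmps(1024, stream, mr);

  // Enqueue the work on `stream`. The nosync policy does not wait for the
  // kernel to finish before returning to the host.
  thrust::for_each(rmm::exec_policy_nosync(stream), tmps.begin(), tmps.end(),
                   [] __device__(int& val) { val = 0; });  // placeholder body

  // Workaround: block the host until everything enqueued on `stream` has
  // completed, so the buffer is no longer in use when it is released.
  stream.synchronize();
}  // `tmps` is destroyed here, after the synchronization
```

The synchronization stalls the host thread on every call, which is the overhead this request would like to avoid.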