Skip to content

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Sep 2, 2024

New resource adaptor that uses an alternate upstream resource when the primary throws a specified exception type.

The motivation here is to provide NO-OOM by using managed memory when the primary device resource runs out of memory.

Checklist

  • I am familiar with the Contributing Guidelines.
  • Python tests
  • C++ tests
  • The documentation is up to date with these changes.

@madsbk madsbk added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Sep 2, 2024
@github-actions github-actions bot added Python Related to RMM Python API cpp Pertains to C++ code labels Sep 2, 2024
@madsbk madsbk marked this pull request as ready for review September 2, 2024 15:18
@madsbk madsbk requested review from a team as code owners September 2, 2024 15:18
@madsbk madsbk requested review from harrism and bdice September 2, 2024 15:18
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pre-empt #1661

Comment on lines 48 to 50
template <typename PrimaryUpstream,
typename AlternateUpstream,
typename ExceptionType = rmm::out_of_memory>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: We are moving away from having templated adaptors and instead just using type-erased resource_ref objects (see #1661). To avoid introducing a new such class here, can we immediately move to the new model:

template<typename ExceptionType = rmm::out_of_memory>
class failure_alternate_resource_adaptor final : public device_memory_resource {
   ...
   failure_alternate_resource_adaptor(device_async_resource_ref primary, device_async_resource_ref alternate) : ... {}
   

Or, if we can't type-erase yet, we should at least accept resource refs as an alternate constructor, and store resource_refs rather than templated type-specific usptreams.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cbd7f43, which rely on implicit type conversion from device_memory_resource* to device_async_resource_ref

@harrism
Copy link
Member

harrism commented Sep 3, 2024

Devil's advocating: couldn't we implement this using failure_callback_resource_adaptor? The callback just needs to know about the alternate MR.

@madsbk
Copy link
Member Author

madsbk commented Sep 3, 2024

Devil's advocating: couldn't we implement this using failure_callback_resource_adaptor? The callback just needs to know about the alternate MR.

Not as-is. The callback function in failure_callback_resource_adaptor returns a boolean thus there is no direct way for the callback to return the alternate allocation back to failure_callback_resource_adaptor . Another issue is how to handle the deallocation, an allocation allocated by the alternate resource cannot be deallocated by the primary resource.

It is possible to write a new resource that can handle both failure_callback_resource_adaptor and failure_alternate_resource_adaptor but I think this is a case were two simple resources are better than a complex one.

@madsbk madsbk requested a review from wence- September 3, 2024 06:31
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly doc comments. However also needs C++ tests.

Comment on lines 94 to 95
* @throws `exception_type` if the requested allocation could not be fulfilled
* by the primary or the alternate upstream resource.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ExceptionType. But actually I don't think you can say what type of exception will be thrown by alternate_upstream if it fails to allocate. It could be a CUDA error, or it could be rmm::out_of_memory, or some other exception.

The alternate upstream should document what exceptions it may throw.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

madsbk and others added 4 commits September 3, 2024 10:30
@madsbk madsbk requested a review from a team as a code owner September 3, 2024 11:08
@github-actions github-actions bot added the CMake label Sep 3, 2024
@madsbk madsbk requested a review from harrism September 3, 2024 11:08
@madsbk
Copy link
Member Author

madsbk commented Sep 3, 2024

Mostly doc comments. However also needs C++ tests.

Added C++ tests

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, good C++ tests.

Co-authored-by: Mark Harris <[email protected]>
@madsbk madsbk mentioned this pull request Sep 4, 2024
4 tasks
* @tparam ExceptionType The type of exception that this adaptor should respond to.
*/
template <typename ExceptionType = rmm::out_of_memory>
class failure_alternate_resource_adaptor final : public device_memory_resource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fallback_resource_adapater feels like a more concise name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, will rename.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wence- wence- changed the title Failure alternate resource adaptor Fallback resource adaptor Sep 6, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with request to rename test files appropriately.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still discussing the use cases of this feature offline -- for now, I'm requesting changes so that we don't merge this with typos in the API and filename name.

@madsbk madsbk marked this pull request as draft September 11, 2024 13:30
@madsbk
Copy link
Member Author

madsbk commented Sep 11, 2024

Let's put this on hold until we get some more use cases.

rapids-bot bot pushed a commit to rapidsai/rapidsmpf that referenced this pull request May 20, 2025
Implement out-of-memory protection by using a RMM resource `RmmFallbackResource` based on rapidsai/rmm#1665.

The idea is to use managed memory when the RMM pool raises an OOM error.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #287
@raydouglass raydouglass changed the base branch from branch-24.10 to main June 11, 2025 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp Pertains to C++ code improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to RMM Python API
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

5 participants