@bdice bdice commented Jun 4, 2025

Description

This works around the failures observed in #1935 by disabling system memory resource tests on systems with HMM and CUDA drivers older than 12.8 (R575).

I did some local testing with driver R550, which is old enough to reproduce the bug when used with a Linux kernel new enough to support HMM. I believe the system memory resource works as intended on HMM systems with earlier CUDA drivers, but it appears that is_device_accessible_memory returns false for the system-allocated pointers. This may be a driver bug, as the pointers seem to be device-accessible if, e.g., accessed from a CUDA kernel.
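The skip condition described above can be sketched as a small predicate. This is only an illustration, not the code in this PR: `should_skip_system_mr_test` and `kFirstUnaffectedDriverVersion` are hypothetical names, and the 12.8 threshold is taken directly from the description. The sketch assumes the driver version is encoded the way `cudaDriverGetVersion` reports it (1000 * major + 10 * minor) and that HMM support is determined elsewhere, for example by querying the `cudaDevAttrPageableMemoryAccess` device attribute.

```cpp
// Driver versions use the cudaDriverGetVersion encoding:
// 1000 * major + 10 * minor, so CUDA 12.8 is 12080.
// Assumption: the cutoff follows the "older than 12.8" wording in the description.
constexpr int kFirstUnaffectedDriverVersion = 12080;

// Hypothetical helper: returns true when the system memory resource tests
// should be skipped, i.e. HMM is available and the driver is older than the cutoff.
constexpr bool should_skip_system_mr_test(bool hmm_supported, int driver_version)
{
  return hmm_supported && driver_version < kFirstUnaffectedDriverVersion;
}
```

With this encoding, an R550 driver reports 12040 and would be skipped on an HMM system, while an R580 driver reports 13000 and would run the tests. A system without HMM is never skipped by this predicate, whatever its driver version.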

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner June 4, 2025 18:25
@bdice bdice requested review from harrism and wence- June 4, 2025 18:25
@bdice bdice added the DO NOT MERGE Hold off on merging; see PR for details label Jun 4, 2025
@bdice bdice changed the base branch from branch-25.08 to branch-25.06 June 4, 2025 18:25
@bdice bdice requested a review from a team as a code owner June 4, 2025 18:49
@github-actions github-actions bot added the CMake label Jun 4, 2025
@bdice bdice requested a review from a team as a code owner June 4, 2025 19:10
@bdice bdice requested a review from AyodeAwe June 4, 2025 19:10
@github-actions github-actions bot added the ci label Jun 4, 2025
@github-actions github-actions bot removed the CMake label Jun 4, 2025
@bdice bdice removed request for a team, harrism, wence- and AyodeAwe June 5, 2025 13:09
@bdice bdice force-pushed the fix-nightly-failure branch from cd0cfd1 to be6db5e Compare June 10, 2025 13:35
@bdice bdice requested a review from a team as a code owner June 10, 2025 13:35
@github-actions github-actions bot added CMake Python Related to RMM Python API conda labels Jul 28, 2025
@bdice bdice changed the base branch from branch-25.08 to branch-25.10 July 28, 2025 22:36
@github-actions github-actions bot removed CMake Python Related to RMM Python API conda labels Jul 28, 2025
@github-actions github-actions bot removed the ci label Jul 30, 2025
@bdice bdice changed the title [DO NOT MERGE] Test HMM behavior Skip test on HMM systems with older CUDA drivers Jul 30, 2025
@bdice bdice removed the DO NOT MERGE Hold off on merging; see PR for details label Jul 30, 2025
@bdice bdice self-assigned this Jul 30, 2025
@bdice bdice added bug Something isn't working improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed improvement Improvement / enhancement to an existing function labels Jul 30, 2025
@wence- wence- removed request for a team and msarahan July 30, 2025 10:02

bdice commented Aug 8, 2025

I spent some time on this but could not reduce it to a minimal reproducer to file as an upstream nvbug. I'm going to merge this PR as-is and let it go.

I had to update the system where I was working on a minimal reproducer in order to start on CUDA 13 bringup (requires driver 580, which does not reproduce the problem).

The impact of this is hopefully minimal, as it is only observed as a test failure in specific circumstances (old driver, new Linux kernel, HMM enabled). HMM works as intended, as the system-allocated pointers on host can be read from device. cudaPointerGetAttributes incorrectly tells us that devicePointer is a nullptr, which indicates it's not accessible on device, but it does appear to work in practice.
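To make the failure mode concrete, here is a rough model of how a check based on pointer attributes can report "not accessible" for memory that a kernel can in fact read. It is a sketch under assumptions, not RMM or CUDA code: `MemoryType`, `PointerAttributes`, and `looks_device_accessible` are stand-ins that mirror only the `type` and `devicePointer` fields of `cudaPointerAttributes`, and the exact logic of the real helper may differ.

```cpp
// Stand-in for the memory type reported by cudaPointerGetAttributes.
enum class MemoryType { Unregistered, Host, Device, Managed };

// Stand-in for the two cudaPointerAttributes fields discussed above.
struct PointerAttributes {
  MemoryType type;
  const void* device_pointer;  // nullptr means "no device-usable address reported"
};

// Hypothetical check: device and managed allocations are always accessible;
// anything else is trusted only if a non-null device pointer is reported.
constexpr bool looks_device_accessible(PointerAttributes const& attrs)
{
  if (attrs.type == MemoryType::Device || attrs.type == MemoryType::Managed) { return true; }
  return attrs.device_pointer != nullptr;
}
```

Under this model, a system-allocated pointer reported with a null `device_pointer` is classified as inaccessible even when HMM lets a kernel dereference it, which matches the test failure described above.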


bdice commented Aug 8, 2025

/merge

@rapids-bot rapids-bot bot merged commit 48492f1 into rapidsai:branch-25.10 Aug 8, 2025
54 of 55 checks passed
Labels
bug Something isn't working non-breaking Non-breaking change
Projects
Status: Done