Skip test on HMM systems with older CUDA drivers #1944
I spent some time on this but could not reduce it to a minimal reproducer to file as an upstream nvbug. I'm going to merge this PR as-is and let it go. I had to update the system I was using for the reproducer in order to start CUDA 13 bringup (which requires driver 580, and that driver does not reproduce the problem). The impact should be minimal, as the problem is only observed as a test failure in specific circumstances (old driver, new Linux kernel, HMM enabled). HMM itself works as intended, as the system-allocated pointers on the host can be read from the device.
/merge
Description
This works around the failures observed in #1935 by disabling system memory resource tests on systems with HMM and CUDA drivers older than 12.8 (R575).
I did some local testing with driver R550, which is old enough to reproduce the bug when used with a Linux kernel new enough to support HMM. I believe the system memory resource works as intended on HMM systems with earlier CUDA drivers, but it appears that is_device_accessible_memory returns false for the system-allocated pointers. This may be a driver bug, as the pointers seem to be device-accessible if, e.g., accessed from a CUDA kernel.
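The skip condition described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the code in this PR: the names `make_cuda_version`, `min_unaffected_driver`, and `should_skip_system_mr_test` are hypothetical, and the inputs are assumed to be gathered elsewhere (the driver version from `cudaDriverGetVersion`, and HMM support from a device attribute query such as `cudaDevAttrPageableMemoryAccess`, which is itself an assumption about how HMM would be detected).

```cpp
// Hypothetical sketch: these names are illustrative and do not come from the PR.

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 12.8 -> 12080).
// This is the format reported by cudaDriverGetVersion().
constexpr int make_cuda_version(int major, int minor)
{
  return 1000 * major + 10 * minor;
}

// Threshold taken from the PR description: drivers older than CUDA 12.8
// are treated as affected.
constexpr int min_unaffected_driver = make_cuda_version(12, 8);

// Returns true when the system memory resource tests should be skipped:
// HMM is enabled and the driver predates the threshold.
// Assumption: the caller supplies hmm_enabled and driver_version after
// querying the CUDA runtime; no CUDA calls are made here.
constexpr bool should_skip_system_mr_test(bool hmm_enabled, int driver_version)
{
  return hmm_enabled && driver_version < min_unaffected_driver;
}
```

Keeping the decision in a pure function separates the policy (which driver and HMM combinations are affected) from the runtime queries, so the threshold can be checked without a GPU. A test fixture could call it once during setup and skip when it returns true.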