Device to Device transfers don't work with OpenMPI + LinkX provider on AMD GPUs #13048
Comments
AMD: Can you reply?
I will need help here from @hppritcha and @amirshehataornl (who developed the LinkX provider in libfabric), since I am not that familiar with this code path. If it's as simple as backporting PR #12290, then it shouldn't be a challenge.
This should be assigned to @naughtont3 |
Thanks @edgargabriel. We might want to look into this soon so that it can get into 5.0.7 final, if possible.
I am hitting the same issue with `export FI_LNX_PROV_LINKS=shm+opx`.
Does using the opx provider alone work for D to D? Why would you mix opx with shm? It should already be giving good intra-node messaging performance.
@hppritcha The OPX provider alone works for D to D intra-node; we have our own SHM FIFO implementation. But I was curious whether the SHM provider has better D to D intra-node performance. It would be convenient for OPX to hook into a shared-memory implementation that gives us IPC support and other nice things.
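A minimal sketch of that intra-node comparison, assuming an OSU latency benchmark built with GPU support; the benchmark path and launch lines are placeholders, and how the linked lnx endpoint is ultimately selected can differ between libfabric and Open MPI versions:

```bash
# Sketch: compare intra-node device-to-device latency with the opx provider
# alone versus shm and opx linked through the LinkX (lnx) provider.
# BENCH is a placeholder for an OSU benchmark built with GPU support.
BENCH=./osu_latency

# 1) opx alone: restrict libfabric to the opx provider
export FI_PROVIDER=opx
mpirun -x FI_PROVIDER -n 2 $BENCH D D

# 2) shm and opx linked through LinkX (lnx); selecting the linked endpoint
#    may need additional provider-selection settings depending on the
#    libfabric and Open MPI versions
unset FI_PROVIDER
export FI_LNX_PROV_LINKS=shm+opx
mpirun -x FI_LNX_PROV_LINKS -n 2 $BENCH D D
```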
OpenMPI 5.0.6 with `shm+cxi:lnx` fails to perform Device - Device transfers on the LUMI system (AMD GPUs) with the OSU benchmarks. Host - Host transfers work as expected for intra- and inter-node transfers, but for Device - Device transfers OpenMPI fails with an error.

@hppritcha identified the problem to be related to #11076. There was a fix for this issue in #12290, but it was not merged to the 5.x branch.
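For reference, a reproduction sketch, assuming OSU micro-benchmarks built with ROCm support; the benchmark path, the `-d rocm` flag, and the srun launch lines are assumptions about the local LUMI setup rather than part of this report:

```bash
# Sketch of the failing configuration: link shm and cxi through the LinkX
# (lnx) provider (exposed by libfabric as shm+cxi:lnx) and run OSU bandwidth
# tests. Open MPI may additionally need to be pointed at that provider,
# e.g. via its OFI component settings.
export FI_LNX_PROV_LINKS=shm+cxi
BENCH=./osu_bw          # placeholder path to osu_bw built with ROCm support

# Host - Host: works as expected for intra- and inter-node transfers
srun -n 2 $BENCH H H

# Device - Device (ROCm buffers): the case reported to fail
srun -n 2 $BENCH -d rocm D D
```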