
Device to Device transfers don't work with OpenMPI + LinkX provider on AMD GPUs #13048

Open
angainor opened this issue Jan 22, 2025 · 8 comments

@angainor
OpenMPI 5.0.6 with shm+cxi:lnx fails to perform Device - Device transfers on the LUMI system (AMD GPUs) with the OSU benchmarks. Host - Host transfers work as expected for intra- and inter-node transfers. For Device - Device transfers, OpenMPI fails with:

export FI_LNX_PROV_LINKS=shm+cxi
mpirun --mca opal_common_ofi_provider_include "shm+cxi:lnx" -np 2 -map-by numa ./osu_bibw -m 131072: D D

# OSU MPI-ROCM Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
--------------------------------------------------------------------------
Open MPI failed to register your buffer.
This error is fatal, your job will abort

  Buffer Type: rocm
  Buffer Address: 0x154beaa00000
  Buffer Length: 131072
  Error: Required key not available (4294967030)
--------------------------------------------------------------------------

@hppritcha identified the problem as being related to #11076. A fix for this issue exists in #12290, but it was not merged to the 5.x branch.
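For comparison, a minimal sketch of the Host - Host run that is reported above to work, using the same flags as the failing case (the H H arguments select host buffers in the OSU benchmarks):

export FI_LNX_PROV_LINKS=shm+cxi
mpirun --mca opal_common_ofi_provider_include "shm+cxi:lnx" -np 2 -map-by numa ./osu_bibw -m 131072: H H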

@jsquyres (Member)

AMD: Can you reply?

@edgargabriel (Member)

I will need help here from @hppritcha and @amirshehataornl (who developed the linkx provider in libfabric), since I am not that familiar with this code path. If it's as simple as backporting PR #12290, then it shouldn't be a challenge.

@hppritcha (Member)

This should be assigned to @naughtont3

@jsquyres (Member)

Thanks @edgargabriel. We might want to look into this soon so that it can get into 5.0.7 final, if possible.

@tmh97 commented Feb 12, 2025

I am hitting the same issue with CUDA buffers when doing:

export FI_LNX_PROV_LINKS=shm+opx
mpirun --mca opal_common_ofi_provider_include "shm+opx:lnx"

Side note:
I tried the workaround that was applied to cxi here and replaced cxi with shm on the referenced lines. I was able to run osu_bw D D intra-node, but the run only got through 1 byte and then segfaulted.
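A hypothetical sketch of the kind of provider-name check being described; the function name and the substring-vs-exact match choice are assumptions rather than the actual Open MPI source, and only the libfabric fi_info fields are taken as given:

#include <stdbool.h>
#include <string.h>
#include <rdma/fabric.h>

/* Hypothetical illustration only: match the active provider's name.
 * info->fabric_attr->prov_name carries names such as "cxi", "shm", or the
 * LinkX composite "shm+cxi:lnx"; a substring match (as here) also covers
 * the composite name, while an exact strcmp() would not. */
static bool provider_name_matches(const struct fi_info *info, const char *needle)
{
    return strstr(info->fabric_attr->prov_name, needle) != NULL;
}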

@hppritcha (Member)

Does using the opx provider alone work for D to D? Why would you mix opx with shm? It should already be giving good intra-node messaging performance.
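A sketch of the run being asked about, assuming the same benchmark setup tmh97 described (the flags follow the pattern used earlier in this thread; the exact provider-include value is an assumption):

mpirun --mca opal_common_ofi_provider_include "opx" -np 2 ./osu_bw D D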

@edgargabriel (Member) commented Feb 12, 2025

Created the backport of #12290 in #13090.

@jsquyres modified the milestones: v5.0.6, v5.0.8 (Feb 18, 2025)
@tmh97 commented Feb 20, 2025

@hppritcha The OPX provider alone works for D to D intra-node; we have our own SHM FIFO implementation. But I was curious whether the SHM provider has better D to D intra-node performance. It would be convenient for OPX to hook into a shared memory implementation that gives us IPC support and other nice things.
