
MPI One-sided applications only work via Infiniband adapter as expected, but not via Ethernet adapter #13128

Open
KADichev opened this issue Mar 6, 2025 · 2 comments



KADichev commented Mar 6, 2025

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

I am unable to verify that my one-sided MPI applications can use TCP sockets/Ethernet instead of InfiniBand on a cluster that has both Ethernet and InfiniBand adapters/switches. However I try to disable InfiniBand, I either get InfiniBand-level performance, or I get hangs when I compile Open MPI with InfiniBand support disabled. These issues do not manifest for codes using two-sided MPI.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • The IB-enabled Open MPI: pre-packaged openmpi-4.1.7a from the MLNX_OFED package on the NVIDIA website
  • The IB-disabled Open MPI: openmpi-4.1.4 built from source, configured with ../configure --without-ucx

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
    Ubuntu 22.04.4 LTS
  • Computer hardware:
    ARM nodes with Ethernet cards and Mellanox Infiniband cards (>4 years old)
  • Network type:
    Both Ethernet and Mellanox Infiniband

Details of the problem

I am unable to run one-sided MPI applications over sockets / the Ethernet card instead of InfiniBand.
My attempt to disable InfiniBand is as follows:
mpirun --mca pml ^ucx --mca btl ^vader ./my-app
The application runs with an actual throughput of 40-50 Gbps, yet our network has 25 Gbit Ethernet and 100 Gbit InfiniBand. I am now convinced my settings are ignored and InfiniBand is still being used.
In fact, the actual throughput is identical (40-50 Gbps) to that of the following line:
mpirun --mca pml ucx ./my-app
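
For reference, a common way to force Open MPI 4.1 onto plain TCP is to select the non-UCX ob1 PML and list the allowed BTLs explicitly, and to exclude UCX from the one-sided (osc) framework as well, since pml and osc are selected independently. This is a sketch: `./my-app` is a placeholder, and the interface name `eth0` is an assumption about the system.

```shell
# Select the ob1 PML (non-UCX) and allow only TCP, shared memory, and self BTLs;
# also keep UCX out of the one-sided (osc) framework.
mpirun --mca pml ob1 --mca btl tcp,vader,self --mca osc ^ucx ./my-app

# Optionally pin the TCP BTL to a specific Ethernet interface
# (interface name eth0 is an assumption):
mpirun --mca pml ob1 --mca btl tcp,vader,self \
       --mca btl_tcp_if_include eth0 --mca osc ^ucx ./my-app
```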

However, when I use e.g. NetPIPE, which relies on point-to-point calls, I do get the expected actual throughput: ~90 Gbps with --mca pml ucx and ~16 Gbps with --mca pml ^ucx, which fits the underlying hardware perfectly.

So it seems that one-sided communication simply does not work over the Ethernet card / sockets.

I tried to get more output via the verbose flags, but I could not find a definitive answer there.
On the other hand, I tried compiling Open MPI with UCX disabled. In that case, my application hangs completely, apparently in MPI_Win_flush:

#0  0x000040000832be3c in __GI___poll (fds=0xaaaaf30549f0, nfds=4, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x0000400008994b44 in ?? () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#2  0x0000400008990140 in event_base_loop () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#3  0x000040000873dcfc in opal_progress_events.isra () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#4  0x000040000873de54 in opal_progress () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#5  0x000040000a205c2c in ompi_osc_pt2pt_flush_lock () from /home/kdichev/openmpi-4.1.4/build-arm/lib/openmpi/mca_osc_pt2pt.so
#6  0x0000400007f252b0 in PMPI_Win_flush () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libmpi.so.40

I know that asking for Ethernet on an InfiniBand cluster when using one-sided MPI is unusual; I just want some reference numbers. Still, isn't Open MPI supposed to work in this scenario too, converting one-sided calls into some sort of active messages that issue point-to-point calls on the remote side, ultimately doing two-sided communication? In any case, I can't get this to work.
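
The active-message emulation described above can be illustrated with a toy sketch (plain Python with stdlib queues, not Open MPI's actual code; all names here are made up): a "put" becomes an ordinary message to the target, and the target's progress engine applies it — which is also why a stalled progress engine shows up as a hang in a flush.

```python
# Toy sketch (NOT Open MPI's code): emulating a one-sided "put" with
# two-sided messages, the way an active-message osc component would.
import queue

class ToyRank:
    """One simulated MPI rank: an exposed memory window plus an inbox."""
    def __init__(self, win_size):
        self.window = [0] * win_size   # the "window" others can write into
        self.inbox = queue.Queue()     # incoming point-to-point messages

    def progress(self):
        # Apply every pending update. In real MPI the *target* rank must
        # enter the progress engine for these to land, which is why a
        # stalled target shows up as a hang in MPI_Win_flush.
        while True:
            try:
                offset, value = self.inbox.get_nowait()
            except queue.Empty:
                return
            self.window[offset] = value

def put(value, target, offset):
    """MPI_Put, emulated as an ordinary send to the target's inbox."""
    target.inbox.put((offset, value))

def flush(target):
    """MPI_Win_flush: wait until the target has applied our updates."""
    while not target.inbox.empty():
        target.progress()

rank0, rank1 = ToyRank(4), ToyRank(4)
put(42, rank1, offset=2)   # rank 0 "puts" 42 into slot 2 of rank 1's window
flush(rank1)
print(rank1.window)        # [0, 0, 42, 0]
```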

Any clarification would be much appreciated!

devreal (Contributor) commented Mar 7, 2025

Thanks for the report. We removed the osc/pt2pt component (which seems to be what is used without UCX in your case) in 5.0 because it was unmaintained. Could you try running the non-UCX version with --mca osc ^pt2pt?

KADichev (Author) commented Mar 7, 2025

Thanks for the quick response. Indeed, using --mca osc ^pt2pt allowed me to progress further in the application. Sadly, very soon after, I got an error in the very first MPI_Win_allocate call:


[srv02:1441844] *** An error occurred in MPI_Win_allocate
[srv02:1441844] *** reported by process [1707540481,70368744177664]
[srv02:1441844] *** on communicator MPI_COMM_WORLD
[srv02:1441844] *** MPI_ERR_WIN: invalid window
[srv02:1441844] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[srv02:1441844] ***    and potentially your MPI job)
[srv02:1441834] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[srv02:1441834] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[srv02:1441834] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[srv02:1441834] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

I reviewed the code for quite some time and, after finding no errors, I ran the following public example using MPI_Win_allocate as a standalone test, again with mpirun --mca osc ^pt2pt:

https://rookiehpc.org/mpi/docs/mpi_win_allocate/index.html

The exact same error was thrown! So the issue likely lies with UCX-disabled one-sided communication in general. I am still unable to successfully run any code this way.

(NB: I also realized that with the NVIDIA-packaged, UCX-enabled version, the choice between --mca pml ^ucx and --mca pml ucx only affected the two-sided NetPIPE runs and not any one-sided code, which in hindsight makes sense. Sorry about that. But let's focus here on the issues with the UCX-disabled Open MPI 4.1.4.)
