Description
Please submit all the information below so that we can understand the working environment that is the context for your question.
Background information
I am unable to verify that my MPI one-sided applications can use TCP sockets/Ethernet instead of InfiniBand on a cluster that has both Ethernet and InfiniBand adapters/switches. I either get InfiniBand-level performance no matter how I try to disable InfiniBand, or I get hangs when I compile Open MPI with InfiniBand support disabled. These issues do not manifest for codes using two-sided MPI.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
- The IB-enabled Open MPI: pre-packaged openmpi-4.1.7a from the MLNX_OFED package on the NVIDIA website
- The IB-disabled Open MPI: openmpi-4.1.4 from source, compiled via
../configure --without-ucx
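For reference, the full build was essentially a standard out-of-tree tarball build; a sketch of it (the build-arm directory name is taken from the library paths in the backtrace further below, the remaining options are placeholders):

cd openmpi-4.1.4
mkdir build-arm && cd build-arm
../configure --without-ucx
make -j && make install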
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A - built from a release tarball, not a git clone.
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04.4 LTS
- Computer hardware: ARM nodes with Ethernet cards and Mellanox InfiniBand cards (>4 years old)
- Network type: both Ethernet and Mellanox InfiniBand
Details of the problem
I am unable to run MPI one-sided applications over TCP sockets / the Ethernet card instead of over InfiniBand.
My attempts to disable InfiniBand are as follows:
mpirun --mca pml ^ucx --mca btl ^vader ./my-app
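A more explicit variant that I would expect to force TCP is sketched below (pml ob1, the tcp/self/vader BTLs and osc pt2pt are my assumptions about the right component names, not something I have confirmed as the correct combination):

mpirun --mca pml ob1 --mca btl tcp,self,vader --mca osc pt2pt ./my-app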
The application still achieves an actual throughput of 40-50 Gbps, yet our network has 25 Gigabit Ethernet and 100 Gb/s InfiniBand. Since that exceeds what the Ethernet link can deliver, I am now convinced my instructions are being ignored and InfiniBand is being used.
In fact, the actual throughput is identical (40-50 Gbps) to what I get with the following command:
mpirun --mca pml ucx ./my-app
However, when I use e.g. NetPIPE, which relies on point-to-point calls, I do get exactly the expected actual throughput: ~90 Gbps with --mca pml ucx and ~16 Gbps with --mca pml ^ucx, which fits perfectly with the underlying hardware.
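For anyone wanting to reproduce those point-to-point numbers, the NetPIPE runs boil down to something like this (a sketch; the NPmpi binary name is what a standard NetPIPE MPI build produces, and the placement option is a placeholder to put the two ranks on different nodes):

mpirun -np 2 --map-by node --mca pml ucx ./NPmpi
mpirun -np 2 --map-by node --mca pml ^ucx ./NPmpi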
So it seems that one-sided communication is simply not compatible with the Ethernet card / sockets.
I tried to get more output via the various verbosity parameters, but I am struggling to see a definitive answer in it.
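In case it helps, the kind of verbosity I raised was along these lines (assuming the standard per-framework verbose MCA parameters; the level 100 is arbitrary):

mpirun --mca pml ^ucx --mca btl ^vader --mca btl_base_verbose 100 --mca osc_base_verbose 100 ./my-app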
On the other hand, I tried compiling Open MPI with UCX disabled. In this case, my application hangs completely, seemingly inside MPI_Win_flush:
#0 0x000040000832be3c in __GI___poll (fds=0xaaaaf30549f0, nfds=4, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1 0x0000400008994b44 in ?? () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#2 0x0000400008990140 in event_base_loop () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#3 0x000040000873dcfc in opal_progress_events.isra () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#4 0x000040000873de54 in opal_progress () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#5 0x000040000a205c2c in ompi_osc_pt2pt_flush_lock () from /home/kdichev/openmpi-4.1.4/build-arm/lib/openmpi/mca_osc_pt2pt.so
#6 0x0000400007f252b0 in PMPI_Win_flush () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libmpi.so.40
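For context, the one-sided pattern in question is essentially the standard passive-target lock/put/flush sequence; a minimal sketch of it (not my actual application; message size and target rank are placeholders) looks like this:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                       /* placeholder message size */
    double *base = calloc(count, sizeof(double));    /* window memory on every rank */
    MPI_Win win;
    MPI_Win_create(base, count * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0 && size > 1) {
        double *src = calloc(count, sizeof(double));
        /* passive-target epoch: lock rank 1, put data, flush to completion */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(src, count, MPI_DOUBLE, 1, 0, count, MPI_DOUBLE, win);
        MPI_Win_flush(1, win);                       /* this is where the hang shows up */
        MPI_Win_unlock(1, win);
        free(src);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}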
I know that asking for Ethernet with MPI one-sided on an InfiniBand cluster is unusual; I just want some reference numbers. Still, isn't Open MPI supposed to work in this scenario too, converting the one-sided calls into some sort of active messages that issue point-to-point calls on the remote side and ultimately perform two-sided communication? In any case, I can't get this to work.
Any clarification would be much appreciated!