
MPI one-sided applications work as expected via the Infiniband adapter, but not via the Ethernet adapter #13128

Open
@KADichev

Description

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

I am unable to verify that my MPI one-sided applications can use TCP sockets/Ethernet instead of Infiniband on a cluster that has both Ethernet and Infiniband adapters/switches. I either get Infiniband-level performance no matter how I try to disable Infiniband, or I get hangs when I compile Open MPI with Infiniband support disabled. These issues do not manifest for codes using two-sided MPI.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • The IB-enabled Open MPI: pre-packaged openmpi-4.1.7a from the MLNX_OFED package on the NVIDIA website
  • The IB-disabled Open MPI: openmpi-4.1.4 built from source with ../configure --without-ucx

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
    Ubuntu 22.04.4 LTS
  • Computer hardware:
    ARM nodes with Ethernet cards and Mellanox Infiniband cards (>4 years old)
  • Network type:
    Both Ethernet and Mellanox Infiniband

Details of the problem

I am unable to run MPI one-sided applications over sockets / the Ethernet card instead of Infiniband.
My attempt to disable Infiniband is as follows:
mpirun --mca pml ^ucx --mca btl ^vader ./my-app
The application achieves an effective throughput of 40-50 Gbps, yet our network has 25 Gigabit Ethernet and 100 Gb Infiniband. This convinces me that my settings are being ignored and Infiniband is still being used.
In fact, the throughput is identical (40-50 Gbps) to what I get with the following command:
mpirun --mca pml ucx ./my-app

However, when I use e.g. NetPipe, which relies on point-to-point calls, I get exactly the throughput I expect: ~90 Gbps with --mca pml ucx and ~16 Gbps with --mca pml ^ucx, which fits the underlying hardware perfectly.
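
For reference, a NetPipe-style measurement is essentially a two-sided ping-pong. The following is a minimal sketch I wrote purely for illustration (not NetPipe itself; message size and iteration count are arbitrary, and it must be run with at least 2 ranks):

/* Two-sided ping-pong sketch -- the kind of traffic that behaves as
 * expected over both the Ethernet and the Infiniband path. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* 8 MB messages */
    double *buf = calloc(count, sizeof(double));

    double t0 = MPI_Wtime();
    for (int iter = 0; iter < 100; iter++) {
        if (rank == 0) {
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round trip: %.3f ms\n", (t1 - t0) * 1000.0 / 100);

    free(buf);
    MPI_Finalize();
    return 0;
}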

So it seems that one-sided communication simply does not work over the Ethernet card / sockets.

I tried to get more output via the verbose flags, but I am struggling to see a definitive answer.
On the other hand, I tried compiling Open MPI with UCX disabled. In this case, my application hangs completely, apparently in MPI_Win_flush:

#0  0x000040000832be3c in __GI___poll (fds=0xaaaaf30549f0, nfds=4, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x0000400008994b44 in ?? () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#2  0x0000400008990140 in event_base_loop () from /lib/aarch64-linux-gnu/libevent_core-2.1.so.7
#3  0x000040000873dcfc in opal_progress_events.isra () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#4  0x000040000873de54 in opal_progress () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libopen-pal.so.40
#5  0x000040000a205c2c in ompi_osc_pt2pt_flush_lock () from /home/kdichev/openmpi-4.1.4/build-arm/lib/openmpi/mca_osc_pt2pt.so
#6  0x0000400007f252b0 in PMPI_Win_flush () from /home/kdichev/openmpi-4.1.4/build-arm/lib/libmpi.so.40
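
For context, the one-sided part of my application is essentially a passive-target put/flush loop. The following is a minimal sketch I wrote for illustration (not the literal application code; buffer size, peer choice, and iteration count are placeholders), launched e.g. with the mpirun lines above:

/* One-sided (RMA) sketch: lock_all once, then repeatedly MPI_Put and
 * MPI_Win_flush to a peer -- the flush is the analogue of where the
 * backtrace above sits (ompi_osc_pt2pt_flush_lock under PMPI_Win_flush). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                 /* 8 MB window per rank */
    double *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)count * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    double *src = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) src[i] = (double)rank;

    int peer = (rank + 1) % size;

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    for (int iter = 0; iter < 100; iter++) {
        MPI_Put(src, count, MPI_DOUBLE, peer, 0, count, MPI_DOUBLE, win);
        MPI_Win_flush(peer, win);              /* analogous flush point */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("done\n");

    free(src);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}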

I know that asking for Ethernet for MPI one-sided communication on an Infiniband cluster is unusual; I just want some reference numbers. Still, isn't Open MPI supposed to work in this scenario too, converting one-sided calls into some sort of active messages that issue point-to-point calls on the remote side, ultimately doing two-sided communication? In any case, I can't get this to work.

Any clarification would be much appreciated!
