
Open MPI 5 doesn't work after update to macOS 15.3.1, 4.1 does #13129

mathomp4 opened this issue Mar 6, 2025 · 31 comments

mathomp4 commented Mar 6, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed by hand via:

ml clang-gfortran/14

mkdir build-clang-gfortran-14 && cd build-clang-gfortran-14

../configure --disable-wrapper-rpath --disable-wrapper-runpath \
  CC=clang CXX=clang++ FC=gfortran-14 \
  --with-hwloc=internal --with-libevent=internal --with-pmix=internal \
  --prefix=$HOME/installed/Compiler/clang-gfortran-14/openmpi/5.0.7 |& tee configure.clang-gfortran-14.log

mv config.log config.clang-gfortran-14.log
make -j6 |& tee make.clang-gfortran-14.log
make install |& tee makeinstall.clang-gfortran-14.log

Please describe the system on which you are running

  • Operating system/version: macOS 15.3.1
  • Computer hardware: Mac Studio M1 Max
  • Network type: Ethernet (though only running locally)

Details of the problem

Recently, I updated my Mac Studio from macOS 14 to macOS 15.3.1. The upgrade seemed fine until I tried to run some MPI code and found it failing. I then tried a HelloWorld program, and that fails too. But that was with an Open MPI 5.0.5 built back in the Sonoma days, so I grabbed the 5.0.7 tarfile and built it as shown above (exactly how I built 5.0.5), but no joy. When I try to run helloworld I get this:

❯ mpirun --version
mpirun (Open MPI) 5.0.7

Report bugs to https://www.open-mpi.org/community/help/
❯ mpifort -o helloWorld.mpi2.exe helloWorld.mpi2.F90
❯ mpirun -np 2 ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[gs6101-alderaan-198120226026:00000] *** An error occurred in MPI_Init
[gs6101-alderaan-198120226026:00000] *** reported by process [3839623169,1]
[gs6101-alderaan-198120226026:00000] *** on a NULL communicator
[gs6101-alderaan-198120226026:00000] *** Unknown error
[gs6101-alderaan-198120226026:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gs6101-alderaan-198120226026:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-gs6101-alderaan-198120226026-45102@1,1]
   Exit code:    14
--------------------------------------------------------------------------

Now, I looked around the internet (and this issue tracker) and found things like #12273, which suggested --pmixmca ptl_tcp_if_include lo0, but:

❯ mpirun --pmixmca ptl_tcp_if_include lo0 -np 2 ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
The PMIx server's listener thread failed to start. We cannot
continue.
--------------------------------------------------------------------------

So not that. But other threads also said to try --mca btl_tcp_if_include lo0, so:

❯ mpirun --mca btl_tcp_if_include lo0 -np 2 ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[gs6101-alderaan-198120226026:00000] *** An error occurred in MPI_Init
[gs6101-alderaan-198120226026:00000] *** reported by process [4258004993,1]
[gs6101-alderaan-198120226026:00000] *** on a NULL communicator
[gs6101-alderaan-198120226026:00000] *** Unknown error
[gs6101-alderaan-198120226026:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gs6101-alderaan-198120226026:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-gs6101-alderaan-198120226026-45932@1,1]
   Exit code:    14
--------------------------------------------------------------------------

I also tried a few other things, in various combinations, that I had commented out in a modulefile:

-- setenv("OMPI_MCA_btl_tcp_if_include","lo0")
-- setenv("OMPI_MCA_io","ompio")
-- setenv("OMPI_MCA_btl","^tcp")

but nothing helped.

But on a supercomputer cluster I work on, Open MPI 5 has just never worked for us, so I thought "Let me try Open MPI 4.1." I grabbed 4.1.8 and built it as:

../configure --disable-wrapper-rpath --disable-wrapper-runpath \
  CC=clang CXX=clang++ FC=gfortran-14 \
  --with-hwloc=internal --with-libevent=internal --with-pmix=internal \
  --prefix=$HOME/installed/Compiler/clang-gfortran-14/openmpi/4.1.8 |& tee configure.clang-gfortran-14.log

(exactly the same as for 5.0.7, save for where I'm installing it) and then:

❯ mpirun --version
mpirun (Open MPI) 4.1.8

Report bugs to http://www.open-mpi.org/community/help/
❯ mpifort -o helloWorld.mpi2.exe helloWorld.mpi2.F90
❯ mpirun -np 2 ./helloWorld.mpi2.exe
Compiler Version: GCC version 14.2.0
MPI Version: 3.1
MPI Library Version: Open MPI v4.1.8, package: Open MPI [email protected] Distribution, ident: 4.1.8, repo rev: v4.1.8, Feb 04, 2025
Process    0 of    2 is on gs6101-alderaan-198120226026.ndc.nasa.gov
Process    1 of    2 is on gs6101-alderaan-198120226026.ndc.nasa.gov

So, good news: I can keep working with Open MPI 4.1. Bad news: I'm unsure why Open MPI 5 stopped working all of a sudden. It did work before the OS update.

rhc54 (Contributor) commented Mar 6, 2025

Works fine for me on macOS Sequoia 15.3.1:

$ mpirun -n 2 ./hello_c
Hello, world, I am 1 of 2, (Open MPI v5.0.8a1, package: Open MPI [email protected] Distribution, ident: 5.0.8a1, repo rev: v5.0.7-12-gc1d3071a86, Unreleased developer copy, 148)
Hello, world, I am 0 of 2, (Open MPI v5.0.8a1, package: Open MPI [email protected] Distribution, ident: 5.0.8a1, repo rev: v5.0.7-12-gc1d3071a86, Unreleased developer copy, 148)
$

Didn't need to set anything. Did you remember to create a local tmp directory (e.g., $HOME/tmp) and point your TMPDIR at it (e.g., export TMPDIR=$HOME/tmp)? Apple introduced those ridiculously long tmpdir paths years ago and they mess things up, so you need the shorter path to make things work. May or may not be the problem here - just something to remember.
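
Spelled out, that setup is something like the following (the directory name is just an example):

  # create a short, user-owned tmp directory and point TMPDIR at it
  mkdir -p $HOME/tmp
  export TMPDIR=$HOME/tmp
  mpirun -n 2 ./hello_c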

Open MPI 5 has just never worked for us

Has someone filed a ticket about it? If so, can you cite it here?

bosilca (Member) commented Mar 7, 2025

Puzzling, especially as most OMPI developers are using macOS, generally an up-to-date version. I agree with @rhc54: 5.0 and main run out of the box for me on 15.3.1.

Can you run your mpirun command adding --mca pml_base_verbose 100 --mca btl_base_verbose 100 to see if we get some meaningful output from it?

ggouaillardet (Contributor) commented:

@mathomp4 The error is at the MPI layer.

can you confirm

mpirun -np 2 --pmixmca ptl_tcp_if_include lo0 hostname

works as expected?

You might have more luck with

mpirun -np 2 --pmixmca ptl_tcp_if_include lo0 --mca btl self,sm ./helloWorld.mpi2.exe

You also want to double-check that mpirun and mpifort are provided by the same package.

mathomp4 (Author) commented Mar 7, 2025

First:

❯ which mpirun
/Users/mathomp4/installed/Compiler/clang-gfortran-14/openmpi/5.0.7/bin/mpirun
…/MPITests on  main with zsh at 08:41:35 AM
❯ which mpifort
/Users/mathomp4/installed/Compiler/clang-gfortran-14/openmpi/5.0.7/bin/mpifort
❯ mpirun --version
mpirun (Open MPI) 5.0.7

Report bugs to https://www.open-mpi.org/community/help/

So these are the mpirun and mpifort that I built.

Also, I have no OMPI_ env variables set:

❯ env | grep OMPI
LMOD_FAMILY_COMPILER=clang-gfortran
LMOD_FAMILY_COMPILER_VERSION=14

and I have TMPDIR set to /tmp

❯ echo $TMPDIR
/tmp

but nothing changes if I put it at $HOME/tmp

As for tests per @ggouaillardet, the first does not work:

❯ mpirun -np 2 --pmixmca ptl_tcp_if_include lo0 hostname
--------------------------------------------------------------------------
The PMIx server's listener thread failed to start. We cannot
continue.
--------------------------------------------------------------------------

while a boring one does:

❯ mpirun -np 2 hostname
gs6101-Dagobah.ndc.nasa.gov
gs6101-Dagobah.ndc.nasa.gov

And since the hostname call fails, the other will as well:

❯ mpirun -np 2 --pmixmca ptl_tcp_if_include lo0 --mca btl self,sm ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
The PMIx server's listener thread failed to start. We cannot
continue.
--------------------------------------------------------------------------

I did check to make sure I don't have Homebrew's pmix installed and I do not.
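
(For anyone repeating that check, something along these lines works; the formula name pmix is an assumption about how Homebrew packages it:)

  brew list --versions pmix   # prints nothing if the formula is not installed
  which -a pmix_info          # shows any pmix_info binaries already on the PATH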

And the configuration that configure spit out is pretty standard for my builds:

Open MPI configuration:
-----------------------
Version: 5.0.7
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: no (disabled)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: internal
libevent: internal
Open UCC: no
pmix: internal
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: no
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: no
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: no
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

ggouaillardet (Contributor) commented:

@mathomp4 I am sorry I misread your previous message.

The command should be

mpirun -np 2 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca btl self,sm ./helloWorld.mpi2.exe

mathomp4 (Author) commented Mar 7, 2025

@ggouaillardet Here is the (very verbose!) output:

❯ mpirun -np 2 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca btl self,sm ./helloWorld.mpi2.exe
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: registering framework btl components
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component self
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component self register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component sm
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: registering framework btl components
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component self
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component self register function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component sm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component sm register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: opening btl components
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component self
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: component self open function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component sm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: component sm open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component sm register function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: opening btl components
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component self
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: component self open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component sm
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: component sm open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: registering framework pml components
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component cm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: registering framework pml components
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component cm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component cm register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component monitoring
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component cm register function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component monitoring
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component monitoring register function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component monitoring register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component ob1
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component ob1
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component ob1 register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component ob1 register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: found loaded component v
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: found loaded component v
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_register: component v register function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: opening pml components
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component cm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_register: component v register function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: opening pml components
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component cm
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: close: component cm closed
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: close: unloading component cm
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component monitoring
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: component monitoring open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component ob1
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: component ob1 open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: found loaded component v
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: close: component cm closed
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: close: unloading component cm
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component monitoring
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: component monitoring open function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component ob1
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: component ob1 open function successful
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: found loaded component v
[gs6101-Dagobah.ndc.nasa.gov:80341] mca: base: components_open: component v open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: components_open: component v open function successful
[gs6101-Dagobah.ndc.nasa.gov:80342] select: component monitoring not in the include list
[gs6101-Dagobah.ndc.nasa.gov:80342] select: initializing pml component ob1
[gs6101-Dagobah.ndc.nasa.gov:80342] select: initializing btl component self
[gs6101-Dagobah.ndc.nasa.gov:80342] select: init of component self returned success
[gs6101-Dagobah.ndc.nasa.gov:80342] select: initializing btl component sm
[gs6101-Dagobah.ndc.nasa.gov:80342] select: init of component sm returned success
[gs6101-Dagobah.ndc.nasa.gov:80342] select: init returned priority 20
[gs6101-Dagobah.ndc.nasa.gov:80342] select: component v not in the include list
[gs6101-Dagobah.ndc.nasa.gov:80342] selected ob1 best priority 20
[gs6101-Dagobah.ndc.nasa.gov:80342] select: component ob1 selected
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: close: unloading component monitoring
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: close: component v closed
[gs6101-Dagobah.ndc.nasa.gov:80342] mca: base: close: unloading component v
[gs6101-Dagobah.ndc.nasa.gov:80342] check:select: PML modex for process [[43198,1],0] not found
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[gs6101-Dagobah:00000] *** An error occurred in MPI_Init
[gs6101-Dagobah:00000] *** reported by process [2831024129,1]
[gs6101-Dagobah:00000] *** on a NULL communicator
[gs6101-Dagobah:00000] *** Unknown error
[gs6101-Dagobah:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gs6101-Dagobah:00000] ***    and MPI will try to terminate your MPI job as well)
[gs6101-Dagobah.ndc.nasa.gov:80341] select: component monitoring not in the include list
[gs6101-Dagobah.ndc.nasa.gov:80341] select: initializing pml component ob1
[gs6101-Dagobah.ndc.nasa.gov:80341] select: initializing btl component self
[gs6101-Dagobah.ndc.nasa.gov:80341] select: init of component self returned success
[gs6101-Dagobah.ndc.nasa.gov:80341] select: initializing btl component sm
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-gs6101-Dagobah-80339@1,1]
   Exit code:    14
--------------------------------------------------------------------------

ggouaillardet (Contributor) commented:

Thanks for the logs.

This explains the error message, but I have no idea how you got there.

[gs6101-Dagobah.ndc.nasa.gov:80342] check:select: PML modex for process [[43198,1],0] not found

I rebuilt this on macOS 15.3.1 on M4 but could not reproduce the issue.
If you are using homebrew, did you try using Open MPI provided by brew?

FWIW, there is no firewall on my box, not sure if this has any impact though.

mathomp4 (Author) commented Mar 7, 2025

@ggouaillardet Well, I have usually avoided installing Open MPI via brew or other package managers, as I currently have the following Fortran compilers on this laptop:

  • gfortran
  • NAG Fortran
  • Lfortran
  • LLVM Flang

and I build and maintain my MPI installs outside the package manager so I know I'm using the right one. My guess is that whatever Open MPI brew would install/build wouldn't have the right .mod files for use mpi, right? (Or maybe that's not an issue? It seems like that might be near the @jeffhammond MPI ABI level of things...)

Now, this is not a personal laptop (and I don't even have admin powers, so even my homebrew is in userland), so I do have firewalls and things like SentinelOne running. When MPI does work, I get the "Accept incoming connections" dialog box cascade that I've learned to "tune out".

But as I say, Open MPI 4.1 does work.

One thing I might try is reinstalling, well, everything in brew. As I said, this happened with an update from macOS 14 to 15, and I already found I had to reinstall things like automake and autoconf.

Do you know if there might be something from brew that Open MPI 5 would use during configure/build time that Open MPI 4 wouldn't?
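
(One quick way to check that, assuming Homebrew lives in its default /opt/homebrew prefix on Apple Silicon, is to grep the saved configure/config logs for that prefix:)

  grep -n '/opt/homebrew' configure.clang-gfortran-14.log | head
  grep -n '/opt/homebrew' config.clang-gfortran-14.log | head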

Zhenguang-Huang commented:

I have a similar issue here. I upgraded the system to macOS 15.3.1 and re-installed everything with MacPorts (port). The Open MPI version is 5.0.6_1. In addition, I get this error message:

shmem: mmap: an error occurred while determining whether or not /var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/2991259649/sm_segment.local.504.b24b0001.0 could be created.

Is this related?

Everything was fine with macOS 14. Also, I need to add the following flags:

-I/opt/local/include/openmpi-mp -L/opt/local/lib/openmpi-mp

during compilation, otherwise the error "Cannot open INCLUDE file "mpif.h": No such file or directory" occurs.

Thanks.


ggouaillardet (Contributor) commented:

@Zhenguang-Huang The first error message suggests a path was truncated.

Do

export TMPDIR=/tmp

and then try again.

rhc54 (Contributor) commented Mar 12, 2025

I don't think you can do /tmp on the Mac - I believe you have to create a local tmpdir (e.g., $HOME/tmp) and then point TMPDIR to it.

ggouaillardet (Contributor) commented:

FWIW, my sm_segment files are in
$TMPDIR/prterun.<hostname>.<pid>.<uid>/1/sm_segment.<hostname>.<uid>.<extra>
but yours is in
$TMPDIR/ompi.<hostname>.<uid>/...

Are you sure you are using mpirun from Open MPI 5?
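
(A quick way to check is something like:)

  which -a mpirun    # every mpirun on the PATH, in order
  mpirun --version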

Zhenguang-Huang commented Mar 12, 2025

@ggouaillardet @rhc54 Thank you for the suggestions.

I used tcsh. Initially echo $TMPDIR shows:

/var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T/

after set TMPDIR=/Users/user/tmp_mpi, it shows:

/Users/user/tmp_mpi

But still, mpirun -np 2 helloworld.exe fails as:

shmem: mmap: an error occurred while determining whether or not /var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/2534801409/sm_segment.local.504.97160001.0 could be created.

I just read another thread and tried env OMPI_MCA_btl_sm_backing_directory=/Users/user/tmp_mpi mpirun -np 2 ./helloworld.exe; the shmem error disappeared, but the run still failed:

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[local:00000] *** An error occurred in MPI_Init_thread
[local:00000] *** reported by process [3223781377,1]
[local:00000] *** on a NULL communicator
[local:00000] *** Unknown error
[local:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[local:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-local-77413@1,1]
   Exit code:    14
--------------------------------------------------------------------------

In addition, mpirun -np 1 helloworld.exe runs without any error.

I believe my mpirun is from Open MPI 5 as mpirun -version shows:

mpirun (Open MPI) 5.0.6

Report bugs to https://www.open-mpi.org/community/help/

ggouaillardet (Contributor) commented:

I still do not understand why the directory name starts with ompi and not prterun.
@rhc54 can you please shed some light on where the temporary directory name comes from?

Also, setting $TMPDIR works in my environment.

FWIW, I built Open MPI with --with-pmix=internal in order to force the internal PMIx.
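
(For reference, one way to see how an existing install was configured is to grep the ompi_info output; the exact wording of the matching lines varies by version:)

  ompi_info | grep -i -e pmix -e 'Configure command'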

rhc54 (Contributor) commented Mar 12, 2025

I seem to recall pointing this out once before, but I don't believe anyone fixed it. IIRC, the mmap/shmem component is creating its own directory name and NOT using the opal_process_info prefix. So it is winding up somewhere outside of the session directory tree. I believe what happened is that someone hardcoded the legacy session directory name, and so it was never caught.

Zhenguang-Huang commented:

@ggouaillardet I am using port to install with port install openmpi +gcc +fortran; how should I add --with-pmix=internal correctly? I checked the instructions on the Open MPI website, and it seems this option is supported only when building from source.

Thanks again.

ggouaillardet (Contributor) commented:

From what I understand, mmap/shmem uses the absolute path it was provided with.

From btl/sm:

    if (0 == access("/dev/shm", W_OK)) {
        mca_btl_sm_component.backing_directory = "/dev/shm";
    } else {
        mca_btl_sm_component.backing_directory = opal_process_info.job_session_dir;
    }                  

This is consistent with what I observe on my Mac: there is no /dev/shm, so the session dir (which is in $TMPDIR) is used even on macOS.

That might be altered via the shmem_mmap_relocate_backing_file option, but it defaults to 0.
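
(If the backing directory itself is suspect, one test is to point it at a short, writable path; this is essentially what OMPI_MCA_btl_sm_backing_directory does elsewhere in this thread:)

  mkdir -p $HOME/tmp_mpi
  mpirun --mca btl_sm_backing_directory $HOME/tmp_mpi -np 2 ./helloworld.exe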

ggouaillardet (Contributor) commented:

@Zhenguang-Huang Sorry, I am not familiar with port.
What if you simply download the sources and compile from the command line with the usual configure ... && make install?

In case your system changed the default settings, can you try

mpirun --mca shmem_mmap_relocate_backing_file 0 ...

and see how it goes?

Zhenguang-Huang commented Mar 12, 2025

@ggouaillardet I am not familiar with installing from source, so I followed https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/quickstart.html:

cd openmpi-5.0.6
mkdir build
../configure --with-pmix=internal

and it returned:

============================================================================
== Event libraries
============================================================================
configure: Searching for libevent in default search paths
checking for libevent cppflags... 
checking for libevent ldflags... 
checking for libevent libs... -levent_core -levent_pthreads
checking for libevent static libs... -levent_core -levent_pthreads
checking for event.h... no
checking for event_getcode4name in -levent... no
checking will libevent support be built... no
checking for libev pkg-config name... libev
checking if libev pkg-config module exists... no
configure: Searching for libev in default search paths
checking for libev cppflags... 
checking for libev ldflags... 
checking for libev libs... -lev
checking for libev static libs... -lev
checking for event.h... (cached) no
checking will libev support be built... no
configure: WARNING: Either libevent or libev support is required, but neither
configure: WARNING: was found. Please use the configure options to point us
configure: WARNING: to where we can find one or the other library
configure: error: Cannot continue
configure: ===== done with 3rd-party/openpmix configure =====
configure: error: Could not find viable pmix build.

It stopped with "Could not find viable pmix build." I also don't understand the warning about not finding either libevent or libev: libevent 2.1.12 is installed from MacPorts and event.h is located at /opt/local/include/event.h, but configure just cannot find it.

ggouaillardet (Contributor) commented:

This is a bit odd. Anyway, this should do the trick:

configure --with-libevent=internal --with-hwloc=internal --with-pmix=internal ...
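
(Spelled out as a full sequence; the build directory and install prefix below are just examples:)

  cd openmpi-5.0.6
  mkdir build && cd build
  ../configure --with-libevent=internal --with-hwloc=internal --with-pmix=internal \
      --prefix=$HOME/opt/openmpi-5.0.6
  make -j4
  make install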

ggouaillardet (Contributor) commented:

@rhc54 I was puzzled by the jf.0 directory name, and here is what I found.

From ompi/runtime/ompi_rte.c:

    /* retrieve job-session directory info */
    val = NULL;
    OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_NSDIR, &pname, &val, PMIX_STRING);
    if (PMIX_SUCCESS == rc && NULL != val) {
        opal_process_info.job_session_dir = val;
        val = NULL;  // protect the string
    } else {
        /* we need to create something */
        rc = _setup_job_session_dir(&opal_process_info.job_session_dir);
        if (OPAL_SUCCESS != rc) {
            error = "job session directory";
            goto error;
        }
    }

jf.0 is set by _setup_job_session_dir() when PMIX_NSDIR cannot be retrieved.
In my environment, PMIX_NSDIR can be retrieved.

As a proof of concept, I forced rc = PMIX_ERROR and was able to reproduce the error (by forcing btl/sm; otherwise Open MPI falls back to btl/tcp and the program runs).

Bottom line, I am not sure whether something goes wrong in PMIx, or whether the user's Open MPI uses an external and older PMIx.

rhc54 (Contributor) commented Mar 12, 2025

It is actually being set right here, in ompi/runtime/ompi_rte.c:

static int _setup_job_session_dir(char **sdir)
{
    int rc;
    /* get the effective uid */
    uid_t uid = geteuid();

    if (0 > opal_asprintf(sdir, "%s/ompi.%s.%lu/jf.0/%u",
                          opal_process_info.top_session_dir,
                          opal_process_info.nodename,
                          (unsigned long)uid,
                          opal_process_info.my_name.jobid)) {
        opal_process_info.job_session_dir = NULL;
        return OPAL_ERR_OUT_OF_RESOURCE;
    }

This code gets executed when the application proc is unable to connect back to mpirun for some reason, and so it thinks it is a singleton. You might want to check whether the user has networking enabled and a "live" interface. We at least need the ability to do a local loopback.
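
(A quick way to confirm the loopback interface is up and answering:)

  ifconfig lo0          # should show 127.0.0.1 with flags including UP,LOOPBACK
  ping -c 1 127.0.0.1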

Zhenguang-Huang commented Mar 12, 2025

@ggouaillardet I managed to install it with configure --with-libevent=internal --with-hwloc=internal --with-pmix=internal .... And mpirun --mca shmem_mmap_relocate_backing_file 0 -np 2 helloworld.exe still showed:

[local:10603] shmem: mmap: an error occurred while determining whether or not /var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/296157185/sm_segment.local.504.11a70001.0 could be created.
[local:10604] shmem: mmap: an error occurred while determining whether or not /var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/296157185/sm_segment.local.504.11a70001.0 could be created.
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[local:00000] *** An error occurred in MPI_Init_thread
[local:00000] *** reported by process [296157185,1]
[local:00000] *** on a NULL communicator
[local:00000] *** Unknown error
[local:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[local:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-local-10602@1,1]
   Exit code:    14

I further tested adding env OMPI_MCA_btl_sm_backing_directory=/Users/user/tmp_mpi; the shmem error disappeared, but the run still failed:

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[local:00000] *** An error occurred in MPI_Init_thread
[local:00000] *** reported by process [4120313857,1]
[local:00000] *** on a NULL communicator
[local:00000] *** Unknown error
[local:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[local:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-local-10801@1,1]
   Exit code:    14
--------------------------------------------------------------------------

After changing -np 2 to -np 1, the code works. One thing I also noticed is that, in the past, when a new executable was created, mpirun would cause the system to pop up windows asking if incoming connections are allowed. But these windows do not show up now.

rhc54 (Contributor) commented Mar 12, 2025

I suspect the problem @mathomp4 and @Zhenguang-Huang are having is that they have networking disabled on their Macs. OMPI v5 needs at least one network interface to be enabled, even if it is only the loopback interface.

OMPI v4 had a fallback communication channel that did not require a network interface to be available. This was removed for a variety of reasons, but it would explain why you can run with OMPI v4 and not with v5. I can explore bringing it back - but no promises.

If you folks could confirm that you do or do not have an available interface, we could all probably avoid a lot more cycles chasing this down.

@ggouaillardet You might want to generate an OMPI PR with something like the following to help reduce the confusion - IIRC, we've seen this problem multiple times now:

diff --git a/ompi/runtime/help-mpi-runtime.txt b/ompi/runtime/help-mpi-runtime.>
index c20b723b3e..331555eb3e 100644
--- a/ompi/runtime/help-mpi-runtime.txt
+++ b/ompi/runtime/help-mpi-runtime.txt
@@ -122,3 +122,14 @@ to execute the job.
 [no-pmix-but]
 No PMIx server was reachable, but a PMI1/2 was detected.
 If srun is being used to launch application,  %d singletons will be started.
+#
+[invalid-singleton]
+At least one application process was unable to connect to its local
+PMIx server, but detected it was launched by a server and is not
+a singleton as it was assigned a non-zero rank. This typically indicates
+the lack of an available network interface by which the application
+process can connect back to the server.
+
+Please check to ensure you have at least one active interface on the node.
+This can be a simple loopback device - no external network access is
+required.
diff --git a/ompi/runtime/ompi_rte.c b/ompi/runtime/ompi_rte.c
index 4e7719c73a..7c78eb4e31 100644
--- a/ompi/runtime/ompi_rte.c
+++ b/ompi/runtime/ompi_rte.c
@@ -583,6 +583,14 @@ int ompi_rte_init(int *pargc, char ***pargv)
         if (PMIX_ERR_UNREACH == ret) {
             bool found_a_pmi = false;
             int n = 0;
+
+            /* if we have a rank greater than 0, then we probably are
+             * part of a failed multi-proc job */
+            if (0 < opal_process_info.myprocid.rank) {
+                opal_show_help("help-mpi-runtime.txt", "invalid-singleton", tr>
+                return OPAL_ERR_SILENT;
+            }
+
             /* if we are in a PMI environment with two tasks or more,
              * we probably do not want to start singletons */
             while (pmi_sentinels[n] != NULL) {

mathomp4 (Author) commented Mar 13, 2025

@rhc54 On the box (Mac Studio) on which I get this error, I am currently connected to the internet via Ethernet and also via Wi-Fi. Heck, I am writing this comment from that box. So I mean, some sort of network must be available. 🤷🏼

Now, if I drill into my macOS settings, I do see in the Firewall → Options... dialog:

prte     🔴 Block incoming connections

Could that do it? If so, I have no way to change that. That setting is configured via a profile by NASA IT/security and I'm guessing I'd have little sway to change that. I mean, I can try, but...ooh boy.

If that is the cause, then it might be that at least NASA macOS machines can never use Open MPI 5 unless we can get IT to relax that.
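
(For anyone hitting this later: the application firewall state can be inspected from the command line, though listing or changing it may need the admin rights that are missing here:)

  /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
  /usr/libexec/ApplicationFirewall/socketfilterfw --listapps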

mathomp4 (Author) commented:

One more note. I built Open MPI 5.0.6 via Spack with the same settings I configured with above (well, along with a bug fix, see spack/spack#49463), and I get the same error with Hello World. So I guess the "good" news is that it's reproducible.

rhc54 (Contributor) commented Mar 13, 2025

Yeah, that is likely the source of the problem. I agree that getting any org-level IT to relax it is a non-starter. Let me investigate how hard it would be to restore the usock support. It avoids the network connection issue.

rhc54 (Contributor) commented Mar 13, 2025

You know, what I really don't understand is what this has to do with the OS upgrade - unless your IT people added that firewall setting when you upgraded the OS. Otherwise, this makes no sense as the connection need has nothing to do with the OS level. If that firewall rule was present under Sonoma, then OMPI v5 would have failed there too.

mathomp4 (Author) commented Mar 13, 2025

@rhc54 Let me look around. I'm sort of my group's "beta tester", so I updated to macOS 15 first; the others are still on macOS 14. This could have come in with that update... One moment please.

rhc54 (Contributor) commented Mar 13, 2025

Kewl - thanks! FWIW - I checked about restoring the usock support and the short answer is "doable, but not easy". Probably take me a month or so to complete it (retirement => slow). Would have to be in PMIx v6 as it is too disruptive to bring back to PMIx v5 (which is what OMPI v5 is using). However, once done, you could build a separate PMIx v6 install and then link OMPI v5 to it - and you'd be off and running.

I can update you here when that is ready. Meantime, if your IT dept is able/willing, you could ask if they can relax that connection rule a tad and allow loopback operations while still excluding external inbound connections. Don't know if Mac's firewall is that sophisticated or not - but if they are willing, they could perhaps take a look.

mathomp4 (Author) commented Mar 13, 2025

Okay. I looked at a couple of macOS 14 folks and they don't have it. I haven't yet found someone else on macOS 15 to see if it is there, but my current suspicion is it won't be.

Why? Well, on my colleagues' macOS machines, the Firewall was filled with odd apps. Indeed, they are apps we build/work on that would have triggered the "Accept Incoming Connections" dialog cascade we always get on our machines.

What I'm thinking is that when I updated to macOS 15, my Firewall was maybe reset. I then tried Open MPI 5 for the first time and, in my haste/habit, I pressed "Deny" on that dialog.[1] And maybe we boring users have some limited power to Deny apps, but no power to Allow apps.

I'm going to consult with our local sysadmins. It's possible they have some control over whatever is filling in the Firewall entries. I'll let you know.

Footnotes

  1. I mean, in the past, Deny vs Allow never seemed to do anything, so I just pressed something...but maybe that is not the thing to do.
