Open MPI 5 doesn't work after update to macOS 15.3.1, 4.1 does #13129
Comments
Works fine for me on macOS Sequoia 15.3.1:
$ mpirun -n 2 ./hello_c
Hello, world, I am 1 of 2, (Open MPI v5.0.8a1, package: Open MPI [email protected] Distribution, ident: 5.0.8a1, repo rev: v5.0.7-12-gc1d3071a86, Unreleased developer copy, 148)
Hello, world, I am 0 of 2, (Open MPI v5.0.8a1, package: Open MPI [email protected] Distribution, ident: 5.0.8a1, repo rev: v5.0.7-12-gc1d3071a86, Unreleased developer copy, 148)
$
Didn't need to set anything. Did you remember to create a local tmp directory (e.g., $HOME/tmp) and point your tmpdir at it (e.g., export TMPDIR=$HOME/tmp)? Apple started those ridiculously long tmpdir paths years ago that messed things up, so you need the shorter path to make things work. May or may not be the problem here - just something to remember.
Has someone filed a ticket about it? If so, can you cite it here? |
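To make the TMPDIR advice above concrete: one limit that long temporary-directory paths commonly run into is the fixed size of `sun_path` in `struct sockaddr_un`. Whether that is the limit Open MPI actually hits on macOS is an assumption here, since the thread does not say. The sketch below only compares the current `TMPDIR` against that size. The function names and the `suffix_len` parameter (room reserved for whatever gets appended to the directory) are hypothetical and not part of Open MPI.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/un.h>

/* Bytes available for a Unix-domain socket path on this platform,
 * terminating NUL included. */
size_t sun_path_capacity(void)
{
    return sizeof(((struct sockaddr_un *)0)->sun_path);
}

/* Return 1 if `dir` plus a suffix of `suffix_len` bytes (hypothetical:
 * whatever a program appends below the directory) plus the NUL still
 * fits in sun_path, 0 otherwise. */
int tmpdir_leaves_room(const char *dir, size_t suffix_len)
{
    return strlen(dir) + suffix_len + 1 <= sun_path_capacity();
}

/* Print a one-line report for the current TMPDIR and return the same
 * 1/0 answer as tmpdir_leaves_room(). */
int report_tmpdir(size_t suffix_len)
{
    const char *dir = getenv("TMPDIR");
    if (dir == NULL) {
        dir = "/tmp"; /* assumed fallback when TMPDIR is unset */
    }
    int ok = tmpdir_leaves_room(dir, suffix_len);
    printf("TMPDIR=%s (%zu bytes), sun_path holds %zu bytes: %s\n",
           dir, strlen(dir), sun_path_capacity(),
           ok ? "fits" : "too long");
    return ok;
}
```

Calling `report_tmpdir(40)` from a small `main` before and after `export TMPDIR=$HOME/tmp` shows how much headroom the shorter path buys. The value 40 is an arbitrary illustration.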
Puzzling, especially as most OMPI developers are using OSX, in general an up-to-date version. I agree with @rhc54: 5.0 and main run out of the box for me on 15.3.1. Can you run your mpirun command adding |
@mathomp4 The error is at the MPI layer. Can you confirm
works as expected? You might have more luck with
you also want to double check |
First:
So these are the Also, I have no
and I have
but nothing changes if I put it at As for tests per @ggouaillardet, the first does not work:
while a boring one does:
And since the
I did check to make sure I don't have Homebrew's And the configuration that
|
@mathomp4 I am sorry I misread your previous message. The command should be
|
@ggouaillardet Here is the (very verbose!) output:
|
Thanks for the logs. This explains the error message, but I have no idea how you got there.
I rebuilt this on macOS 15.3.1 on M4 but could not reproduce the issue. FWIW, there is no firewall on my box, not sure if this has any impact though. |
@ggouaillardet Well, I have usually avoided Open MPI (via
and I build and maintain my MPI outside so I know I'm using the right one. My guess is whatever Open MPI Now, this is not a personal laptop (and I don't even have admin powers, so even my homebrew is in userland), so I do have firewalls and things like SentinelOne running. When MPI does work, I get the "Accept incoming connections" dialog box cascade that I've learned to "tune out". But as I say, Open MPI 4.1 does work. One thing I might try is reinstalling, well, everything in Do you know if there might be something from |
I have a similar issue here. I upgraded the system to macOS 15.3.1 and re-installed everything with port. The OpenMPI version is 5.0.6_1. In addition, I have the error message: shmem: mmap: an error occurred while determining whether or not /var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/2991259649/sm_segment.local.504.b24b0001.0 could be created. Is this related? Everything was fine with macOS 14. Besides, I need to add the following flags: -I/opt/local/include/openmpi-mp -L/opt/local/lib/openmpi-mp during compilation otherwise the error "Cannot open INCLUDE file "mpif.h": No such file or directory" will occur. Thanks.
|
@Zhenguang-Huang the first error message suggests a path was truncated. do |
I don't think you can do |
FWIW, my Are you sure you are using |
@ggouaillardet @rhc54 Thank you for the suggestions. I used tcsh. Initially
after
But still,
I just read another thread, and tried
In addition, I believe my
|
I still do not understand why the directory starts with Also, setting FWIW, I built Open MPI with |
I seem to recall pointing this out once before, but I don't believe anyone fixed it. IIRC, the mmap/shmem component is creating its own directory name and NOT using the opal_process_info prefix. So it is winding up somewhere outside of the session directory tree. I believe what happened is that someone hardcoded the legacy session directory name, and so it was never caught. |
@ggouaillardet I am using port to install with Thanks again. |
from what I understand, from
This is consistent with what I observe on my mac: no That might be altered via the |
@Zhenguang-Huang sorry, I am not familiar with in case your system changed the default settings, can you try
|
@ggouaillardet I am not familiar with installing from source code, so I followed
and it returned:
It stopped with |
this is a bit odd, anyway, that should do the trick:
|
@rhc54 I was puzzled by the from
As a proof of concept, I forced Bottom line, I am not sure something goes wrong in |
It is actually being set right here, in ompi/runtime/ompi_rte.c:
static int _setup_job_session_dir(char **sdir)
{
    int rc;
    /* get the effective uid */
    uid_t uid = geteuid();
    if (0 > opal_asprintf(sdir, "%s/ompi.%s.%lu/jf.0/%u",
                          opal_process_info.top_session_dir,
                          opal_process_info.nodename,
                          (unsigned long)uid,
                          opal_process_info.my_name.jobid)) {
        opal_process_info.job_session_dir = NULL;
        return OPAL_ERR_OUT_OF_RESOURCE;
    }
This code gets executed when the application proc was unable to connect back to |
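To see how that format string maps onto the path reported earlier in this thread (`/var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T//ompi.local.504/jf.0/2991259649/...`), here is a minimal standalone sketch. It is not Open MPI code: `format_job_session_dir` is a hypothetical stand-in that uses plain `snprintf` and ordinary parameters in place of `opal_asprintf` and `opal_process_info`, and the job ID is assumed to be a 32-bit unsigned value to match `%u`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the opal_asprintf() call quoted above:
 * same format string, caller-supplied buffer and values.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * formatting failed or the result did not fit in the buffer. */
int format_job_session_dir(char *buf, size_t buflen,
                           const char *top_session_dir,
                           const char *nodename,
                           unsigned long uid,
                           uint32_t jobid)
{
    int n = snprintf(buf, buflen, "%s/ompi.%s.%lu/jf.0/%u",
                     top_session_dir, nodename, uid,
                     (unsigned int)jobid);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return n;
}
```

Called with `"/var/folders/6j/vzv3ldfs7wzbhlqfw2yptx100000gr/T/"`, `"local"`, `504` and `2991259649` (values read off the reported path, so the decomposition itself is an inference), this yields the directory portion of that path, doubled slash included. The doubled slash appears because the top-level directory already ends in `/`.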
@ggouaillardet I managed to install it with
I further tested adding
After changing |
I suspect the problem @mathomp4 and @Zhenguang-Huang are having is that they have networking disabled on their Macs. OMPI v5 needs at least one network interface to be enabled, even if it is only the loopback interface. OMPI v4 had a fallback communication channel that did not require a network interface to be available. This was removed for a variety of reasons, but would explain why you can run with OMPI v4 and not with v5. I can explore bringing it back - but no promises.
If you folks could confirm that you do or do not have an available interface, we could all probably avoid a lot more cycles chasing this down.
@ggouaillardet You might want to generate an OMPI PR with something like the following to help reduce the confusion - IIRC, we've seen this problem multiple times now:
diff --git a/ompi/runtime/help-mpi-runtime.txt b/ompi/runtime/help-mpi-runtime.>
index c20b723b3e..331555eb3e 100644
--- a/ompi/runtime/help-mpi-runtime.txt
+++ b/ompi/runtime/help-mpi-runtime.txt
@@ -122,3 +122,14 @@ to execute the job.
[no-pmix-but]
No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, %d singletons will be started.
+#
+[invalid-singleton]
+At least one application process was unable to connect to its local
+PMIx server, but detected it was launched by a server and is not
+a singleton as it was assigned a non-zero rank. This typically indicates
+the lack of an available network interface by which the application
+process can connect back to the server.
+
+Please check to ensure you have at least one active interface on the node.
+This can be a simple loopback device - no external network access is
+required.
diff --git a/ompi/runtime/ompi_rte.c b/ompi/runtime/ompi_rte.c
index 4e7719c73a..7c78eb4e31 100644
--- a/ompi/runtime/ompi_rte.c
+++ b/ompi/runtime/ompi_rte.c
@@ -583,6 +583,14 @@ int ompi_rte_init(int *pargc, char ***pargv)
if (PMIX_ERR_UNREACH == ret) {
bool found_a_pmi = false;
int n = 0;
+
+ /* if we have a rank greater than 0, then we probably are
+ * part of a failed multi-proc job */
+ if (0 < opal_process_info.myprocid.rank) {
+ opal_show_help("help-mpi-runtime.txt", "invalid-singleton", tr>
+ return OPAL_ERR_SILENT;
+ }
+
/* if we are in a PMI environment with two tasks or more,
* we probably do not want to start singletons */
while (pmi_sentinels[n] != NULL) { |
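For anyone wanting to answer the "do you have an available interface" question programmatically, a minimal sketch using `getifaddrs` follows. It only reports which addresses belong to interfaces the OS marks as up and whether any of them is a loopback device. It says nothing about whether a firewall lets connections through. The function name `count_up_addresses` is hypothetical.

```c
#define _DEFAULT_SOURCE /* exposes IFF_* on glibc under strict -std modes */
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Count IPv4/IPv6 addresses that sit on interfaces flagged IFF_UP.
 * An interface with several addresses is counted once per address.
 * If loopback_up is non-NULL it is set to 1 when at least one counted
 * address belongs to a loopback device, else 0.
 * Returns -1 if getifaddrs() fails. */
int count_up_addresses(int *loopback_up)
{
    struct ifaddrs *list = NULL;
    int count = 0;

    if (loopback_up != NULL) {
        *loopback_up = 0;
    }
    if (getifaddrs(&list) != 0) {
        return -1;
    }

    for (struct ifaddrs *ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL) {
            continue; /* entry carries no address */
        }
        int family = ifa->ifa_addr->sa_family;
        if (family != AF_INET && family != AF_INET6) {
            continue; /* skip link-layer and other families */
        }
        if (!(ifa->ifa_flags & IFF_UP)) {
            continue; /* interface is administratively down */
        }
        int is_lo = (ifa->ifa_flags & IFF_LOOPBACK) != 0;
        printf("%s: %s%s\n", ifa->ifa_name,
               family == AF_INET ? "IPv4" : "IPv6",
               is_lo ? " (loopback)" : "");
        if (is_lo && loopback_up != NULL) {
            *loopback_up = 1;
        }
        count++;
    }

    freeifaddrs(list);
    return count;
}
```

A result of zero, or a result where the loopback flag stays 0, would match the "no available interface" situation described above. A non-zero count with loopback up points the investigation elsewhere.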
@rhc54 On the box (Mac Studio) in which I get this error, I am currently connected via Ethernet to the internet and via Wifi. Heck, I am writing this comment from that box. So I mean, some sort of network must be available. 🤷🏼 Now, if I drill into my macOS settings, I do see in the Firewall → Options... dialog:
Could that do it? If so, I have no way to change that. That setting is configured via a profile by NASA IT/security and I'm guessing I'd have little sway to change that. I mean, I can try, but...ooh boy. If that is the cause, then it might be that at least NASA macOS machines can never use Open MPI 5 unless we can get IT to relax that. |
One more note. I built Open MPI 5.0.6 via Spack with the same settings I configured with above (well, along with a bug fix, see spack/spack#49463), and I get the same error with Hello World. So I guess "good" news is it's reproducible. |
Yeah, that is likely the source of the problem. I agree that getting any org-level IT to relax it is a non-starter. Let me investigate how hard it would be to restore the usock support. It avoids the network connection issue. |
You know, what I really don't understand is what this has to do with the OS upgrade - unless your IT people added that firewall setting when you upgraded the OS. Otherwise, this makes no sense as the connection need has nothing to do with the OS level. If that firewall rule was present under Sonoma, then OMPI v5 would have failed there too. |
@rhc54 Let me look around. I'm sort of my group's "beta tester" so I updated to macOS 15 first. The others are still on macOS 14. This could be an update from that... One moment please. |
Kewl - thanks! FWIW - I checked about restoring the usock support and the short answer is "doable, but not easy". Probably take me a month or so to complete it (retirement => slow). Would have to be in PMIx v6 as it is too disruptive to bring back to PMIx v5 (which is what OMPI v5 is using). However, once done, you could build a separate PMIx v6 install and then link OMPI v5 to it - and you'd be off and running. I can update you here when that is ready. Meantime, if your IT dept is able/willing, you could ask if they can relax that connection rule a tad and allow loopback operations while still excluding external inbound connections. Don't know if Mac's firewall is that sophisticated or not - but if they are willing, they could perhaps take a look. |
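A quick way to check whether plain TCP over loopback works at all on a given machine is sketched below, independent of Open MPI and PMIx. It listens on 127.0.0.1 with a kernel-chosen port, connects from the same process, and passes one byte through. This is only a rough probe under stated assumptions. It does not reproduce what PMIx does, it runs inside a single process while a per-application firewall rule may treat separate executables differently, and it has no timeout, so a silently dropped connection could hang instead of failing. The function name is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try a one-byte TCP round trip over 127.0.0.1.
 * Returns 0 if the byte arrives intact, -1 on any failure. */
int loopback_tcp_roundtrip(void)
{
    int rc = -1;
    int lfd = -1, cfd = -1, afd = -1;
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char out = 'x', in = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* Listener side */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto done;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) != 0) goto done;
    if (listen(lfd, 1) != 0) goto done;
    /* Learn which port was assigned */
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) != 0) goto done;

    /* Client side: connect to the listener just created */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) != 0) goto done;

    afd = accept(lfd, NULL, NULL);
    if (afd < 0) goto done;

    /* Send one byte from client to accepted socket and compare */
    if (write(cfd, &out, 1) != 1) goto done;
    if (read(afd, &in, 1) != 1) goto done;
    if (in == out) rc = 0;

done:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return rc;
}
```

If this returns 0 while Open MPI 5 still fails, loopback itself is usable and the block is more likely specific to the application or to how the firewall classifies it.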
Okay. I looked at a couple macOS 14 folks and they don't have it. I haven't yet found someone else on macOS 15 to see if it is there, but my current suspicion is it won't be. Why? Well, on my colleagues' macOS machines, the Firewall was filled with odd apps. Indeed, they are apps we build/work on that would have triggered the "Accept Incoming Connections" dialog cascade we always get on our machines. What I'm thinking is that when I updated to macOS 15, my Firewall was maybe reset. I then tried Open MPI 5 for the first time and, in my haste/habit, I pressed "Deny" on that dialog. And maybe we boring users have some limited power to Deny apps, but no power to Allow apps. I'm going to consult with our local sysadmins. It's possible they have some control over whatever is filling in the Firewall entries? I'll let you know.
|
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed by hand via:
Please describe the system on which you are running
Details of the problem
Recently, I updated my Mac Studio to macOS 15.3.1 from macOS 14. The upgrade went fine until recently when I tried to run some MPI code and found it failing. I then tried HelloWorld and that fails. But that was with an Open MPI 5.0.5 built back in the Sonoma days. So I grabbed the 5.0.7 tarfile and built as shown above (exactly how I built with 5.0.5), but no joy. When I try and run helloworld I get this:
Now, I looked around the internet (and these issues) and found things like #12273, which suggested --pmixmca ptl_tcp_if_include lo0, but:
So not that. But threads also said to try --mca ptl_tcp_if_include lo0, so:
I also tried a few other things in combination that I had commented in a modulefile:
but nothing helped.
But, on a supercomputer cluster I work on, Open MPI 5 has just never worked for us, so I thought "Let me try Open MPI 4.1" so I grabbed 4.1.8 and built as:
(exactly the same as 5.0.7 save where I'm installing) and then:
So, good news, I can keep working with Open MPI 4.1. Bad news, I'm unsure why Open MPI 5 stopped working all of a sudden. It did work before the OS update.