CUDA-aware MPI build on fresh Ubuntu 24.04 LTS, MPIX_Query_cuda_support() returns zero
#13130
Comments
I am unable to replicate: as long as I run your example on a machine with CUDA devices attached, it works just as expected. If I run on a machine without GPUs, then it fails at runtime. Can you run
Thanks for responding so quickly --- here's the output you asked for:
I've also tried to add
I think these lines from
It seems that most of the components can find
Finally, both machines were running with CUDA GPUs inside. Here's some relevant output for that:
Admittedly, there's a mismatch in CUDA versions between the driver and compiler. Not sure if that's the issue.
You are correct, most of the components built as DSOs found CUDA. However, coll:cuda decided to be built statically and failed. Something is weird with your build because, at the end, the CUDA accelerator module has not been built. That's why you don't see it in the output I asked you for, and why it fails when you try to force loading it. If you go in build directory then
There is a Makefile. Here's the output now
The CUDA accelerator component is built, but not loaded.
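One way to narrow down a "built but not loaded" component is to ask the dynamic loader directly why it refuses the file. The following is a minimal sketch, not something taken from this thread: it assumes that DSO components are installed under `<prefix>/lib/openmpi` with names of the form `mca_<framework>_<component>.so`, and the helper names `find_cuda_components` and `try_load` are made up for illustration.

```python
import ctypes
import glob
import os


def find_cuda_components(prefix):
    """Return sorted paths of CUDA-related MCA component DSOs under prefix.

    Assumption: DSO components live in <prefix>/lib/openmpi and are named
    mca_<framework>_<component>.so. Adjust the pattern if your install differs.
    """
    pattern = os.path.join(prefix, "lib", "openmpi", "mca_*cuda*.so")
    return sorted(glob.glob(pattern))


def try_load(path):
    """Try to dlopen a shared object; return None on success, else the error text."""
    try:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    prefix = "/opt/openmpi"  # install prefix used in this thread
    components = find_cuda_components(prefix)
    if not components:
        print(f"no CUDA component DSOs found under {prefix}")
    for path in components:
        error = try_load(path)
        print(f"{path}: {'loaded' if error is None else error}")
```

If a component fails here, the message usually names the missing library or symbol. A failure in this standalone test does not necessarily match what Open MPI's own component loader sees, since it loads components from inside an already initialized process, so treat the result as a hint rather than a verdict.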
FYI I'm still getting the same behaviour from the compile / run commands for
Everything seems to be in place, but the CUDA accelerator component is not loaded.
Everything looks normal. Let's make sure launching an app does not screw up the environment
Btw I've not added
That might be a reason: not PATH but LD_LIBRARY_PATH. Most of the components are built statically into libmpi.so with a few exceptions, and the CUDA-based components are among these exceptions. But I'm slightly skeptical, as ompi_info managed to find the CUDA shared library and all the processes are local, so they should inherit the mpirun environment. But just in case, you can try
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
/opt/openmpi/bin/mpirun -np 1 -x LD_LIBRARY_PATH ./mpi_check
I'm running out of ideas unfortunately.
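To check whether the launched process really inherits the library path, a tiny script can print what it sees, once when run directly and once when run under mpirun. This is only a sketch: the script name `check_env.py` and the helper names are hypothetical, and `/opt/openmpi/lib` is simply the directory from the commands above.

```python
import os


def library_path_entries(env=None):
    """Split LD_LIBRARY_PATH into its non-empty directory entries."""
    env = os.environ if env is None else env
    raw = env.get("LD_LIBRARY_PATH", "")
    return [entry for entry in raw.split(":") if entry]


def has_library_dir(libdir, env=None):
    """Report whether libdir appears in LD_LIBRARY_PATH (paths are normalized)."""
    wanted = os.path.normpath(libdir)
    return any(os.path.normpath(e) == wanted for e in library_path_entries(env))


if __name__ == "__main__":
    libdir = "/opt/openmpi/lib"  # library directory from the commands above
    for entry in library_path_entries():
        print(entry)
    print(f"{libdir} present: {has_library_dir(libdir)}")
```

Running `python3 check_env.py` and then `/opt/openmpi/bin/mpirun -np 1 python3 check_env.py` and comparing the two outputs shows whether the launcher changes the environment the application sees.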
Unfortunately still the same behaviour :( Thank you so much for taking the time. You said you were unable to reproduce this error --- could you tell me what setup you used on your end to produce a working CUDA-aware OpenMPI build on Ubuntu 24.04 LTS? If there's a docker container that has a working installation that I could run my code in, that would work too. I'm also really puzzled that the output of simply
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Obtained from https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.gz
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
N/A
Please describe the system on which you are running
Details of the problem
I'm really struggling to run CUDA-aware MPI on just one node. I want to do this so that I can test my code locally before deploying to a cluster. I've reproduced this on a fresh install of Ubuntu 24.04 on two different machines.
Here's my install steps:
Now, I build a very simple test program:
This was built with:
Then, we get this output:
However, if I just run ./mpi_check, i.e. no mpirun, I get this output:
There are no other MPI installations, and this was reproduced on two independent machines.
Perhaps I'm missing a step, or missing some configuration, but I've tried lots of variations of each of the above commands to no avail, and (I think?) I've followed the install instructions in the documentation correctly. So I believe it is a bug.
If I'm missing something, please let me know. Also please let me know if you'd like the config.out and make.out log files.