Issues with running sanity check for OpenMPI #3541

Open
asp8200 opened this issue Dec 26, 2024 · 13 comments

Comments


asp8200 commented Dec 26, 2024

EasyBuild couldn't run the sanity check for OpenMPI-5.0.3-GCC-13.3.0.eb

I ran EasyBuild 4.9.4 (framework: 4.9.4, easyblocks: 4.9.4) on Rocky Linux v9.4 with Python v3.9.18.

I did manage to run eb OpenMPI-5.0.3-GCC-13.3.0.eb --robot --skip-sanity-check, and afterwards I ran eb OpenMPI-5.0.3-GCC-13.3.0.eb --robot --sanity-check-only, which gave me the following error message:

== sanity checking...
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/easybuild/main.py", line 137, in build_and_install_software
    (ec_res['success'], app_log, err) = build_and_install_one(ec, init_env)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 4276, in build_and_install_one
    result = app.run_all_steps(run_test_cases=run_test_cases)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 4155, in run_all_steps
    self.run_step(step_name, step_methods)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 3990, in run_step
    step_method(self)()
  File "/usr/local/lib/python3.9/site-packages/easybuild/easyblocks/o/openmpi.py", line 222, in sanity_check_step
    ranks = min(8, self.cfg['parallel'])
TypeError: '<' not supported between instances of 'NoneType' and 'int'

The problem seems to be that self.cfg['parallel'] on line 222 evaluates to None. I tried adding --parallel to the eb command, that is, eb OpenMPI-5.0.3-GCC-13.3.0.eb --sanity-check-only --parallel=10, but that didn't help.

Hence I changed line 222 in openmpi.py to

ranks = 8 if self.cfg['parallel'] is None else min(8, self.cfg['parallel'])
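
The failure and the guard can be reproduced outside EasyBuild. The following is a minimal sketch, not the easyblock's actual code: pick_ranks is a hypothetical helper standing in for line 222, and its parallel argument mimics self.cfg['parallel'], which was observed to be None in this report.

```python
def pick_ranks(parallel, cap=8):
    """Return the number of MPI ranks to use, capped at `cap`.

    `parallel` stands in for self.cfg['parallel'] (hypothetical helper,
    not part of EasyBuild); it may be None, as seen in this issue.
    """
    if parallel is None:
        # No parallelism value available: fall back to the cap.
        return cap
    return min(cap, parallel)


if __name__ == "__main__":
    # The unguarded expression fails because None cannot be compared to an int.
    try:
        min(8, None)
    except TypeError as err:
        print("unguarded min() failed:", err)

    print(pick_ranks(None), pick_ranks(4), pick_ranks(10))
```

Using `is None` rather than `== None` is the idiomatic identity check in Python and avoids surprises with objects that override equality.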

That got the sanity check running a bit further:

== sanity checking...
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/easybuild/main.py", line 137, in build_and_install_software
    (ec_res['success'], app_log, err) = build_and_install_one(ec, init_env)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 4276, in build_and_install_one
    result = app.run_all_steps(run_test_cases=run_test_cases)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 4155, in run_all_steps
    self.run_step(step_name, step_methods)
  File "/usr/local/lib/python3.9/site-packages/easybuild/framework/easyblock.py", line 3990, in run_step
    step_method(self)()
  File "/usr/local/lib/python3.9/site-packages/easybuild/easyblocks/o/openmpi.py", line 234, in sanity_check_step
    src_path = os.path.join(self.cfg['start_dir'], srcdir, src)
  File "/usr/lib64/python3.9/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I hacked my way around that by changing the code around line 234 in openmpi.py into

            src_path_exists = False
            # start_dir can be None when only the sanity check is run
            if self.cfg['start_dir'] is not None:
                src_path = os.path.join(self.cfg['start_dir'], srcdir, src)
                src_path_exists = os.path.exists(src_path)
            if src_path_exists:

and then OpenMPI-5.0.3-GCC-13.3.0.eb passed the sanity check.
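
The same guard can be written as a small standalone function. This is only a sketch under assumptions: existing_source is a hypothetical name, and the srcdir and src values in the usage comment are illustrative rather than taken from the easyblock.

```python
import os


def existing_source(start_dir, srcdir, src):
    """Return the full path to a source file if it can be located, else None.

    `start_dir` stands in for self.cfg['start_dir'] and may be None
    (hypothetical helper, not part of EasyBuild).
    """
    if start_dir is None:
        # os.path.join(None, ...) would raise TypeError, so bail out early.
        return None
    path = os.path.join(start_dir, srcdir, src)
    return path if os.path.exists(path) else None


# Illustrative usage (directory and file names are assumptions):
# existing_source(None, "examples", "hello_c.c")  -> None, no exception
```

Returning None for both "no start directory" and "file not found" lets the caller use a single `if path:` check before trying to compile anything.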

Details also described on EasyBuild Slack.

Thanks to @sassy-crick for helping out with the debugging.

I'm very new to EasyBuild, but I'd be happy to try to make a PR with the changes listed above, if you agree that there is an issue with openmpi.py.

Author

asp8200 commented Jan 16, 2025

I tried doing a re-install and the above-mentioned issue popped up again.

self.cfg['start_dir'] evaluates to /users/tools/easybuild/build/OpenMPI/5.0.3/GCC-13.3.0/openmpi-5.0.3/ which seems okay to me.

Author

asp8200 commented Jan 16, 2025

I had to disable the sanity-check in order to have the installation complete without errors.

Then, when I try to run only the sanity check with

eb OpenMPI-5.0.3-GCC-13.3.0.eb --robot --force --sanity-check-only

I get the following on stdout:

== FAILED: Installation ended unsuccessfully (build directory: /ngc/tools/easybuild/build/OpenMPI/5.0.3/GCC-13.3.0): build failed (first 300 chars): Sanity check failed: no file found at 'bin/prterun' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'include/prte.h' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'lib/libprrte.so' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0 (took 0 secs)
== Results of the build can be found in the log file(s) /tmp/eb-wqkkj9ok/easybuild-OpenMPI-5.0.3-20250116.184634.fhxvy.log
ERROR: Build of /users/people/andped/eb_test/OpenMPI-5.0.3-GCC-13.3.0.eb failed (err: "build failed (first 300 chars): Sanity check failed: no file found at 'bin/prterun' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0\nno file found at 'include/prte.h' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0\nno file found at 'lib/libprrte.so' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0")

From end of log:

== 2025-01-16 18:46:34,662 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/software/EasyBuild/4.9.4/lib/python3.9/site-packages/easybuild/base/exceptions.py:126 in __init__): Sanity check failed: no file found at 'bin/prterun' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'include/prte.h' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'lib/libprrte.so' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0 (at easybuild/software/EasyBuild/4.9.4/lib/python3.9/site-packages/easybuild/framework/easyblock.py:3670 in _sanity_check_step)
== 2025-01-16 18:46:34,662 filetools.py:2025 INFO Removing lock /ngc/tools/easybuild/software/.locks/_ngc_tools_easybuild_software_OpenMPI_5.0.3-GCC-13.3.0.lock...
== 2025-01-16 18:46:34,666 filetools.py:385 INFO Path /ngc/tools/easybuild/software/.locks/_ngc_tools_easybuild_software_OpenMPI_5.0.3-GCC-13.3.0.lock successfully removed.
== 2025-01-16 18:46:34,666 filetools.py:2029 INFO Lock removed: /ngc/tools/easybuild/software/.locks/_ngc_tools_easybuild_software_OpenMPI_5.0.3-GCC-13.3.0.lock
== 2025-01-16 18:46:34,666 easyblock.py:4297 WARNING build failed (first 300 chars): Sanity check failed: no file found at 'bin/prterun' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'include/prte.h' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0
no file found at 'lib/libprrte.so' in /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0

Indeed, I can't find the files prte.h, libprrte.so and prterun under /ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0/. In fact, there are no files matching *prte* anywhere in that folder.
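
A check like this can be scripted. The sketch below uses hypothetical helper names (missing_paths, find_matching) and only mirrors the three paths from the error message above; it is not how EasyBuild itself performs the sanity check.

```python
from pathlib import Path


def missing_paths(install_dir, relative_paths):
    """Return the relative paths that are not regular files under install_dir."""
    root = Path(install_dir)
    return [p for p in relative_paths if not (root / p).is_file()]


def find_matching(install_dir, pattern="*prte*"):
    """Recursively list everything under install_dir whose name matches pattern."""
    return sorted(str(p) for p in Path(install_dir).rglob(pattern))


if __name__ == "__main__":
    # Installation prefix and file list as reported in this issue.
    prefix = "/ngc/tools/easybuild/software/OpenMPI/5.0.3-GCC-13.3.0"
    expected = ["bin/prterun", "include/prte.h", "lib/libprrte.so"]
    print("missing:", missing_paths(prefix, expected))
    print("matches for *prte*:", find_matching(prefix))
```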

Member

ocaisa commented Jan 16, 2025

You should probably review the build logs; skipping the sanity check is usually not a good idea. You never included the error from your original comment, but it would be good to know what that was.

Member

ocaisa commented Jan 16, 2025

Looking back at the Slack thread, I see it was related to the "Hello world" MPI program hanging. That has been known to happen when OpenMPI uses UCX or libfabric. Do you have InfiniBand in the system where you are doing the builds?

I would make sure that you can successfully compile and run an MPI code with the module. You may need to tune OpenMPI a little. For example, in the test cluster for easyconfig PRs we set

export FI_PROVIDER="^psm3"

due to hangs similar to this (and there is no InfiniBand).

Author

asp8200 commented Jan 16, 2025

Hi @ocaisa. Thanks for your input. I just managed to get the eb-installation to complete without errors (and without skipping the sanity check).

I had to add the following

configopts = '--with-prrte=/ngc/tools/easybuild/software/PRRTE/3.0.5-GCCcore-13.3.0'

to the recipe OpenMPI-5.0.3-GCC-13.3.0.eb.

Member

ocaisa commented Jan 16, 2025

Hmm, that should already be covered by

known_dependencies.append('PRRTE')

Was the PRRTE dependency included in your easyconfig recipe?

Author

asp8200 commented Jan 16, 2025

> Looking back at the Slack thread, I see it was related to the "Hello world" MPI program hanging. That has been known to happen when OpenMPI uses UCX or libfabric. Do you have InfiniBand in the system where you are doing the builds?
>
> I would make sure that you can successfully compile and run an MPI code with the module. You may need to tune OpenMPI a little. For example, in the test cluster for easyconfig PRs we set
>
> export FI_PROVIDER="^psm3"
>
> due to hangs similar to this (and there is no InfiniBand).

I tried a small test of OpenMPI/5.0.3-GCC-13.3.0, with test.c containing:

#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    printf("World size: %d\n", world_size);
    MPI_Finalize();
    return 0;
}

When I load OpenMPI/5.0.3-GCC-13.3.0 (which also loads libfabric/1.21.0-GCCcore-13.3.0) and run mpirun -np 4 ./test, I get:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: cld062-0010
  Location: mtl_ofi_component.c:515
  Error: No data available (61)
--------------------------------------------------------------------------
World size: 4
World size: 4
World size: 4
World size: 4

By the way, I previously had another error message, which I got rid of by adding /etc/security/limits.d/99-memlock.conf with the following content:

*   soft    memlock    unlimited
*   hard    memlock    unlimited
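
Whether such limits are actually in effect for a given session can be checked from Python. This is a minimal sketch using the standard resource module on Linux; the helper names describe and memlock_limits are hypothetical, and changes under limits.d typically only apply to sessions started after the change.

```python
import resource


def describe(limit):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def memlock_limits():
    """Return (soft, hard) locked-memory limits of the current process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe(soft), describe(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"memlock soft={soft} hard={hard}")
```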

Anyway, I can get rid of the above-mentioned error by setting FI_PROVIDER to tcp.
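
To compare provider settings without changing the shell environment, the variable can be set per run. The sketch below is an assumption-laden illustration: mpirun_env and run_test are hypothetical helpers, ./test is the binary built from test.c above, and the 60-second timeout is an arbitrary choice to avoid waiting forever on a hang.

```python
import os
import subprocess


def mpirun_env(provider, base_env=None):
    """Return a copy of the environment with FI_PROVIDER set to `provider`."""
    env = dict(os.environ if base_env is None else base_env)
    env["FI_PROVIDER"] = provider  # e.g. "tcp" or "^psm3" (exclude psm3)
    return env


def run_test(provider, nprocs=4, binary="./test", timeout=60):
    """Run the MPI test binary with the given libfabric provider setting.

    Raises subprocess.TimeoutExpired if the run hangs past `timeout` seconds.
    """
    return subprocess.run(
        ["mpirun", "-np", str(nprocs), binary],
        env=mpirun_env(provider),
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Illustrative usage, with the OpenMPI module loaded:
# result = run_test("tcp")
# print(result.stdout, result.stderr)
```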

Author

asp8200 commented Jan 16, 2025

> Hmm, that should already be covered by
>
> easybuild-easyblocks/easybuild/easyblocks/o/openmpi.py, line 76 in 3469151:
>
> known_dependencies.append('PRRTE')
>
> Was the PRRTE dependency included in your easyconfig recipe?

Yes, OpenMPI-5.0.3-GCC-13.3.0.eb contains:

builddependencies = [
    ('pkgconf', '2.2.0'),
    ('Autotools', '20231222'),
]

dependencies = [
    ('zlib', '1.3.1'),
    ('hwloc', '2.10.0'),
    ('libevent', '2.1.12'),
    ('UCX', '1.16.0'),
    ('libfabric', '1.21.0'),
    ('PMIx', '5.0.2'),
    ('PRRTE', '3.0.5'),
    ('UCC', '1.3.0'),
]

Member

ocaisa commented Jan 16, 2025

Strange, the exact option you added should have been there already, from what I see in the easyblock. Can you compare the configure command with and without the new configopts to see the difference? (No need for a build; you can already see it with -x.)

Author

asp8200 commented Jan 16, 2025

> Strange, the exact option you added should have been there already, from what I see in the easyblock. Can you compare the configure command with and without the new configopts to see the difference? (No need for a build; you can already see it with -x.)

Let's see if I understood that correctly. I did:

$ eb OpenMPI-5.0.3-GCC-13.3.0.eb -x > with_configopts.txt
# Then removed the configopts line from OpenMPI-5.0.3-GCC-13.3.0.eb
$ eb OpenMPI-5.0.3-GCC-13.3.0.eb -x > without_configopts.txt
$ diff with_configopts.txt without_configopts.txt
1,2c1,2
< == Temporary log file in case of crash /tmp/eb-81rn7qqs/easybuild-2d7_xm3a.log
< file /tmp/eb-81rn7qqs/fake_vsc_7iwb7st3 removed
---
> == Temporary log file in case of crash /tmp/eb-74v8uirj/easybuild-_aezdgw6.log
> file /tmp/eb-74v8uirj/fake_vsc_19osiycv removed
311,312c311,312
< directory /ngc/tools/easybuild/software/.locks/_tmp_eb-81rn7qqs___ROOT___ngc_tools_easybuild_software_OpenMPI_5.0.3-GCC-13.3.0.lock removed
< == COMPLETED: Installation ended successfully (took 3 secs)
---
> directory /ngc/tools/easybuild/software/.locks/_tmp_eb-74v8uirj___ROOT___ngc_tools_easybuild_software_OpenMPI_5.0.3-GCC-13.3.0.lock removed
> == COMPLETED: Installation ended successfully (took 4 secs)
323,324c323,324
< == Temporary log file(s) /tmp/eb-81rn7qqs/easybuild-2d7_xm3a.log* have been removed.
< == Temporary directory /tmp/eb-81rn7qqs has been removed.
---
> == Temporary log file(s) /tmp/eb-74v8uirj/easybuild-_aezdgw6.log* have been removed.
> == Temporary directory /tmp/eb-74v8uirj has been removed.
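
All of the differences above come from random temporary names and timings. A comparison that masks those first makes any real difference stand out. The following is a rough sketch: the patterns are assumptions derived only from the output shown here (8-character random suffixes), and normalize and meaningful_diff are hypothetical helper names.

```python
import difflib
import re

# Patterns for run-specific noise, inferred from the sample output above.
PATTERNS = [
    (re.compile(r"eb-[a-z0-9_]{8}"), "eb-XXXXXXXX"),
    (re.compile(r"easybuild-[a-z0-9_]{8}\.log"), "easybuild-XXXXXXXX.log"),
    (re.compile(r"fake_vsc_[a-z0-9_]{8}"), "fake_vsc_XXXXXXXX"),
    (re.compile(r"took \d+ secs?"), "took N secs"),
]


def normalize(line):
    """Replace temporary names and timings with fixed placeholders."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


def meaningful_diff(text_a, text_b):
    """Return unified-diff lines between two outputs after masking noise."""
    a = [normalize(line) for line in text_a.splitlines()]
    b = [normalize(line) for line in text_b.splitlines()]
    return list(difflib.unified_diff(a, b, "with", "without", lineterm=""))


if __name__ == "__main__":
    with open("with_configopts.txt") as fa, open("without_configopts.txt") as fb:
        for line in meaningful_diff(fa.read(), fb.read()):
            print(line)
```

An empty result means the two dry runs differ only in temporary names and timings.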

Member

ocaisa commented Jan 16, 2025

Those files should show the configure command that was used, and indeed they seem to be identical, so I am not sure what changed.

Author

asp8200 commented Jan 16, 2025

Strange indeed. Do you reckon that OpenMPI/5.0.3-GCC-13.3.0, libfabric/1.21.0-GCCcore-13.3.0 and PRRTE/3.0.5-GCCcore-13.3.0 have been installed properly on my system? (I realise that may be hard to tell solely from the information I've provided above.)

Member

ocaisa commented Jan 16, 2025

I would say yes: if OpenMPI is passing its sanity check, then things are fine. I'm just not sure what triggered the difference (perhaps the memlock limits?). If you have a fast interconnect, you can install the OSU benchmarks and check ping-pong.
