Framework: how to reproduce errors using genconfig and RHEL8 #13022

Closed
jhux2 opened this issue May 17, 2024 · 13 comments
Labels
PA: Framework Issues that fall under the Trilinos Framework Product Area type: question

Comments

@jhux2
Member

jhux2 commented May 17, 2024

Question

@trilinos/framework @sebrowne

Many SNL workstations and CEE resources are being upgraded to RHEL8 or RHEL9. How should developers reproduce errors that show up on the dashboard or in PR testing, which uses RHEL7? The genconfig instructions are RHEL7 specific.

@jhux2 jhux2 added type: question PA: Framework Issues that fall under the Trilinos Framework Product Area labels May 17, 2024
@jhux2
Member Author

jhux2 commented May 17, 2024

On ascicgpu031, which is running RHEL8, step 8 of the PR-reproducer instructions results in an error:

realpath: missing operand
Try 'realpath --help' for more information.
Using system 'rhel7' based on matching hostname 'ascicgpu031'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5' in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested:
"sems-openmpi/4.0.5-cuda-11.4.2"

@jhux2 jhux2 pinned this issue May 17, 2024
@jhux2
Member Author

jhux2 commented May 17, 2024

Opened a related SEMS ticket: SEMSHELPD-3859.

@achauphan
Contributor

@jhux2, I wanted to let you know that we are working on this. The error you are running into is exactly what is hitting our PR systems for that specific cuda no-uvm build because of the rhel8 upgrade.

@achauphan
Contributor

achauphan commented Jun 11, 2024

Circling back on this (@jhux2, you may already be aware of this workaround, so this is for anyone else):

Since the ascicgpu machines are rhel8 and we do not have a rhel8 cuda config ready yet, you will need to use SEMS' rhel7 modules to reproduce one of the existing rhel7 configurations on the rhel8 machines. This is the workaround currently implemented for the rhel7 PR tests that run on our rhel8 machines.

The workaround is to make your own copy of /projects/sems/modulefiles/utils/sems-modules-init.sh, add the rhel8 hostname you're using to the bottom of the script, and source that script. This will load the rhel7 SEMS modules on a rhel8 machine so that you can test a rhel7 PR configuration.

Once we have finalized the rhel8 cuda config, this workaround will no longer be needed and we will be able to use the SEMS rhel8 modules.
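
For anyone scripting the copy step of the workaround, here is a minimal sketch. The helper name `copy_sems_init` and the default destination path are made up for illustration and are not part of SEMS or GenConfig. The line to append for your hostname depends on what the script actually contains, so that part is left as a manual edit.

```shell
# Hypothetical helper (not part of SEMS or GenConfig): make a private,
# editable copy of the SEMS module init script and print the copy's path.
copy_sems_init() {
  # Source path is the one given in this thread; destination is an assumption.
  src="${1:-/projects/sems/modulefiles/utils/sems-modules-init.sh}"
  dest="${2:-$HOME/sems-modules-init.sh}"
  if [ ! -r "$src" ]; then
    echo "cannot read $src" >&2
    return 1
  fi
  cp "$src" "$dest" && printf '%s\n' "$dest"
}
```

After running `copy_sems_init`, you would edit the copy to add your rhel8 hostname at the bottom, then `source` the copy before running gen-config.sh.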

I've modified the wikipage with this same comment for now.

UPDATE: The bypass has been removed from the official copy of the sems-modules-init script, because most of the Trilinos PR tests (including the GPU tests) have been converted to RHEL8 configurations. If you still need to replicate a RHEL7 environment on a RHEL8 machine, please let me know and I can help, as I still have the bypass script.

@hkthorn
Contributor

hkthorn commented Jun 11, 2024

@achauphan Thanks! I was able to use these instructions to get building on ascicgpu030 again!

@csiefer2
Member

@achauphan I'm getting:

+==============================================================================+
|   ERROR:  The following section(s) in your config-specs.ini file
|           do not match any systems listed in
|           'supported-systems.ini':

It really thinks my machine is RHEL8 (which it is).
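
To confirm which major OS version a host is actually running, independent of what genconfig infers from the hostname or build name, something like the following might help. It is only a sketch: the function name `os_major_version` is invented here, and it assumes the host provides /etc/os-release with a `VERSION_ID` entry.

```shell
# Hypothetical helper: print the major OS version (e.g. 7, 8, 9) by reading
# VERSION_ID from an os-release style file. Assumes the file defines VERSION_ID.
os_major_version() {
  file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the caller,
  # then strip everything from the first dot onward (8.9 becomes 8).
  ( . "$file" && printf '%s\n' "${VERSION_ID%%.*}" )
}
```

Calling `os_major_version` with no argument reads /etc/os-release on the current machine.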

@achauphan
Contributor

@csiefer2 which configuration and machine is this happening on, and does it also happen with the bypass?

@csiefer2
Member

@achauphan
Machine: One of the ascicgpus
Config: sems-gnu-8.3.0-openmpi-1.10.1

The bypass only generates an error in the shell script, so clearly I'm misunderstanding the instructions somehow.

@sebrowne
Contributor

sebrowne commented Jun 27, 2024

@csiefer2 I just (this morning, hasn't taken effect in the current AT run) pulled the trigger on migrating those builds (GCC + OpenMPI) to their updated RHEL8 counterparts. We're at the point where only the Intel build should be using the RHEL7 configuration (and we hope to transition that tomorrow morning).

Apologies for the instability this week, we're running up against a pretty rapid deadline and appreciate everybody's understanding!

@jhux2
Member Author

jhux2 commented Jul 27, 2024

I've found that not all of the builds are reproducible. I guess from the error message that this is known/expected? For example, the following doesn't work for me:

export GENCONFIG_BUILD_NAME="rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables"
source $TRILINOS_SRC/packages/framework/GenConfig/gen-config.sh --force --cmake-fragment gen-config.cmake $GENCONFIG_BUILD_NAME $TRILINOS_SRC

Using system 'rhel7' based on matching hostname 'ascicgpu032'.
Overriding system to 'rhel8' based on specification in build name 'rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.

+==============================================================================+
|   ERROR:  Unable to find alias or environment name for system 'rhel8' in
|           build name 'rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
|
|   - Supported Environments for 'rhel8':
|     - aue-gcc-openmpi
|     - cuda-gcc-openmpi
|     - gcc-openmpi
|     - gcc-serial
|     - oneapi-intelmpi
|     - sems-clang-11.0.1-openmpi-4.0.5-serial
|     - sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6
|     - sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.1.4
|     - sems-gnu-8.5.0-openmpi-4.1.6-serial
|     - sems-gnu-8.5.0-serial
|
|   See `../../../../home/jhu/trilinos/Trilinos-ihalim/packages/framework/ini-files/supported-envs.ini` for details on the available environments.
|
|   To force-load an environment see the guidance in the `--help` output.
|
+==============================================================================+
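
Before running gen-config.sh, a quick way to see whether an environment name is present in your checkout's supported-envs.ini is a plain text search. This is a sketch under assumptions: the helper name `env_is_listed` is made up, and it only does a fixed-string match on the file named in the error message without parsing the ini format.

```shell
# Hypothetical helper: succeed if the environment name appears anywhere in the
# given file. Fixed-string match only; the ini structure is not parsed, so a
# shorter name can also match a longer one that contains it.
env_is_listed() {
  env_name="$1"
  ini_file="$2"
  grep -q -F -- "$env_name" "$ini_file"
}
```

For example, `env_is_listed sems-gnu-8.5.0-openmpi-4.1.6-openmp "$TRILINOS_SRC/packages/framework/ini-files/supported-envs.ini" || echo "not in this checkout"`.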

@achauphan
Contributor

@jhux2 do you know how old the branch you're using is? The sems-gnu-8.5.0-openmpi-4.1.6-openmp environment was added to the develop branch on 06-18-24 and can be seen here

@jhux2
Member Author

jhux2 commented Jul 27, 2024

@achauphan Yes, that's the issue, thank you. The branch that I was using was just a bit too old to have that commit.
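
For anyone hitting the same thing, one rough way to check how stale a branch is with respect to the environment list is to ask git when the file last changed on that branch. This is a sketch with an invented helper name, `last_change_date`, and it assumes the new environment was added in supported-envs.ini, which this thread does not state explicitly.

```shell
# Hypothetical helper: print the committer date (YYYY-MM-DD) of the most recent
# commit on the current branch that touched the given path.
# $1 = path relative to the repository root, $2 = repository directory (default: .)
last_change_date() {
  git -C "${2:-.}" log -1 --format=%cd --date=short -- "$1"
}
```

For example, `last_change_date packages/framework/ini-files/supported-envs.ini "$TRILINOS_SRC"`. If the date printed is earlier than 06-18-24, the branch likely predates the new environment and would need to be updated from develop.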

@jhux2
Member Author

jhux2 commented Jul 30, 2024

I'm closing this as fixed. Thank you, @achauphan.

@jhux2 jhux2 closed this as completed Jul 30, 2024
@jhux2 jhux2 unpinned this issue Jul 30, 2024
5 participants