
Update config_machines.xml for Feb 18 software update on Frontier. #6977

Open

trey-ornl wants to merge 1 commit into master
Conversation

trey-ornl
Contributor

Various changes to config_machines.xml in preparation for system software updates on Frontier on 2025-02-18. Within the frontier-scream-gpu machine:

  • Change lmod paths from /usr/share to /opt/cray/pe because the /usr/share software is not maintained and is not available on internal OLCF test computers.
  • Update craygnuamdgpu to the latest version of cpe available on Frontier, cpe/24.11.
  • Add explicit versions to the other modules in craygnuamdgpu. I used to think this wasn't needed, since module load cpe/24.11 should set all the appropriate defaults. I recently discovered that this setting of defaults only works if module load cpe/24.11 is a separate command issued before the other module loads. CIME uses a single module load with a list of modules, which does not update the defaults (see the shell sketch after this list). I selected the explicit versions based on the cpe/24.11 defaults, including the default for rocm, rocm/6.2.4.
  • Remove the libfabric module from craygnuamdgpu. On Feb 18, the default libfabric will change from libfabric/1.20.1 to libfabric/1.22.0, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the libfabric module from craygnuamdgpu so that it will stay with the default when it changes.
  • Change from libfabric/1.15.2.0 to libfabric/1.20.1 for crayclang-scream. On Feb 18, libfabric/1.15.2.0 will disappear from Frontier, and the new default, libfabric/1.22.0, does not support Cray MPI versions before cray-mpich/8.1.28. Because crayclang-scream is stuck back at cpe/22.12 and cray-mpich/8.1.26, we have to use libfabric/1.20.1. We previously switched from libfabric/1.20.1 to libfabric/1.15.2.0 because the former had a serious performance regression; that is supposed to be fixed now. And, yes, this does mean that crayclang-scream will be using a version of libfabric that isn't officially supported with the libfabric/1.22.0 drivers, but that's life.
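To make the defaults subtlety concrete, here is a sketch of the two invocation styles (illustrative only, not captured from Frontier):

```sh
# Separate commands: loading cpe/24.11 first updates the module defaults,
# so a bare "module load rocm" then resolves to rocm/6.2.4.
module load cpe/24.11
module load rocm

# Single command with a list (the style CIME generates): the defaults are
# not refreshed mid-command, so rocm resolves to the old system default.
module load cpe/24.11 rocm
```

And a minimal sketch of what this implies for the craygnuamdgpu block in config_machines.xml, using the standard CIME modules/command syntax; only the cpe and rocm versions come from this PR, the rest is placeholder:

```xml
<modules compiler="craygnuamdgpu">
  <!-- versions pinned explicitly to the cpe/24.11 defaults -->
  <command name="load">cpe/24.11</command>
  <command name="load">rocm/6.2.4</command>
  <!-- ...other modules pinned the same way; deliberately no libfabric
       entry, so runs pick up whatever default HPE ships on Feb 18 -->
</modules>
```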

I ran an 8-node test of NE30 on Frontier and on our internal system (for libfabric/1.22.0) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in e3sm_timing.t.*.

| Compiler         | cpe   | rocm  | libfabric | CPL:ATM_RUN     |
|------------------|-------|-------|-----------|-----------------|
| crayclang-scream | 22.12 | 5.4.0 | 1.15.2.0  | 15.855: 16.571  |
| crayclang-scream | 22.12 | 5.4.0 | 1.20.1    | 15.885: 16.683  |
| craygnuamdgpu    | 24.07 | 6.2.0 | 1.15.2.0  | 14.850: 15.594  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.15.2.0  | 14.727: 15.644  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.20.1    | 14.729: 15.447  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.22.0    | 14.990: 15.601  |
It would probably be good to confirm with NE1024 runs.
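For anyone reproducing these numbers, a hedged sketch of how they can be pulled from the timing summaries (the case directory path is a placeholder, and the files may be gzipped depending on configuration):

```sh
# hypothetical case directory; e3sm_timing.* summaries land under timing/
grep 'CPL:ATM_RUN' "$CASE_DIR"/timing/e3sm_timing.*
```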

@grnydawn
Contributor

grnydawn commented Feb 6, 2025

@jgfouca , just to check: do you think crayclang-scream needs any changes due to these system software updates on Frontier?

@jgfouca
Member

jgfouca commented Feb 6, 2025

@grnydawn my guess, looking at the module changes, is no.

@ndkeen
Contributor

ndkeen commented Feb 6, 2025

Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

@trey-ornl
Contributor Author

> Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

The chart above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test libfabric/1.22.0 on the internal system.

I compared BFB using the bfbhash> lines (step 234) of the NE30 runs (a sketch of the comparison follows this list):

  • Changing crayclang-scream from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu from cpe/24.07+rocm/6.2.0 to cpe/24.11+rocm/6.2.4, both with libfabric/1.15.2.0, was not BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.20.1 on Frontier to libfabric/1.22.0 on our internal computer was not BFB.
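For reference, a minimal sketch of this kind of hash comparison, assuming bfbhash> lines in the atm logs; the run directories and log name pattern are hypothetical:

```sh
# extract the bfbhash> lines from two runs' atm logs and compare them
grep -h 'bfbhash>' runA/run/atm.log.* > hashes_a.txt
grep -h 'bfbhash>' runB/run/atm.log.* > hashes_b.txt
diff hashes_a.txt hashes_b.txt && echo "BFB" || echo "not BFB"
```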

@ndkeen
Contributor

ndkeen commented Feb 7, 2025

I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

@grnydawn
Contributor

grnydawn commented Feb 7, 2025

@jgfouca , @rljacob could you assign at least one reviewer for this PR?

@rljacob
Member

rljacob commented Feb 7, 2025

I would like us to settle the machine/compiler definition issue (#6773) first and then apply these changes to the unified entry.

@rljacob rljacob requested a review from jgfouca February 7, 2025 20:44
@xylar
Contributor

xylar commented Feb 7, 2025

Would this also be a good chance to set the missing MPICH_GPU_SUPPORT_ENABLED variables? See
E3SM-Project#198
E3SM-Project/polaris#275
E3SM-Project#196
E3SM-Project/mache#231
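For context, a minimal sketch of the kind of entry those issues are asking for, using the standard config_machines.xml environment-variable syntax; placement and any guards are assumptions:

```xml
<environment_variables>
  <!-- enable GPU-aware MPI in cray-mpich; attributes/conditions are illustrative -->
  <env name="MPICH_GPU_SUPPORT_ENABLED">1</env>
</environment_variables>
```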

@trey-ornl
Contributor Author

@rljacob, @xylar This pull request has some urgency, as it is needed before the Frontier software changes scheduled for February 18.

@rljacob
Member

rljacob commented Feb 7, 2025

I'm aware but 11 days seems like enough time to settle the machine/compiler name issue.

@xylar
Contributor

xylar commented Feb 7, 2025

Okay, the MPICH_GPU_SUPPORT_ENABLED issue can be addressed separately. It just seemed like it would fit and spare us another PR. This was done for Perlmutter nearly three years ago: #4833

@trey-ornl
Contributor Author

> I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

I am trying an NE1024 test from @whannah1, and I got a seg fault for the new compiler setting. I'm investigating with NE30 and NE256 versions of the same test. To be continued...
