
Update config_machines.xml for Feb 18 software update on Frontier. #6977

Open

trey-ornl wants to merge 1 commit into master
Conversation

trey-ornl
Contributor

Various changes to config_machines.xml in preparation for system software updates on Frontier on 2025-02-18. Within the frontier-scream-gpu machine:

  • Change lmod paths from /usr/share to /opt/cray/pe because the /usr/share software is not maintained and is not available on internal OLCF test computers.
  • Update craygnuamdgpu to the latest version of cpe available on Frontier, cpe/24.11.
  • Add explicit versions to the other modules in craygnuamdgpu. I used to think this wasn't needed, since module load cpe/24.11 should set all the appropriate defaults. I recently discovered that this setting of defaults only works if module load cpe/24.11 is a separate command issued before the other module loads. CIME uses a single module load with a list of modules, which does not update the defaults (see the shell sketch after this list). I selected the explicit versions based on the cpe/24.11 defaults, including the default for rocm, rocm/6.2.4.
  • Remove the libfabric module from craygnuamdgpu. On Feb 18, the default libfabric will change from libfabric/1.20.1 to libfabric/1.22.0, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the libfabric module from craygnuamdgpu so that it will stay with the default when it changes.
  • Change from libfabric/1.15.2.0 to libfabric/1.20.1 for crayclang-scream. On Feb 18, libfabric/1.15.2.0 will disappear from Frontier, and the new default, libfabric/1.22.0, does not support Cray MPI versions before cray-mpich/8.1.28. Because crayclang-scream is stuck back at cpe/22.12 and cray-mpich/8.1.26, we have to use libfabric/1.20.1. We previously switched from libfabric/1.20.1 to libfabric/1.15.2.0 because the former had a serious performance regression; that is supposed to be fixed now. And, yes, this does mean that crayclang-scream will be using a version of libfabric that isn't officially supported with the libfabric/1.22.0 drivers, but that's life.
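To make the defaults subtlety concrete, here is a sketch of the two invocation styles (illustrative only, not captured from Frontier):

```sh
# Separate commands: loading cpe/24.11 first updates the module defaults,
# so a bare "module load rocm" then resolves to rocm/6.2.4.
module load cpe/24.11
module load rocm

# Single command with a list (the style CIME generates): the defaults are
# not refreshed mid-command, so rocm resolves to the old system default.
module load cpe/24.11 rocm
```

And a minimal sketch of what this implies for the craygnuamdgpu block in config_machines.xml, using the standard CIME modules/command syntax; only the cpe and rocm versions come from this PR, the rest is placeholder:

```xml
<modules compiler="craygnuamdgpu">
  <!-- versions pinned explicitly to the cpe/24.11 defaults -->
  <command name="load">cpe/24.11</command>
  <command name="load">rocm/6.2.4</command>
  <!-- ...other modules pinned the same way; deliberately no libfabric
       entry, so runs pick up whatever default HPE ships on Feb 18 -->
</modules>
```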

I ran an 8-node test of NE30 on Frontier and on our internal system (for libfabric/1.22.0) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in e3sm_timing.t.*.

| Compiler         | cpe   | rocm  | libfabric | CPL:ATM_RUN     |
|------------------|-------|-------|-----------|-----------------|
| crayclang-scream | 22.12 | 5.4.0 | 1.15.2.0  | 15.855: 16.571  |
| crayclang-scream | 22.12 | 5.4.0 | 1.20.1    | 15.885: 16.683  |
| craygnuamdgpu    | 24.07 | 6.2.0 | 1.15.2.0  | 14.850: 15.594  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.15.2.0  | 14.727: 15.644  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.20.1    | 14.729: 15.447  |
| craygnuamdgpu    | 24.11 | 6.2.4 | 1.22.0    | 14.990: 15.601  |
It would probably be good to confirm with NE1024 runs.
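For anyone reproducing these numbers, a hedged sketch of how they can be pulled from the timing summaries (the case directory path is a placeholder, and the files may be gzipped depending on configuration):

```sh
# hypothetical case directory; e3sm_timing.* summaries land under timing/
grep 'CPL:ATM_RUN' "$CASE_DIR"/timing/e3sm_timing.*
```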

@grnydawn
Contributor

grnydawn commented Feb 6, 2025

@jgfouca , just to check: do you think crayclang-scream needs any changes due to these system software updates on Frontier?

@jgfouca
Member

jgfouca commented Feb 6, 2025

@grnydawn my guess, looking at the module changes, is no.

@ndkeen
Contributor

ndkeen commented Feb 6, 2025

Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

@trey-ornl
Contributor Author

> Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

The chart above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test libfabric/1.22.0 on the internal system.

I compared BFB using the bfbhash> lines (step 234) of the NE30 runs (a sketch of the comparison follows this list):

  • Changing crayclang-scream from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu from cpe/24.07+rocm/6.2.0 to cpe/24.11+rocm/6.2.4, both with libfabric/1.15.2.0, was not BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.20.1 on Frontier to libfabric/1.22.0 on our internal computer was not BFB.
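For reference, a minimal sketch of this kind of hash comparison, assuming bfbhash> lines in the atm logs; the run directories and log name pattern are hypothetical:

```sh
# extract the bfbhash> lines from two runs' atm logs and compare them
grep -h 'bfbhash>' runA/run/atm.log.* > hashes_a.txt
grep -h 'bfbhash>' runB/run/atm.log.* > hashes_b.txt
diff hashes_a.txt hashes_b.txt && echo "BFB" || echo "not BFB"
```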

@ndkeen
Contributor

ndkeen commented Feb 7, 2025

I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

@grnydawn
Contributor

grnydawn commented Feb 7, 2025

@jgfouca , @rljacob could you assign at least one reviewer for this PR?

@rljacob
Member

rljacob commented Feb 7, 2025

I would like us to settle the machine/compiler definition issue (#6773) first and then apply these changes to the unified entry.

@rljacob rljacob requested a review from jgfouca February 7, 2025 20:44
@xylar
Contributor

xylar commented Feb 7, 2025

Would this also be a good chance to set the missing MPICH_GPU_SUPPORT_ENABLED variables? See
E3SM-Project#198
E3SM-Project/polaris#275
E3SM-Project#196
E3SM-Project/mache#231
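For context, a minimal sketch of the kind of entry those issues are asking for, using the standard config_machines.xml environment-variable syntax; placement and any guards are assumptions:

```xml
<environment_variables>
  <!-- enable GPU-aware MPI in cray-mpich; attributes/conditions are illustrative -->
  <env name="MPICH_GPU_SUPPORT_ENABLED">1</env>
</environment_variables>
```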

@trey-ornl
Contributor Author

@rljacob, @xylar This pull request has some urgency, as it is needed before the Frontier software changes scheduled for February 18.

@rljacob
Member

rljacob commented Feb 7, 2025

I'm aware but 11 days seems like enough time to settle the machine/compiler name issue.

@xylar
Contributor

xylar commented Feb 7, 2025

Okay, the MPICH_GPU_SUPPORT_ENABLED issue can be addressed separately. It just seemed like it would fit and spare us another PR. This was done for Perlmutter nearly three years ago: #4833

@trey-ornl
Contributor Author

> I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

I am trying an NE1024 test from @whannah1, and I got a seg fault for the new compiler setting. I'm investigating with NE30 and NE256 versions of the same test. To be continued...
