Update config_machines.xml for Feb 18 software update on Frontier. #6977
@jgfouca, just to check, do you think
@grnydawn my guess, looking at the module changes, is no.
Have you verified that the tests are BFB and that performance is similar, including for ne256 and ne1024?
The chart above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test. I compared BFB using
I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.
I would like us to settle the machine/compiler definition issue (#6773) first and then apply these changes to the unified entry.
Would this also be a good chance to set the missing
I'm aware, but 11 days seems like enough time to settle the machine/compiler name issue.
Okay, the
I am trying an NE1024 test from @whannah1, and I got a seg fault with the new compiler setting. I'm investigating with NE30 and NE256 versions of the same test. To be continued...
Various changes to `config_machines.xml` in preparation for system software updates on Frontier on 2025-02-18. Within the `frontier-scream-gpu` machine:

- Changed `lmod` paths from `/usr/share` to `/opt/cray/pe` because the `/usr/share` software is not maintained and is not available on internal OLCF test computers.
- Updated `craygnuamdgpu` to the latest version of `cpe` available on Frontier, `cpe/24.11`.
- Added explicit module versions for `craygnuamdgpu`. I used to think this wasn't needed, since `module load cpe/24.11` should set all the appropriate defaults. I recently discovered that the defaults are only set if `module load cpe/24.11` is a separate command before the other `module load`s. CIME uses a single `module load` with a list of modules, which does not update the defaults. I selected the explicit versions based on the `cpe/24.11` defaults, including the default for `rocm`, `rocm/6.2.4`.
- Removed the `libfabric` module from `craygnuamdgpu`. On Feb 18, the default `libfabric` will change from `libfabric/1.20.1` to `libfabric/1.22.0`, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the `libfabric` module from `craygnuamdgpu` so that it will stay with the default when it changes.
- Changed `libfabric/1.15.2.0` to `libfabric/1.20.1` for `crayclang-scream`. On Feb 18, `libfabric/1.15.2.0` will disappear from Frontier, and the new default, `libfabric/1.22.0`, does not support Cray MPI versions before `cray-mpich/8.1.28`. Because `crayclang-scream` is stuck back at `cpe/22.12` and `cray-mpich/8.1.26`, we have to use `libfabric/1.20.1`. We previously switched from `libfabric/1.20.1` to `libfabric/1.15.2.0` because the former had a serious performance regression. That is supposed to be fixed. And, yes, this does mean that `crayclang-scream` will be using a version of `libfabric` that isn't officially supported with the `libfabric/1.22.0` drivers, but that's life.

I ran an 8-node test of NE30 on Frontier and on our internal system (for `libfabric/1.22.0`) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in `e3sm_timing.t.*`.

[Table: timing results for `crayclang-scream` and `craygnuamdgpu`, with columns `cpe`, `rocm`, `libfabric`, and `CPM:ATM_RUN`.]

It would probably be good to confirm with NE1024 runs.
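
Since CIME loads all modules in a single `module load` command, any module listed without a version will not pick up the `cpe/24.11` defaults. A quick way to catch that in review might be a small script like the sketch below. It is only an illustration: the XML fragment is a made-up example in the style of `config_machines.xml` (not the actual Frontier entry), the `hypothetical-unpinned` compiler name and the function `unpinned_modules` are invented for this example, and the "no slash means no version" rule is an assumption that would flag modules that are intentionally unversioned.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of config_machines.xml.
# This is NOT the real Frontier entry; "hypothetical-unpinned" is invented.
FRAGMENT = """
<module_system type="module">
  <modules compiler="craygnuamdgpu">
    <command name="load">cpe/24.11</command>
    <command name="load">rocm/6.2.4</command>
  </modules>
  <modules compiler="hypothetical-unpinned">
    <command name="load">cpe/24.11</command>
    <command name="load">rocm</command>
  </modules>
</module_system>
"""


def unpinned_modules(xml_text):
    """Return (compiler, module) pairs for load commands with no '/version' suffix."""
    root = ET.fromstring(xml_text)
    found = []
    for modules in root.iter("modules"):
        compiler = modules.get("compiler", "")
        for cmd in modules.iter("command"):
            name = (cmd.text or "").strip()
            # Assumption: a pinned module is written as "name/version".
            if cmd.get("name") == "load" and name and "/" not in name:
                found.append((compiler, name))
    return found


if __name__ == "__main__":
    for compiler, module in unpinned_modules(FRAGMENT):
        print(f"{compiler}: '{module}' has no explicit version")
```

Run against the fragment above, this reports only the `rocm` entry under the invented compiler block; the `craygnuamdgpu` block passes because both of its modules carry explicit versions.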