Redesigned "caar loop pre-boundary exchange", tuned for Frontier, also faster on Perlmutter GPU #6972

trey-ornl · 2025-02-04T23:01:17Z

This pull request attempts to provide Frontier optimizations even better than those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6, and with additional changes to reduce slowdown on Perlmutter CPUs.

It replaces pull request #6522.

Summary of changes:

- Create multiple new structs in `SphereOperators.hpp` to use for the new Caar pre-boundary exchange.
- Create macros in `SphereOperators.hpp` that allow code that uses implicit parallelism and vector registers on GPUs to add explicit loops and temporary arrays on CPUs.
- Use `#if (WARP_SIZE == 1)` preprocessor directives in `SphereOperators.hpp` to try to minimize CPU-specific code.
- Add indexing functions like `zbelow` in `SphereOperators.hpp` to support `Scalar` types with `VECTOR_SIZE > 1` on CPUs.
- Add new file `CaarFunctorImpl.cpp` that implements the new `caar_compute` function and template functions with Kokkos loops. The single source code supports both GPUs and CPUs by relying on functions and macros defined in `SphereOperators.hpp`.
- Update `CaarFunctorImpl.hpp` with new functions, slight changes to temporary buffers, and `#if` to turn on/off the new `caar_compute`. If we adopt the new implementation permanently, significant code can be eliminated from this file.
- Add new `viewAsReal` functions in `ViewUtils.hpp`.
- Add preprocessor directives around `LaunchBounds<512,1>` calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads.
- Add `frontier-bfb.cmake`, `frontier-bfb-serial.cmake`, and `pm-cpu-bfb.cmake` files for bit-for-bit unit testing of Caar.

I confirmed that the modified code passes the caar_ut unit test, and I ran a single-node NE30 test from Noel Keen on Frontier, Perlmutter GPU, and Perlmutter CPU. Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node (8 on Frontier GPU, 4 on Perlmutter GPU, 128 on Perlmutter CPU).

| Machine | Original Code `#if 0` (s) | New Code `#if 1` (s) | Speedup |
| --- | --- | --- | --- |
| Frontier GPU | 182.5 | 73.42 | 2.49 |
| Perlmutter GPU | 85.36 | 55.69 | 1.53 |
| Perlmutter CPU Gnu | 12449 | 13697 | 0.909 |
| Perlmutter CPU Intel | 6801 | 12952 | 0.525 |

The good news is that the new code is faster on both Frontier and Perlmutter GPUs. The bad news is that it slows down Perlmutter CPUs. In particular, it appears to inhibit whatever optimization the Intel compiler is able to do over the Gnu compiler.
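For readers who want to re-derive the Speedup column, here is a minimal sketch, assuming speedup is simply the original time divided by the new time. The dictionary and function names are made up for illustration; the numbers are copied from the table above.

```python
# Summed "caar compute" times copied from the table: (original, new).
TIMES = {
    "Frontier GPU": (182.5, 73.42),
    "Perlmutter GPU": (85.36, 55.69),
    "Perlmutter CPU Gnu": (12449, 13697),
    "Perlmutter CPU Intel": (6801, 12952),
}


def speedup(original, new):
    """Speedup of the new code; values below 1 mean a slowdown."""
    return original / new


if __name__ == "__main__":
    for machine, (original, new) in TIMES.items():
        # Three significant digits, matching the precision used in the table.
        print(f"{machine}: {speedup(original, new):.3g}")
```

A value above 1 means the new code is faster; the two CPU rows come out below 1, which is the slowdown described above.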

ambrad · 2025-02-05T01:57:26Z

> Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node

Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table?

As an example, from this line

"a:compute_stage_value_dirk"  -    320      320 4.608000e+06   1.360819e+04    43.766 (   216      0)    39.147 (   296      0)

we see that the maximum is 43.766 sec and the average is 1.360819e+04/320 = 42.52559375. The quantity being measured is "time spent in DIRK over the course of the run", where each rank makes 4.608000e+06/320 = 14400.0 calls to DIRK over the course of the run.
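The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: the field names (`ranks`, `calls`, `wall_sum`, `wall_max`) are labels I chose, and treating the first count column as the number of ranks is an assumption based on the division by 320 above.

```python
# Example line quoted above from a timer file.
LINE = ('"a:compute_stage_value_dirk"  -    320      320 4.608000e+06   '
        '1.360819e+04    43.766 (   216      0)    39.147 (   296      0)')


def parse_timer_line(line):
    """Parse one timer line into a dict. Field names are illustrative labels."""
    _, name, rest = line.split('"', 2)
    # Drop the parentheses so every remaining token is whitespace-separated.
    tokens = rest.replace("(", " ").replace(")", " ").split()
    # tokens[0] is the "-" column; the numeric columns follow it.
    return {
        "name": name,
        "ranks": int(tokens[1]),       # assumed: first count column is the rank count
        "calls": float(tokens[3]),     # total calls, summed over ranks
        "wall_sum": float(tokens[4]),  # fourth number: time summed over ranks
        "wall_max": float(tokens[5]),  # maximum time over ranks
    }


def per_rank_stats(fields):
    """Return (average time per rank, calls per rank)."""
    return (fields["wall_sum"] / fields["ranks"],
            fields["calls"] / fields["ranks"])


if __name__ == "__main__":
    fields = parse_timer_line(LINE)
    average, calls = per_rank_stats(fields)
    print(fields["name"], fields["wall_max"], average, calls)
```

The same parser applies to any other line in this format, so the summed, maximum, and per-rank average values can be compared directly.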

trey-ornl · 2025-02-05T16:48:59Z

> Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table?

The times listed above are from the fourth column of numbers in the timer output, 1.360819e+04 in your example. Here is the raw output used for the table.

Frontier GPU original:

login03:/lustre/orion/proj-shared/cli200/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu.craygnuamdgpu.5d.nocosp.p3sk.shocsk.25-01-02-11.58.09/case_scripts/timing$ zgrep 'caar compute' e3sm_timing_stats.2921621.250108-174410.gz
"a:caar compute"                                              -          8        8 1.152000e+05   1.825385e+02    22.864 (     2      0)    22.764 (     1      0)

Frontier GPU new:

login03:/lustre/orion/proj-shared/cli200/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu.craygnuamdgpu.5d.nocosp.p3sk.shocsk.25-01-02-11.58.09/case_scripts/timing$ zgrep 'caar compute' e3sm_timing_stats.2921576.250108-171302.gz
"a:caar compute"                                              -          8        8 1.152000e+05   7.342101e+01     9.192 (     7      0)     9.147 (     0      0)

Perlmutter GPU original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu.gnugpu.5d.nocosp.p3sk.shocsk.25-01-09-08.22.26/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34721245.250109-100816.gz
"a:caar compute"                                              -          4        4 5.760000e+04   8.706921e+01    21.814 (     3      0)    21.730 (     1      0)

Perlmutter GPU new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu.gnugpu.5d.nocosp.p3sk.shocsk.25-01-09-08.22.26/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719801.250109-083940.gz
"a:caar compute"                                              -          4        4 5.760000e+04   5.568991e+01    14.070 (     2      0)    13.750 (     0      0)

Perlmutter CPU Gnu original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.gnu.5d.nocosp.p3sk.shocsk.25-01-09-07.54.07/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34720749.250109-091742.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.244864e+04   102.537 (    16      0)    94.639 (   101      0)

Perlmutter CPU Gnu new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.gnu.5d.nocosp.p3sk.shocsk.25-01-09-07.54.07/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719528.250109-083936.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.369728e+04   138.202 (    11      0)    87.969 (   101      0)

Perlmutter CPU Intel original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.intel.5d.nocosp.p3sk.shocsk.25-01-09-08.03.14/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34720964.250109-093013.gz
"a:caar compute"                                              -        128      128 1.843200e+06   6.800835e+03    61.787 (    14      0)    47.484 (   101      0)

Perlmutter CPU Intel new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.intel.5d.nocosp.p3sk.shocsk.25-01-09-08.03.14/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719150.250109-082107.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.295275e+04   136.372 (    14      0)    80.493 (   103      0)

ambrad · 2025-02-05T16:57:35Z

Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.

trey-ornl · 2025-02-05T17:04:30Z

> Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.

Oops! Fixed.

trey-ornl added 2 commits February 4, 2025 12:44

Merge branch trey/caar/frontier_gnu from trey-ornl/scream.

ce08ca3

Turn on tuned code in CaarFunctorImpl.hpp.

f804118

trey-ornl added HOMME Frontier labels Feb 4, 2025

trey-ornl requested a review from bartgol February 4, 2025 23:01

trey-ornl mentioned this pull request Feb 4, 2025

Redesigned "caar loop pre-boundary exchange", tuned for Frontier #6522

Closed

ambrad marked this pull request as draft February 5, 2025 02:07
