Redesigned "caar loop pre-boundary exchange", tuned for Frontier, also faster on Perlmutter GPU #6972

Draft
wants to merge 2 commits into
base: master

trey-ornl
Contributor

@trey-ornl trey-ornl commented Feb 4, 2025

This pull request attempts to provide Frontier optimizations even better than those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6, and with additional changes to reduce slowdown on Perlmutter CPUs.

It replaces pull request #6522.

Summary of changes:

  • Create multiple new structs in SphereOperators.hpp to use for the new Caar pre-boundary exchange.
  • Create macros in SphereOperators.hpp that allow code that uses implicit parallelism and vector registers on GPUs to add explicit loops and temporary arrays on CPUs.
  • Use #if (WARP_SIZE == 1) preprocessor directives in SphereOperators.hpp to try to minimize CPU-specific code.
  • Add indexing functions like zbelow in SphereOperators.hpp to support Scalar types with VECTOR_SIZE > 1 on CPUs.
  • Add new file CaarFunctorImpl.cpp that implements the new caar_compute function and template functions with Kokkos loops. The single source code supports both GPUs and CPUs by relying on functions and macros defined in SphereOperators.hpp.
  • Update CaarFunctorImpl.hpp with new functions, slight changes to temporary buffers, and #if to turn on/off the new caar_compute. If we adopt the new implementation permanently, significant code can be eliminated from this file.
  • Add new viewAsReal functions in ViewUtils.hpp.
  • Add preprocessor directives around LaunchBounds<512,1> calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads.
  • Add frontier-bfb.cmake, frontier-bfb-serial.cmake, and pm-cpu-bfb.cmake files for bit-for-bit unit testing of Caar.

I confirmed that the modified code passes the caar_ut unit test, and I ran a single-node NE30 test from Noel Keen on Frontier, Perlmutter GPU, and Perlmutter CPU. Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node (8 on Frontier GPU, 4 on Perlmutter GPU, 128 on Perlmutter CPU).

| Machine | Original code, #if 0 (s) | New code, #if 1 (s) | Speedup |
|---|---|---|---|
| Frontier GPU | 182.5 | 73.42 | 2.49 |
| Perlmutter GPU | 85.36 | 55.69 | 1.53 |
| Perlmutter CPU Gnu | 12449 | 13697 | 0.909 |
| Perlmutter CPU Intel | 6801 | 12952 | 0.525 |

The good news is that the new code is faster on both Frontier and Perlmutter GPUs. The bad news is that it is slower on Perlmutter CPUs, especially with the Intel compiler: the new code appears to inhibit whatever optimizations give the Intel compiler its advantage over the Gnu compiler on the original code.

@ambrad
Member

ambrad commented Feb 5, 2025

> Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node

Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table?

As an example, from this line

"a:compute_stage_value_dirk"  -    320      320 4.608000e+06   1.360819e+04    43.766 (   216      0)    39.147 (   296      0)

we see that the maximum is 43.766 sec and the average is 1.360819e+04/320 = 42.52559375. The quantity being measured is "time spent in DIRK over the course of the run", where each rank makes 4.608000e+06/320 = 14400.0 calls to DIRK over the course of the run.

@ambrad ambrad marked this pull request as draft February 5, 2025 02:07
@trey-ornl
Contributor Author

> Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table?

The times listed above are from the fourth column of numbers in the timer output, 1.360819e+04 in your example. Here is the raw output used for the table.

Frontier GPU original:

login03:/lustre/orion/proj-shared/cli200/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu.craygnuamdgpu.5d.nocosp.p3sk.shocsk.25-01-02-11.58.09/case_scripts/timing$ zgrep 'caar compute' e3sm_timing_stats.2921621.250108-174410.gz
"a:caar compute"                                              -          8        8 1.152000e+05   1.825385e+02    22.864 (     2      0)    22.764 (     1      0)

Frontier GPU new:

login03:/lustre/orion/proj-shared/cli200/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu.craygnuamdgpu.5d.nocosp.p3sk.shocsk.25-01-02-11.58.09/case_scripts/timing$ zgrep 'caar compute' e3sm_timing_stats.2921576.250108-171302.gz
"a:caar compute"                                              -          8        8 1.152000e+05   7.342101e+01     9.192 (     7      0)     9.147 (     0      0)

Perlmutter GPU original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu.gnugpu.5d.nocosp.p3sk.shocsk.25-01-09-08.22.26/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34721245.250109-100816.gz
"a:caar compute"                                              -          4        4 5.760000e+04   8.706921e+01    21.814 (     3      0)    21.730 (     1      0)

Perlmutter GPU new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu.gnugpu.5d.nocosp.p3sk.shocsk.25-01-09-08.22.26/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719801.250109-083940.gz
"a:caar compute"                                              -          4        4 5.760000e+04   5.568991e+01    14.070 (     2      0)    13.750 (     0      0)

Perlmutter CPU Gnu original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.gnu.5d.nocosp.p3sk.shocsk.25-01-09-07.54.07/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34720749.250109-091742.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.244864e+04   102.537 (    16      0)    94.639 (   101      0)

Perlmutter CPU Gnu new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.gnu.5d.nocosp.p3sk.shocsk.25-01-09-07.54.07/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719528.250109-083936.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.369728e+04   138.202 (    11      0)    87.969 (   101      0)

Perlmutter CPU Intel original:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.intel.5d.nocosp.p3sk.shocsk.25-01-09-08.03.14/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34720964.250109-093013.gz
"a:caar compute"                                              -        128      128 1.843200e+06   6.800835e+03    61.787 (    14      0)    47.484 (   101      0)

Perlmutter CPU Intel new:

trey@perlmutter:login30:/pscratch/sd/t/trey/nk-ne30/cases/t.caar.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu.intel.5d.nocosp.p3sk.shocsk.25-01-09-08.03.14/case_scripts/timing> zgrep 'caar compute' e3sm_timing_stats.34719150.250109-082107.gz
"a:caar compute"                                              -        128      128 1.843200e+06   1.295275e+04   136.372 (    14      0)    80.493 (   103      0)

@ambrad
Member

ambrad commented Feb 5, 2025

Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.

@trey-ornl
Contributor Author

> Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.

Oops! Fixed.
