Redesigned "caar loop pre-boundary exchange", tuned for Frontier, also faster on Perlmutter GPU #6972
base: master
Conversation
Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table? As an example, from this line
we see that the maximum is 43.766 sec and the average is 1.360819e+04/320 = 42.52559375. The quantity being measured is "time spent in DIRK over the course of the run", where each rank makes 4.608000e+06/320 = 14400.0 calls to DIRK over the course of the run.
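For anyone repeating this check, here is the same arithmetic as a tiny stand-alone snippet. The three input numbers are copied from the quoted timer line; everything else is only for illustration:

```cpp
// Illustrative check of the arithmetic above (inputs copied from the timer line).
#include <cstdio>

int main() {
  const double total_time_sec = 1.360819e+04;  // wallclock summed over all ranks
  const double total_calls    = 4.608000e+06;  // DIRK calls summed over all ranks
  const int    nranks         = 320;

  std::printf("average time per rank = %.8f s\n", total_time_sec / nranks);  // 42.52559375
  std::printf("calls per rank        = %.1f\n",   total_calls / nranks);     // 14400.0
  return 0;
}
```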
The times listed above are from the fourth column of numbers in the timer output for each case: Frontier GPU original, Frontier GPU new, Perlmutter GPU original, Perlmutter GPU new, Perlmutter CPU Gnu original, Perlmutter CPU Gnu new, Perlmutter CPU Intel original, and Perlmutter CPU Intel new.
Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.
Oops! Fixed.
This pull request attempts to provide Frontier optimizations even better than those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6, and with additional changes to reduce slowdown on Perlmutter CPUs.
It replaces pull request #6522.
Summary of changes:
- `struct`s in `SphereOperators.hpp` to use for the new Caar pre-boundary exchange.
- Macros in `SphereOperators.hpp` that allow code that uses implicit parallelism and vector registers on GPUs to add explicit loops and temporary arrays on CPUs (a simplified sketch of this pattern follows the list).
- `#if (WARP_SIZE == 1)` preprocessor directives in `SphereOperators.hpp` to try to minimize CPU-specific code.
- `zbelow` in `SphereOperators.hpp` to support `Scalar` types with `VECTOR_SIZE` > 1 on CPUs.
- `CaarFunctorImpl.cpp`, which implements the new `caar_compute` function and template functions with Kokkos loops. The single source code supports both GPUs and CPUs by relying on functions and macros defined in `SphereOperators.hpp`.
- `CaarFunctorImpl.hpp` with new functions, slight changes to temporary buffers, and an `#if` to turn the new `caar_compute` on and off. If we adopt the new implementation permanently, significant code can be eliminated from this file.
- `viewAsReal` functions in `ViewUtils.hpp` (a toy illustration of this kind of reinterpretation also follows the list).
- `LaunchBounds<512,1>` calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads (see the Kokkos sketch at the end of this description).
- `frontier-bfb.cmake`, `frontier-bfb-serial.cmake`, and `pm-cpu-bfb.cmake` files for bit-for-bit unit testing of Caar.
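As referenced in the macro item above, the general idea can be sketched as follows. This is a minimal, hypothetical illustration, assuming an invented `POINT_LOOP` macro and a plain C array; the real macros in `SphereOperators.hpp` differ in names, arguments, and in how they obtain the GPU thread indices:

```cpp
// Hypothetical sketch of the GPU-implicit / CPU-explicit-loop pattern.
// Not the actual SphereOperators.hpp macros; names and details are invented.
#include <cstdio>

constexpr int NP = 4;  // GLL points per element edge, as in HOMME

#ifndef WARP_SIZE
#define WARP_SIZE 1    // build this sketch as if for a CPU
#endif

#if (WARP_SIZE == 1)
// CPU: wrap the per-point body in explicit loops over the GLL points.
#define POINT_LOOP(IGP, JGP, BODY)            \
  for (int IGP = 0; IGP < NP; ++IGP)          \
    for (int JGP = 0; JGP < NP; ++JGP) { BODY }
#else
// GPU: the Kokkos team already provides one thread per point, so the body
// runs once per thread with IGP/JGP taken from the thread index (omitted here).
#define POINT_LOOP(IGP, JGP, BODY) { BODY }
#endif

int main() {
  double field[NP][NP] = {};
  // The same source line becomes an explicit loop on CPUs and stays
  // implicitly parallel on GPUs.
  POINT_LOOP(igp, jgp, field[igp][jgp] = igp + jgp;)
  std::printf("field[%d][%d] = %g\n", NP - 1, NP - 1, field[NP - 1][NP - 1]);
  return 0;
}
```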
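And as referenced in the `viewAsReal` item above, the following toy snippet shows the kind of pack-to-real reinterpretation such helpers perform. The stand-in `Scalar` struct and the `as_real` function are invented for this sketch; the actual `ViewUtils.hpp` helpers operate on Kokkos views:

```cpp
// Illustration only: reinterpret an array of small vector packs ("Scalar")
// as a flat array of doubles ("Real"). Names are invented for this sketch.
#include <cstdio>

constexpr int VECTOR_SIZE = 4;

struct Scalar {                 // stand-in for a SIMD pack of Reals
  double v[VECTOR_SIZE];
};

// A viewAsReal-like helper: same memory, different element type.
inline double* as_real(Scalar* packed) {
  return reinterpret_cast<double*>(packed);
}

int main() {
  constexpr int NUM_LEV = 2;    // number of packs per column in this toy example
  Scalar column[NUM_LEV] = {{{0, 1, 2, 3}}, {{4, 5, 6, 7}}};

  double* levels = as_real(column);          // NUM_LEV * VECTOR_SIZE doubles
  for (int k = 0; k < NUM_LEV * VECTOR_SIZE; ++k)
    std::printf("level %d -> %g\n", k, levels[k]);
  return 0;
}
```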
I confirmed that the modified code passes the `caar_ut` unit test, and I ran a single-node NE30 test from Noel Keen on Frontier, Perlmutter GPU, and Perlmutter CPU. Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node (8 on Frontier GPU, 4 on Perlmutter GPU, 128 on Perlmutter CPU), between the original code (`#if 0`) and the new code (`#if 1`).
The good news is that the new code is faster on both Frontier and Perlmutter GPUs. The bad news is that it slows down Perlmutter CPUs. In particular, it appears to inhibit whatever optimization the Intel compiler is able to do over the Gnu compiler.
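For context on the `LaunchBounds<512,1>` item in the summary, here is a sketch of how such bounds appear on a Kokkos `TeamPolicy`. The policy sizes and kernel bodies are invented for illustration; the point is that `LaunchBounds<512,1>` compiles the kernel for at most 512 threads per block, which conflicts with the 1024-thread teams Kokkos may choose on AMD GPUs, as noted above:

```cpp
// Sketch of LaunchBounds on a Kokkos TeamPolicy (illustrative only; the real
// HOMME policies and functors differ). LaunchBounds<512,1> caps the kernel at
// 512 threads per block, so a team of 1024 threads would not fit.
#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_bounded = Kokkos::TeamPolicy<Kokkos::LaunchBounds<512, 1>>;
    using policy_default = Kokkos::TeamPolicy<>;  // let Kokkos pick the bounds

    const int nteams = 16;
    Kokkos::parallel_for(
        "bounded_kernel", policy_bounded(nteams, Kokkos::AUTO),
        KOKKOS_LAMBDA(const policy_bounded::member_type& team) {
          // ... per-team work ...
          (void)team;
        });

    Kokkos::parallel_for(
        "default_kernel", policy_default(nteams, Kokkos::AUTO),
        KOKKOS_LAMBDA(const policy_default::member_type& team) {
          // ... per-team work ...
          (void)team;
        });
  }
  Kokkos::finalize();
  return 0;
}
```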