WIP: GPU Port using OpenMP #17
base: simpleGPU
Conversation
Add pre-commit config
I have to stop here (might continue next week), but this is already working pretty well. The biggest remaining task is to change the way the global transposes work: they have to become alltoallv calls and use device buffers (roughly the pattern sketched below).
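For illustration only, roughly what a GPU-to-GPU transpose exchange could look like with OpenMP's `target data use_device_addr` and a GPU-aware MPI. Buffer and count names are made up, and it assumes the caller has already mapped the buffers to the device:

```fortran
! Hypothetical sketch, not the project's transpose code: hand device-resident
! buffers to a GPU-aware MPI_Alltoallv via OpenMP's use_device_addr clause.
subroutine transpose_exchange(sendbuf, recvbuf, scounts, sdispls, rcounts, rdispls, comm)
  use mpi
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  complex(real64), intent(in)    :: sendbuf(:)
  complex(real64), intent(inout) :: recvbuf(:)
  integer, intent(in) :: scounts(:), sdispls(:), rcounts(:), rdispls(:), comm
  integer :: ierr

  ! use_device_addr exposes the device addresses of the (already mapped) buffers,
  ! so a CUDA-/ROCm-aware MPI can move the data GPU-to-GPU without host staging.
  !$omp target data use_device_addr(sendbuf, recvbuf)
  call MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE_COMPLEX, &
                     recvbuf, rcounts, rdispls, MPI_DOUBLE_COMPLEX, &
                     comm, ierr)
  !$omp end target data
end subroutine transpose_exchange
```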
At the moment, GPU computations make up about 15% of the runtime, of which about 45% is linsolve and 30% is cuFFT kernels.
First profiling results (on istmhelios, 2xA5000 with 24 GB RAM). Mesh: 95x192x189, 2 scalars, 5 timesteps
That's surprisingly nice; the MPI transposes only take about 20% of the runtime. (The times without scalars are roughly half that, btw.) There's a lot of inefficient memory use at the moment (lots of full temporary arrays), so for now I can only fit Christian's Re_tau=500 case with half the x modes into memory (no scalars, 251 x 250 x 257 mesh), but that's decently fast at least (1.6 s per timestep on these relatively weak GPUs). Let's do some profiling on MI300A / H100 next 🚀
Currently, the memory use for a Re_tau=500 case is 80 GB, and therefore about 640 GB for a Re_tau=1000 case (the mesh roughly doubles in each of the three directions, so the footprint grows by ~8x). An MI300A node has 128 * 4 = 512 GB of HBM memory, so we need to trim the memory footprint a bit for it to fit (including scalars we will need 2 nodes for sure). But before we do any pointer juggling, I'd like @davecats to add Pr > 1 first ;)
Papers on expected MPI performance: https://arxiv.org/pdf/2408.14090 (H100), https://arxiv.org/pdf/2508.11298 (MI300A)
Here's an up-to-date comparison for the sample problem size 251 x 250 x 257, no scalars, single rank (kernel performance). So, TODOs in terms of kernel optimisation, @KarlVieths:
Scaling using crayftn on Hunter, this time using GPU-to-GPU transfers, averaged over 40 timesteps:
Scaling of buildrhs is problematic. And for a bigger problem (the local problem size is roughly twice that of the 1-rank case above):
I'll summarize remaining steps here:
I tried to overlap communication and computation; the code has been refactored to allow for it, but as of now it doesn't work - I'm in contact with NVIDIA and AMD folks about that. In the course of that refactoring I was also able to shrink the memory footprint significantly: a 251 x 250 x 257 case now requires 16 GB + 3 GB per scalar (+ 5 GB overhead if CHANNEL_OVERLAPPING is on). If running 1023 x 500 x 512 on 4 ranks, I get weird integer overflow errors in kernels, which I haven't been able to comprehensively fix. It works with 8 ranks, where I get about 2.2 s per timestep (+ 0.7 s per scalar).
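For reference, this is roughly the overlap pattern being aimed for: launch the device kernel for one block asynchronously with `nowait`, do the host-side MPI for the previous block meanwhile, then synchronize. Purely illustrative, with invented names (`step_with_overlap`, `exchange_block`), not the refactored channel code (which currently fails to do exactly this):

```fortran
! Illustrative only: invented names, placeholder computation.
subroutine step_with_overlap(work, n, nb)
  implicit none
  integer, intent(in) :: n, nb
  real(kind=8), intent(inout) :: work(n, nb)
  integer :: ib, i

  do ib = 2, nb
    ! Launch the device kernel for block ib without blocking the host ...
    !$omp target teams distribute parallel do nowait map(tofrom: work(:, ib))
    do i = 1, n
      work(i, ib) = 2.0d0 * work(i, ib)        ! placeholder device computation
    end do

    ! ... so the host can exchange the previous, already-finished block meanwhile.
    call exchange_block(work(:, ib - 1))       ! hypothetical host-side MPI call

    !$omp taskwait                             ! wait for the device kernel before the next block
  end do
end subroutine step_with_overlap
```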
It crashes inside the pack z to x kernel now.
Progress tracking:
ctest
Helpful flags / tools:
- `NVCOMPILER_ACC_NOTIFY=3` to print all data movements and kernel launches
- `compute-sanitizer` to learn about out-of-bounds accesses
- `nsys profile --stats=true --trace=cuda,osrt,openmp mpirun -n 1 build/channel` to get profiling statistics

Necessary code changes:
- Add `implicit none` to all subroutines.
- Device functions need `!$omp declare target(function-list)`. This needs to happen immediately before or after the function declaration.
- Global variables need `!$omp declare target(variable-list)`, which makes them available in device functions, but this does NOT add a map - `target link` is not supported by nvfortran yet. Instead, they need to be synchronized manually with `!$omp target update to(variable-list)` (see the first sketch after this list).
- Don't `declare target` arrays that don't have a size (and imo the compiler should error out if you try that). Instead, pass them as arguments to the device function.
- The `linsolve` subroutine was straightforward then, except for `d2vmat` and `etamat`, which had a race condition - but this could be trivially solved by adding another axis to those arrays. Alternative: make them `private` and only size them as `d2vmat(ny0:nyN+2,-2:2)` (see the second sketch after this list) -> turns out the compiler optimizes them away anyway.
- `buildrhs`: in order to have iz and ix first, and iz as the inner loop, we have to pre-compute `convolutions` for each iy, hence enlarge the array. This also enables us to use one single FFT kernel for the entire (local) domain (see the last sketch below).
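A minimal sketch of the declare-target / target-update pattern from the second and third bullets. Module and variable names (`grid_mod`, `ny`, `dy`) are made up, not the channel code's actual globals:

```fortran
module grid_mod
  implicit none
  ! Made-up globals standing in for the real ones: declare target gives them a
  ! device copy, but without target link that copy must be refreshed by hand.
  integer :: ny = 0
  real(kind=8) :: dy = 0.0d0
  !$omp declare target(ny, dy)
contains
  ! A device function: the declare target directive sits directly after the
  ! function statement.
  function stretch(iy) result(y)
    !$omp declare target(stretch)
    integer, intent(in) :: iy
    real(kind=8) :: y
    y = dy * real(iy, kind=8)
  end function stretch
end module grid_mod

subroutine setup_grid(n, spacing)
  use grid_mod
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in) :: spacing
  ny = n
  dy = spacing
  ! Push the new host values to the device copies before any target region reads them.
  !$omp target update to(ny, dy)
end subroutine setup_grid
```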
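And a sketch of the `private` work-array variant for a `linsolve`-style loop; bounds, names and the loop body are placeholders, not the real solver:

```fortran
! Illustrative only: the actual linsolve body is omitted, only the
! privatization of the small banded work arrays is shown.
subroutine linsolve_sketch(nx0, nxN, nz, ny0, nyN)
  implicit none
  integer, intent(in) :: nx0, nxN, nz, ny0, nyN
  integer :: ix, iz
  real(kind=8) :: d2vmat(ny0:nyN+2, -2:2), etamat(ny0:nyN+2, -2:2)

  ! Each (ix, iz) iteration gets its own copy of d2vmat/etamat, so concurrent
  ! iterations no longer race on shared scratch storage.
  !$omp target teams distribute parallel do collapse(2) private(d2vmat, etamat)
  do ix = nx0, nxN
    do iz = -nz, nz
      d2vmat = 0.0d0
      etamat = 0.0d0
      ! ... fill the bands and do the wall-normal solve for this (ix, iz) column ...
    end do
  end do
end subroutine linsolve_sketch
```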
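Finally, roughly how I read the `buildrhs` restructuring: `convolutions` is enlarged to hold every iy plane and filled up front (by one batched FFT over the whole local domain, not shown), so the offloaded loops can run over ix and iz while iy is walked inside each iteration. Shapes, names and the loop body are guesses:

```fortran
! Sketch of the intended loop ordering, not the actual buildrhs.
subroutine buildrhs_sketch(ny0, nyN, nz, nx0, nxN, convolutions, rhs)
  implicit none
  integer, intent(in) :: ny0, nyN, nz, nx0, nxN
  complex(kind=8), intent(in)    :: convolutions(ny0:nyN, -nz:nz, nx0:nxN)
  complex(kind=8), intent(inout) :: rhs(ny0:nyN, -nz:nz, nx0:nxN)
  integer :: ix, iz, iy

  ! convolutions already holds all iy planes, so nothing is recomputed per plane
  ! and the (ix, iz) pair can be parallelized with iy handled inside.
  !$omp target teams distribute parallel do collapse(2) map(to: convolutions) map(tofrom: rhs)
  do ix = nx0, nxN
    do iz = -nz, nz
      do iy = ny0, nyN
        rhs(iy, iz, ix) = rhs(iy, iz, ix) + convolutions(iy, iz, ix)   ! placeholder update
      end do
    end do
  end do
end subroutine buildrhs_sketch
```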