
Conversation

@joneuhauser commented Sep 9, 2025

Progress tracking:

  • Consistent formatting to keep the final diff small (and to preserve it going forward via pre-commit)
  • CMake as build tool
  • Compiles and runs on ifort (tested with version 19.0.1) and nvfortran (24.7)
  • Add an automated test asserting correct data after 2 timesteps (see the sketch after this list)
    • Gets built automatically with CMake
    • Run with ctest
    • Test automatically runs on 1 or 2 MPI ranks, tested locally with nvfortran, ifort and gfortran
  • Port FFT routines to cuFFT / hipFFT
    • Currently only cuFFT, but hipFFT shares almost the same API
  • Port remaining hot loops to OpenMP
  • Port the MPI communication
  • Profile
  • Remove remaining D/H communication
  • Less redundant memory allocations
  • Check multi-node performance (Hunter / Horeka)
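
For illustration, here is a minimal sketch of the kind of assertion such a regression test can make; the names (check_field, v, v_ref, tol) are hypothetical and not taken from this PR, which may implement the check differently. A non-zero exit code is all ctest needs to mark the test as failed.

```fortran
! Hypothetical regression check: compare a field against reference data after
! 2 timesteps and fail the test (non-zero exit) if the deviation is too large.
subroutine check_field(v, v_ref, tol, comm)
  use mpi
  implicit none
  real(8), intent(in) :: v(:,:,:), v_ref(:,:,:), tol
  integer, intent(in) :: comm
  real(8) :: err_local, err_global
  integer :: ierr, rank

  err_local = maxval(abs(v - v_ref))          ! local maximum deviation
  call MPI_Allreduce(err_local, err_global, 1, MPI_DOUBLE_PRECISION, MPI_MAX, comm, ierr)
  call MPI_Comm_rank(comm, rank, ierr)
  if (err_global > tol) then
    if (rank == 0) print *, 'FAIL: max error', err_global, '>', tol
    call MPI_Abort(comm, 1, ierr)             ! ctest sees the non-zero exit code
  else
    if (rank == 0) print *, 'PASS: max error', err_global
  end if
end subroutine check_field
```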

Helpful flags / tools

  • Run with NVCOMPILER_ACC_NOTIFY=3 to print all data movements and kernel launches
  • Run with compute-sanitizer to learn about out-of-bounds accesses.
  • Run with nsys profile --stats=true --trace=cuda,osrt,openmp mpirun -n 1 build/channel to get profiling statistics

Necessary code changes:

  • Added implicit none to all subroutines.
  • Struggled quite a bit with module-level variables used inside device functions. Tentative solution (see the sketch after this list):
    • The functions themselves need to be decorated with !$omp declare target(function-list). This needs to happen immediately before or after the function declaration.
    • Scalars and fixed-size arrays: these can be put into !$omp declare target(variable-list), which makes them available in device functions, but this does NOT create a mapping - target link is not supported by nvfortran yet. Instead, they need to be synchronized manually with !$omp target update to(variable-list).
    • Allocatable arrays: these cannot be treated like scalars, because at the point where we would declare target them they don't have a size yet (and imo the compiler should error out if you try that). Instead, pass them as arguments to the device function.
  • Vectorizing the linsolve subroutine was then straightforward, except for d2vmat and etamat, which had a race condition - this could be trivially solved by adding another axis to those arrays.
    • Profile whether this is faster or whether we should just declare them as private and only size them as d2vmat(ny0:nyN+2,-2:2). -> Turns out the compiler optimizes them away anyway.
  • Modified the order of the loops in buildrhs. In order to have iz and ix first, with iz as the inner loop, we have to pre-compute the convolutions for each iy; hence, enlarge the array. This also enables us to use a single FFT kernel for the entire (local) domain.
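
To make the declare-target bullets above concrete, here is a minimal, self-contained sketch of the pattern. All names (params, dealias, scale_rhs, alfa0, d14mat, ...) are made up for illustration and are not the actual variables of this code; only the directive placement follows the description above.

```fortran
! Hypothetical module, illustrating the pattern only (not code from this PR).
module params
  implicit none
  real(8) :: alfa0, beta0            ! module-level scalars used in device code
  real(8) :: d14mat(-2:2, 1:5)       ! fixed-size module-level array
  ! creates device copies of the scalars / fixed-size array; no automatic mapping,
  ! since "declare target link" is not supported by nvfortran yet
  !$omp declare target(alfa0, beta0, d14mat)
contains
  real(8) function dealias(k, kmax)
    !$omp declare target(dealias)    ! directly after the function declaration
    real(8), intent(in) :: k, kmax
    dealias = merge(1.d0, 0.d0, abs(k) <= 2.d0/3.d0*kmax)
  end function dealias
end module params

subroutine scale_rhs(rhs, ny0, nyN)
  use params
  implicit none
  integer, intent(in) :: ny0, nyN
  real(8), intent(inout) :: rhs(ny0:nyN)   ! allocatable on the caller side:
                                           ! passed as an argument instead of declare target
  integer :: iy
  ! mirror the current host values of the module variables to their device copies
  !$omp target update to(alfa0, beta0, d14mat)
  !$omp target teams distribute parallel do map(tofrom: rhs)
  do iy = ny0, nyN
    rhs(iy) = alfa0 * rhs(iy) * dealias(real(iy, 8), real(nyN, 8))
  end do
end subroutine scale_rhs
```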

@joneuhauser commented Sep 11, 2025

I have to stop here (might continue next week), but this is already working pretty well; the biggest remaining task is to change the way the global transposes work: they have to become alltoallv calls and use device buffers (see the sketch after this list).

  • Then, I've allocated quite a bit of memory, not all of which we probably need; find out where and get rid of it.
  • Also, there are some redundant computations in the FFT; need to check those. The indices of the FFT work arrays needed reordering - done; as expected, that made the FFTs about 30% faster.
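
A rough sketch of what the device-buffer alltoallv could look like with OpenMP offload and GPU-aware MPI; the buffer and count names are hypothetical and this is not the PR's actual transpose routine. It assumes the pack buffers already live on the device (mapped by an enclosing target data region) and that the MPI library is CUDA-/ROCm-aware; depending on compiler and OpenMP level, use_device_addr may have to be spelled use_device_ptr.

```fortran
! Hypothetical transpose communication on device buffers (sketch only).
subroutine transpose_alltoallv(sendbuf, recvbuf, nbuf, scounts, sdispls, rcounts, rdispls, comm)
  use mpi
  implicit none
  integer, intent(in) :: nbuf, comm
  integer, intent(in) :: scounts(:), sdispls(:), rcounts(:), rdispls(:)
  complex(8), intent(inout) :: sendbuf(nbuf), recvbuf(nbuf)   ! already packed on the device
  integer :: ierr

  ! hand MPI the *device* addresses of the (already mapped) buffers
  !$omp target data use_device_addr(sendbuf, recvbuf)
  call MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE_COMPLEX, &
                     recvbuf, rcounts, rdispls, MPI_DOUBLE_COMPLEX, comm, ierr)
  !$omp end target data
end subroutine transpose_alltoallv
```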

At the moment, GPU computations make up about 15% of the runtime, of which about 45% is linsolve and 30% cuFFT kernels.

@joneuhauser commented Sep 17, 2025

First profiling results (on istmhelios, 2x A5000 with 24 GB RAM each). Mesh: 95x192x189, 2 scalars, 5 timesteps.

| Category | 2-rank time (s) | 2-rank % | 1-rank time (s) | 1-rank % |
| --- | --- | --- | --- | --- |
| linsolve | 0.976864 | 24.76% | 2.095284 | 32.73% |
| FFTs | 1.601523 | 40.60% | 3.177510 | 49.64% |
| buildrhs | 0.158021 | 4.01% | 0.314110 | 4.91% |
| convolutions | 0.265173 | 6.72% | 0.531981 | 8.31% |
| x-z transpose kernels | 0.235830 | 5.98% | 0.282503 | 4.41% |
| MPI | 0.707607 | 17.94% | – | 0.00% |
| Total (5 steps) | 3.94 | 100% | 6.40 | 100% |
| Per step (avg) | 0.789 | | 1.280 | |

That's surprisingly nice - the MPI transposes only take about 20% of the runtime. (The times without scalars are roughly half that, btw.)

There's a lot of inefficient memory use at the moment (lots of full temporary arrays), so right now I can only fit Christian's Re_tau=500 case with half the x modes into memory (no scalars, 251 x 250 x 257 mesh), but at least that's decently fast (1.6 s per timestep on these relatively weak GPUs).

Let's do some profiling on MI300A / H100 next 🚀

@joneuhauser

Currently, the memory use for an Re_tau=500 case is 80 GB, and therefore about 640 GB for an Re_tau=1000 case. An MI300A node has 4 * 128 GB = 512 GB of HBM, so we need to trim the memory footprint a bit for it to fit (including scalars, we will need 2 nodes for sure).
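
For reference, the arithmetic behind these numbers, assuming the resolution roughly doubles in each of the three directions from Re_tau=500 to Re_tau=1000 (which is what the 80 GB -> 640 GB figure implies):

$$\mathrm{Mem}_{Re_\tau=1000} \approx 2^3 \cdot \mathrm{Mem}_{Re_\tau=500} = 8 \cdot 80\ \mathrm{GB} = 640\ \mathrm{GB} \;>\; 4 \cdot 128\ \mathrm{GB} = 512\ \mathrm{GB\ (one\ MI300A\ node)}$$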

But before we do any pointer juggling, I'd like @davecats to add Pr > 1 first ;)

@joneuhauser

Papers on the expected MPI performance: https://arxiv.org/pdf/2408.14090 (H100), https://arxiv.org/pdf/2508.11298 (MI300A)

@joneuhauser

Here's an up-to-date comparison for the sample problem size 251 x 250 x 257, no scalars, single rank (kernel performance).

So, TODOs in terms of kernel optimisation, @KarlVieths:

  • NVIDIA: convolution kernels
  • AMD: buildrhs
| Kernel group | NVIDIA H100 (ms) | MI300A crayftn (ms) | MI300A amdfort 7.1.0 (ms) |
| --- | --- | --- | --- |
| Transpose packing | 71.0 | 48.08 | 52.1 |
| linsolve | 212.9 | 290.3 | 192.9 |
| buildrhs | 41.3 | 142.6 | 73.8 |
| FFTs | 68.7 | 69.7 | 76.1 |
| Convolution kernels | 149.0 | 61.3 | 74.2 |
| Total | 542.9 | 612.0 | 469.0 |

@joneuhauser commented Sep 25, 2025

Scaling using crayftn on Hunter, this time using GPU-to-GPU transfers, averaged over 40 timesteps:

```
mpirun -ppn 4 -np 4 --cpu-bind list:0-23:24-47:58-71:72-95 --gpu-bind list:0:1:2:3 rocprofv3 --sys-trace --marker-trace --kernel-trace --hsa-trace --output-format csv -- build/channel
```

| 251 x 250 x 257, no scalars | 1 rank (ms) | 2 ranks (ms) | 4 ranks (ms) |
| --- | --- | --- | --- |
| MPI comm | – | 92.2 | 84.4 |
| Transpose packing | 52.2 | 35.9 | 16.8 |
| linsolve | 194.4 | 91.8 | 31.8 |
| buildrhs | 122.5 | 62.6 | 103.5 |
| FFTs | 75.1 | 35.4 | 17.8 |
| convolution kernels | 68.3 | 29.8 | 19.1 |
| Total | 512.5 | 347.7 | 273.6 |

Scaling of buildrhs is problematic.

And for a bigger problem (the local problem size is roughly twice that of the 1-rank case above):

| 511 x 500 x 512, no scalars | 4 ranks (ms) |
| --- | --- |
| MPI comm | 487.6 |
| Transpose packing | 114.1 |
| linsolve | 392.8 |
| buildrhs | 240.1 |
| FFTs | 120.2 |
| convolution kernels | 108.3 |
| Other | 23.1 |
| Total | 1486.2 |

@joneuhauser

I'll summarize remaining steps here:

  • Intra-node MPI transfers on NVIDIA (Horeka / UC3)
  • y-parallelisation: allows smaller collectives for the alltoalls, so we can keep that traffic intra-node. In the y-direction, we only have to exchange boundary information for the pentadiagonal solve.
  • Separate out kernels so we can solve velocity and scalars one after the other, reducing the memory footprint and allowing communication to overlap with computation (while we communicate velocity data, we can already start computing the scalar data, etc.); a rough sketch of the intended pattern follows below.
  • Then switch to scratch arrays allocated directly on the device and reused, to reduce the memory footprint further.
  • Benchmarking, optimisation, scaling tests
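
A very rough sketch of the intended comm/compute overlap, with entirely hypothetical names (advance_overlapped, scal_old, vel_send, ...) and a placeholder kernel body. It only illustrates the nowait target-task pattern, assuming GPU-aware MPI and buffers that are already mapped to the device; as the next comment notes, this does not yet work reliably with the current compilers.

```fortran
! Hypothetical sketch: compute the scalar RHS on the GPU while the velocity
! all-to-all runs on the host thread, then join the deferred device task.
subroutine advance_overlapped(vel_send, vel_recv, nbuf, scal_old, scal_rhs, n, &
                              scounts, sdispls, rcounts, rdispls, comm)
  use mpi
  implicit none
  integer, intent(in) :: nbuf, n, comm
  integer, intent(in) :: scounts(:), sdispls(:), rcounts(:), rdispls(:)
  complex(8), intent(inout) :: vel_send(nbuf), vel_recv(nbuf)  ! mapped to the device elsewhere
  real(8), intent(in)  :: scal_old(n)
  real(8), intent(out) :: scal_rhs(n)
  integer :: i, ierr

  ! 1) launch the scalar kernel as a deferred target task (nowait)
  !$omp target teams distribute parallel do nowait map(to: scal_old) map(from: scal_rhs)
  do i = 1, n
    scal_rhs(i) = 0.5d0 * scal_old(i)   ! placeholder for the actual RHS evaluation
  end do

  ! 2) meanwhile, the host thread runs the velocity all-to-all on device buffers
  !$omp target data use_device_addr(vel_send, vel_recv)
  call MPI_Alltoallv(vel_send, scounts, sdispls, MPI_DOUBLE_COMPLEX, &
                     vel_recv, rcounts, rdispls, MPI_DOUBLE_COMPLEX, comm, ierr)
  !$omp end target data

  ! 3) wait for the deferred scalar target task before anyone uses scal_rhs
  !$omp taskwait
end subroutine advance_overlapped
```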

@joneuhauser

I tried to overlap communication and computation; the code has been refactored to allow for it, but as of now it doesn't work - I'm in contact with NVIDIA and AMD folks about that.

But in the course of doing so, I was able to shrink the memory footprint significantly - a 251 x 250 x 257 case now requires 16 GB + 3 GB per scalar (+ 5 GB overhead if CHANNEL_OVERLAPPING is on).

When running 1023 x 500 x 512 on 4 ranks, I get weird integer overflow errors in kernels, which I haven't been able to comprehensively fix. It works with 8 ranks, where I get about 2.2 s per timestep (+ 0.7 s per scalar).

Right now it crashes inside the pack z-to-x kernel.
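
Not a diagnosis of this particular crash, just an illustration of one classic source of such overflows at this size: a single 1023 x 500 x 512 complex(8) field is already about 2.6e8 elements (~4.2 GB), so byte offsets, or element offsets into larger combined work arrays, can exceed huge(1) = 2147483647 when the index arithmetic is done in default 4-byte integers. A hypothetical 64-bit-safe flattened index would look like this:

```fortran
! Hypothetical helper (not code from this PR): flattened (ix,iy,iz) offset
! computed entirely in 64-bit integers so intermediate products cannot wrap.
pure function flat_index(ix, iy, iz, nx, ny) result(idx)
  use iso_fortran_env, only: int64
  implicit none
  integer, intent(in) :: ix, iy, iz, nx, ny
  integer(int64) :: idx
  idx = int(ix - 1, int64) &
      + int(nx, int64) * ( int(iy - 1, int64) &
      + int(ny, int64) *   int(iz - 1, int64) )
end function flat_index
```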