
Conversation

@joneuhauser commented Sep 9, 2025

Progress tracking:

  • Consistent formatting to keep the final diff small (and to preserve it going forward via pre-commit)
  • CMake as build tool
  • Compiles and runs on ifort (tested with version 19.0.1) and nvfortran (24.7)
  • Add an automated test asserting correct data after 2 timesteps (see the sketch after this list)
    • Gets built automatically with CMake
    • Run with ctest
    • Test automatically runs on 1 or 2 MPI ranks, tested locally with nvfortran, ifort and gfortran
  • Port FFT routines to cuFFT / hipFFT
    • Currently only cuFFT, but hipFFT shares almost the same API
  • Port remaining hot loops to OpenMP
  • Port the MPI communication
  • Profile
  • Remove remaining D/H communication
  • Less redundant memory allocations
  • Check multi-node performance (Hunter / Horeka)
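
For illustration, here is a minimal sketch of the kind of assertion such a regression test can make; the names (check_field, v, v_ref, tol) are hypothetical and not taken from this PR, which may implement the check differently. A non-zero exit code is all ctest needs to mark the test as failed.

```fortran
! Hypothetical regression check: compare a field against reference data after
! 2 timesteps and fail the test (non-zero exit) if the deviation is too large.
subroutine check_field(v, v_ref, tol, comm)
  use mpi
  implicit none
  real(8), intent(in) :: v(:,:,:), v_ref(:,:,:), tol
  integer, intent(in) :: comm
  real(8) :: err_local, err_global
  integer :: ierr, rank

  err_local = maxval(abs(v - v_ref))          ! local maximum deviation
  call MPI_Allreduce(err_local, err_global, 1, MPI_DOUBLE_PRECISION, MPI_MAX, comm, ierr)
  call MPI_Comm_rank(comm, rank, ierr)
  if (err_global > tol) then
    if (rank == 0) print *, 'FAIL: max error', err_global, '>', tol
    call MPI_Abort(comm, 1, ierr)             ! ctest sees the non-zero exit code
  else
    if (rank == 0) print *, 'PASS: max error', err_global
  end if
end subroutine check_field
```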

Helpful flags / tools

  • Run with NVCOMPILER_ACC_NOTIFY=3 to print all data movements and kernel launches
  • Run with compute-sanitizer to learn about out-of-bounds accesses.
  • Run with nsys profile --stats=true --trace=cuda,osrt,openmp mpirun -n 1 build/channel to get profiling statistics

Necessary code changes:

  • Added implicit none to all subroutines.
  • Struggled quite a bit with module-level variables used inside device functions. Tentative solution (see the sketch after this list):
    • The functions themselves need to be decorated with !$omp declare target(function-list). This needs to happen immediately before or after the function declaration.
    • Scalars and fixed-size arrays: these can be put into !$omp declare target(variable-list), which makes them available in device functions, but this does NOT create a mapping - target link is not supported by nvfortran yet. Instead, they need to be synchronized manually with !$omp target update to(variable-list).
    • Allocatable arrays: these cannot be treated like scalars, because at the point where we would declare target them they don't have a size yet (and imo the compiler should error out if you try that). Instead, pass them as arguments to the device function.
  • Vectorizing the linsolve subroutine was then straightforward, except for d2vmat and etamat, which had a race condition - this could be trivially solved by adding another axis to those arrays.
    • Profile whether this is faster or whether we should just declare them as private and only size them as d2vmat(ny0:nyN+2,-2:2). -> Turns out the compiler optimizes them away anyway.
  • Modified the order of the loops in buildrhs. In order to have iz and ix first, with iz as the inner loop, we have to pre-compute the convolutions for each iy; hence, enlarge the array. This also enables us to use a single FFT kernel for the entire (local) domain.
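
To make the declare-target bullets above concrete, here is a minimal, self-contained sketch of the pattern. All names (params, dealias, scale_rhs, alfa0, d14mat, ...) are made up for illustration and are not the actual variables of this code; only the directive placement follows the description above.

```fortran
! Hypothetical module, illustrating the pattern only (not code from this PR).
module params
  implicit none
  real(8) :: alfa0, beta0            ! module-level scalars used in device code
  real(8) :: d14mat(-2:2, 1:5)       ! fixed-size module-level array
  ! creates device copies of the scalars / fixed-size array; no automatic mapping,
  ! since "declare target link" is not supported by nvfortran yet
  !$omp declare target(alfa0, beta0, d14mat)
contains
  real(8) function dealias(k, kmax)
    !$omp declare target(dealias)    ! directly after the function declaration
    real(8), intent(in) :: k, kmax
    dealias = merge(1.d0, 0.d0, abs(k) <= 2.d0/3.d0*kmax)
  end function dealias
end module params

subroutine scale_rhs(rhs, ny0, nyN)
  use params
  implicit none
  integer, intent(in) :: ny0, nyN
  real(8), intent(inout) :: rhs(ny0:nyN)   ! allocatable on the caller side:
                                           ! passed as an argument instead of declare target
  integer :: iy
  ! mirror the current host values of the module variables to their device copies
  !$omp target update to(alfa0, beta0, d14mat)
  !$omp target teams distribute parallel do map(tofrom: rhs)
  do iy = ny0, nyN
    rhs(iy) = alfa0 * rhs(iy) * dealias(real(iy, 8), real(nyN, 8))
  end do
end subroutine scale_rhs
```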

@joneuhauser commented Sep 11, 2025

I have to stop here (might continue next week), but this is already working pretty well; the biggest remaining task is to change the way the global transposes work: they have to become alltoallv calls and use device buffers (see the sketch after this list).

  • Then, I've allocated quite a bit of memory, not all of which we probably need; find out where and get rid of it.
  • Also, there are some redundant computations in the FFT; need to check those. The indices of the FFT work arrays needed reordering - done; as expected, that made the FFTs about 30% faster.
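
A rough sketch of what the device-buffer alltoallv could look like with OpenMP offload and GPU-aware MPI; the buffer and count names are hypothetical and this is not the PR's actual transpose routine. It assumes the pack buffers already live on the device (mapped by an enclosing target data region) and that the MPI library is CUDA-/ROCm-aware; depending on compiler and OpenMP level, use_device_addr may have to be spelled use_device_ptr.

```fortran
! Hypothetical transpose communication on device buffers (sketch only).
subroutine transpose_alltoallv(sendbuf, recvbuf, nbuf, scounts, sdispls, rcounts, rdispls, comm)
  use mpi
  implicit none
  integer, intent(in) :: nbuf, comm
  integer, intent(in) :: scounts(:), sdispls(:), rcounts(:), rdispls(:)
  complex(8), intent(inout) :: sendbuf(nbuf), recvbuf(nbuf)   ! already packed on the device
  integer :: ierr

  ! hand MPI the *device* addresses of the (already mapped) buffers
  !$omp target data use_device_addr(sendbuf, recvbuf)
  call MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE_COMPLEX, &
                     recvbuf, rcounts, rdispls, MPI_DOUBLE_COMPLEX, comm, ierr)
  !$omp end target data
end subroutine transpose_alltoallv
```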

At the moment, GPU computations make up about 15% of the runtime, of which about 45% is linsolve and 30% cuFFT kernels.

@joneuhauser commented Sep 17, 2025

First profiling results (on istmhelios, 2x A5000 with 24 GB RAM each). Mesh: 95x192x189, 2 scalars, 5 timesteps.

| Category | 2-rank time (s) | 2-rank % | 1-rank time (s) | 1-rank % |
| --- | --- | --- | --- | --- |
| linsolve | 0.976864 | 24.76% | 2.095284 | 32.73% |
| FFTs | 1.601523 | 40.60% | 3.177510 | 49.64% |
| buildrhs | 0.158021 | 4.01% | 0.314110 | 4.91% |
| convolutions | 0.265173 | 6.72% | 0.531981 | 8.31% |
| x-z transpose kernels | 0.235830 | 5.98% | 0.282503 | 4.41% |
| MPI | 0.707607 | 17.94% | – | 0.00% |
| Total (5 steps) | 3.94 | 100% | 6.40 | 100% |
| Per step (avg) | 0.789 | | 1.280 | |

That's surprisingly nice - the MPI transposes only take about 20% of the runtime. (The times without scalars are roughly half that, btw.)

There's a lot of inefficient memory use at the moment (lots of full temporary arrays), so right now I can only fit Christian's Re_tau=500 case with half the x modes into memory (no scalars, 251 x 250 x 257 mesh), but at least that's decently fast (1.6 s per timestep on these relatively weak GPUs).

Let's do some profiling on MI300A / H100 next 🚀

@joneuhauser

Currently, the memory use for an Re_tau=500 case is 80 GB, and therefore about 640 GB for an Re_tau=1000 case. An MI300A node has 4 * 128 GB = 512 GB of HBM, so we need to trim the memory footprint a bit for it to fit (including scalars, we will need 2 nodes for sure).
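
For reference, the arithmetic behind these numbers, assuming the resolution roughly doubles in each of the three directions from Re_tau=500 to Re_tau=1000 (which is what the 80 GB -> 640 GB figure implies):

$$\mathrm{Mem}_{Re_\tau=1000} \approx 2^3 \cdot \mathrm{Mem}_{Re_\tau=500} = 8 \cdot 80\ \mathrm{GB} = 640\ \mathrm{GB} \;>\; 4 \cdot 128\ \mathrm{GB} = 512\ \mathrm{GB\ (one\ MI300A\ node)}$$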

But before we do any pointer juggling, I'd like @davecats to add Pr > 1 first ;)

@joneuhauser

Papers on the expected MPI performance: https://arxiv.org/pdf/2408.14090 (H100), https://arxiv.org/pdf/2508.11298 (MI300A)

@joneuhauser

Here's an up-to-date comparison for the sample problem size 251 x 250 x 257, no scalars, single rank (kernel performance).

So, TODOs in terms of kernel optimisation, @KarlVieths:

  • NVIDIA: convolution kernels
  • AMD: buildrhs
| Kernel group | NVIDIA H100 (ms) | MI300A crayftn (ms) | MI300A amdfort 7.1.0 (ms) |
| --- | --- | --- | --- |
| Transpose packing | 71.0 | 48.08 | 52.1 |
| linsolve | 212.9 | 290.3 | 192.9 |
| buildrhs | 41.3 | 142.6 | 73.8 |
| FFTs | 68.7 | 69.7 | 76.1 |
| Convolution kernels | 149.0 | 61.3 | 74.2 |
| Total | 542.9 | 612.0 | 469.0 |

@joneuhauser commented Sep 25, 2025

Scaling using crayftn on Hunter, this time using GPU-to-GPU transfers, averaged over 40 timesteps:

```
mpirun -ppn 4 -np 4 --cpu-bind list:0-23:24-47:58-71:72-95 --gpu-bind list:0:1:2:3 rocprofv3 --sys-trace --marker-trace --kernel-trace --hsa-trace --output-format csv -- build/channel
```

| 251 x 250 x 257, no scalars | 1 rank (ms) | 2 ranks (ms) | 4 ranks (ms) |
| --- | --- | --- | --- |
| MPI comm | – | 92.2 | 84.4 |
| Transpose packing | 52.2 | 35.9 | 16.8 |
| linsolve | 194.4 | 91.8 | 31.8 |
| buildrhs | 122.5 | 62.6 | 103.5 |
| FFTs | 75.1 | 35.4 | 17.8 |
| convolution kernels | 68.3 | 29.8 | 19.1 |
| Total | 512.5 | 347.7 | 273.6 |

Scaling of buildrhs is problematic.

And for a bigger problem (the local problem size is roughly twice that of the 1-rank case above):

| 511 x 500 x 512, no scalars | 4 ranks (ms) |
| --- | --- |
| MPI comm | 487.6 |
| Transpose packing | 114.1 |
| linsolve | 392.8 |
| buildrhs | 240.1 |
| FFTs | 120.2 |
| convolution kernels | 108.3 |
| Other | 23.1 |
| Total | 1486.2 |

@joneuhauser

I'll summarize remaining steps here:

  • Intra-node MPI transfers on NVIDIA (Horeka / UC3)
  • y-parallelisation: allows smaller collectives for the alltoalls, so we can keep that traffic intra-node. In the y-direction, we only have to exchange boundary information for the pentadiagonal solve.
  • Separate out kernels so we can solve velocity and scalars one after the other, reducing the memory footprint and allowing communication to overlap with computation (while we communicate velocity data, we can already start computing the scalar data, etc.); a rough sketch of the intended pattern follows below.
  • Then switch to scratch arrays allocated directly on the device and reused, to reduce the memory footprint further.
  • Benchmarking, optimisation, scaling tests
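
A very rough sketch of the intended comm/compute overlap, with entirely hypothetical names (advance_overlapped, scal_old, vel_send, ...) and a placeholder kernel body. It only illustrates the nowait target-task pattern, assuming GPU-aware MPI and buffers that are already mapped to the device; as the next comment notes, this does not yet work reliably with the current compilers.

```fortran
! Hypothetical sketch: compute the scalar RHS on the GPU while the velocity
! all-to-all runs on the host thread, then join the deferred device task.
subroutine advance_overlapped(vel_send, vel_recv, nbuf, scal_old, scal_rhs, n, &
                              scounts, sdispls, rcounts, rdispls, comm)
  use mpi
  implicit none
  integer, intent(in) :: nbuf, n, comm
  integer, intent(in) :: scounts(:), sdispls(:), rcounts(:), rdispls(:)
  complex(8), intent(inout) :: vel_send(nbuf), vel_recv(nbuf)  ! mapped to the device elsewhere
  real(8), intent(in)  :: scal_old(n)
  real(8), intent(out) :: scal_rhs(n)
  integer :: i, ierr

  ! 1) launch the scalar kernel as a deferred target task (nowait)
  !$omp target teams distribute parallel do nowait map(to: scal_old) map(from: scal_rhs)
  do i = 1, n
    scal_rhs(i) = 0.5d0 * scal_old(i)   ! placeholder for the actual RHS evaluation
  end do

  ! 2) meanwhile, the host thread runs the velocity all-to-all on device buffers
  !$omp target data use_device_addr(vel_send, vel_recv)
  call MPI_Alltoallv(vel_send, scounts, sdispls, MPI_DOUBLE_COMPLEX, &
                     vel_recv, rcounts, rdispls, MPI_DOUBLE_COMPLEX, comm, ierr)
  !$omp end target data

  ! 3) wait for the deferred scalar target task before anyone uses scal_rhs
  !$omp taskwait
end subroutine advance_overlapped
```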

@joneuhauser

I tried to overlap communication and computation; the code has been refactored to allow for it, but as of now it doesn't work - I'm in contact with NVIDIA and AMD folks about that.

But in the course of doing so, I was able to shrink the memory footprint significantly - a 251 x 250 x 257 case now requires 16 GB + 3 GB per scalar (+ 5 GB overhead if CHANNEL_OVERLAPPING is on).

When running 1023 x 500 x 512 on 4 ranks, I get weird integer overflow errors in kernels, which I haven't been able to comprehensively fix. It works with 8 ranks, where I get about 2.2 s per timestep (+ 0.7 s per scalar).

Right now it crashes inside the pack z-to-x kernel.
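
Not a diagnosis of this particular crash, just an illustration of one classic source of such overflows at this size: a single 1023 x 500 x 512 complex(8) field is already about 2.6e8 elements (~4.2 GB), so byte offsets, or element offsets into larger combined work arrays, can exceed huge(1) = 2147483647 when the index arithmetic is done in default 4-byte integers. A hypothetical 64-bit-safe flattened index would look like this:

```fortran
! Hypothetical helper (not code from this PR): flattened (ix,iy,iz) offset
! computed entirely in 64-bit integers so intermediate products cannot wrap.
pure function flat_index(ix, iy, iz, nx, ny) result(idx)
  use iso_fortran_env, only: int64
  implicit none
  integer, intent(in) :: ix, iy, iz, nx, ny
  integer(int64) :: idx
  idx = int(ix - 1, int64) &
      + int(nx, int64) * ( int(iy - 1, int64) &
      + int(ny, int64) *   int(iz - 1, int64) )
end function flat_index
```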