Description
In ORB5 we have a large kernel that is being executed too many times when compiled with AMD's flang with optimization levels O2 or O3.
Unfortunately I do not have a small reproducer for this one, but I'll try to describe the issue and maybe we can figure something out.
If I have a loop from 1 to n that is parallelized with an !$omp target teams distribute parallel do directive, I expect the loop body to be executed n times. If I compile the subroutine below and call it from a program, that is indeed what happens: 1 is printed n times.
subroutine s(n)
integer :: n
integer :: i
integer, dimension(:), allocatable :: debug_counter
allocate(debug_counter(n))
debug_counter = 0
!$omp target teams distribute parallel do
do i=1, n
debug_counter(i) = debug_counter(i) + 1
end do
!$omp end target teams distribute parallel do
print *, debug_counter
end subroutine s
Output:
1 1 1 1 1 1 1 1 1 1 1 1 1...
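For longer arrays it can be easier to check the counter programmatically than to read the printed values. The following is only a minimal sketch of such a check, not code from ORB5: the program name, the size n = 1000 and the explicit map(tofrom: ...) clause are my own assumptions.

```fortran
program check_counter
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  integer :: i
  integer, dimension(:), allocatable :: debug_counter

  allocate(debug_counter(n))
  debug_counter = 0

  ! Same loop as in subroutine s; the explicit map clause is an assumption
  ! added here so the data movement is spelled out.
  !$omp target teams distribute parallel do map(tofrom: debug_counter)
  do i = 1, n
    debug_counter(i) = debug_counter(i) + 1
  end do
  !$omp end target teams distribute parallel do

  ! Every element should be exactly 1 if each iteration ran once.
  if (all(debug_counter == 1)) then
    print *, 'OK: every iteration ran exactly once'
  else
    print *, 'FAIL: iterations with wrong count =', count(debug_counter /= 1), &
             ' max count =', maxval(debug_counter)
  end if
end program check_counter
```

With a correct build this should take the OK branch. The FAIL branch reports how many iterations were affected and the largest repeat count.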
The ORB5 kernel is also just one loop with a target teams distribute parallel do, but it's quite a large one, with many thread-private variables and function calls. It's actually the largest kernel in the program. Atomics are also used in it, but I'm not aware of any programming language features used here that aren't also used in other kernels that work fine.
The issue is that the kernel takes too long to run, and any results it computes by adding to a previous value, like the counter in the example above, are wrong. If I add a debug_counter to the kernel, instead of 1 1 1 1 1..., I get:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 32 32 32 32 32 32 32 32 32 32...
The counter values match the factor by which the results are wrong. The loop body seems to be executed one time with i=1, two times with i=2 and so on until 32 and then the loop body is executed 32 times for the rest of the values.
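The observed pattern can be written as count(i) = min(i, 32). The sketch below compares a counter array against that formula and reports the total number of loop-body executions. It uses synthetic data in place of the real kernel output; the program name, the array length n = 100 and the variable names are hypothetical.

```fortran
program classify_counter
  implicit none
  integer, parameter :: n = 100     ! hypothetical array length for the demo
  integer, parameter :: cap = 32    ! saturation value seen in the output above
  integer :: i
  integer :: observed(n), pattern(n)

  ! Synthetic stand-in for the debug_counter copied back from the kernel.
  observed = [(min(i, cap), i = 1, n)]

  ! Over-execution pattern described above: min(i, 32).
  pattern = [(min(i, cap), i = 1, n)]

  if (all(observed == 1)) then
    print *, 'each iteration ran exactly once'
  else if (all(observed == pattern)) then
    print *, 'counter matches the min(i, 32) pattern'
  else
    print *, 'other pattern; iterations with wrong count =', count(observed /= 1)
  end if

  ! Total loop-body executions versus the n that were expected.
  print *, 'executions:', sum(observed), ' expected:', n
end program classify_counter
```

In a real test, observed would be replaced by the counter read back from the device after the kernel has run.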
This seems to only happen with AMD's fork, not with upstream flang.
It happens only with optimization levels O2 and O3. More precisely, it seems to happen only when the openmp-opt-cgscc LLVM pass runs. The kernel also works at O2 and O3 if compiled with -mllvm -openmp-opt-disable.
Tested with ROCm versions 6.3.4 and 7.1, on MI250X and RX 7900 GRE GPUs.
Tested with amd-staging at 7920fe9
Upstream at a73bdba