
[flang][openmp] loop body is executed too many times #1340

@VeeEM

In ORB5 we have a large kernel whose loop body is executed too many times when compiled with AMD's flang at optimization level -O2 or -O3.

Unfortunately I do not have a small reproducer for this one, but I'll try to describe the issue and maybe we can figure something out.

So if I have a loop from 1 to n that is parallelized using an !$omp target teams distribute parallel do directive, I expect the loop body to be executed n times. If I compile the subroutine below and call it in a program, that is indeed what happens: 1 is printed n times.

subroutine s(n)
  implicit none
  integer, intent(in) :: n
  integer :: i
  integer, dimension(:), allocatable :: debug_counter

  allocate(debug_counter(n))

  ! Every element starts at 0 and should end at exactly 1.
  debug_counter = 0
  !$omp target teams distribute parallel do
  do i = 1, n   ! loop over the allocated extent, 1 to n
    debug_counter(i) = debug_counter(i) + 1
  end do
  !$omp end target teams distribute parallel do

  print *, debug_counter

end subroutine s

Output:

 1 1 1 1 1 1 1 1 1 1 1 1 1...
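The increment in the example is a plain read-modify-write. If an iteration were ever executed more than once at the same time, that update could itself race and under-report the count. A minimal sketch of a counting loop that uses an atomic update instead is shown here. The subroutine name `count_iterations`, the caller-owned array, and the explicit `map` clause are assumptions made for illustration, not part of the ORB5 code.

```fortran
! Hypothetical helper: counts how often each iteration i runs.
! The caller owns debug_counter and must zero it before the call.
subroutine count_iterations(n, debug_counter)
  implicit none
  integer, intent(in) :: n
  integer, intent(inout) :: debug_counter(n)
  integer :: i

  !$omp target teams distribute parallel do map(tofrom: debug_counter)
  do i = 1, n
    ! Atomic update so that concurrent duplicate executions of the
    ! same iteration are all counted.
    !$omp atomic update
    debug_counter(i) = debug_counter(i) + 1
  end do
  !$omp end target teams distribute parallel do
end subroutine count_iterations
```

When the loop behaves correctly, every element of a zero-initialized array passed to this routine should come back as 1.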

The ORB5 kernel is also just one loop with a target teams distribute parallel do, but it's quite a large one, with many thread-private variables and function calls. It's actually the largest kernel in the program. There are also atomics used in it, but I'm not aware of any programming language features used here that aren't also used in other kernels that work fine.

The issue is that the kernel takes too long to run, and any results it computes by adding to a previous value, like the counter in the example above, come out wrong. If I add a debug_counter to the kernel, instead of 1 1 1 1 1..., I get:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
 30 31 32 32 32 32 32 32 32 32 32 32 32...

The counter values match the factor by which the results are wrong. The loop body seems to be executed once with i=1, twice with i=2, and so on up to 32; for all remaining values of i the loop body is executed 32 times.
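For larger n it is easier to check the counter programmatically than to read the printed array. The following is a sketch only: the module and function names are hypothetical, and the `width` argument is just the saturation value observed above (32), with no claim about where that number comes from.

```fortran
module counter_check
  implicit none
contains

  ! Index of the first element that is not 1, or 0 if every
  ! iteration ran exactly once.
  pure function first_bad_index(counter) result(idx)
    integer, intent(in) :: counter(:)
    integer :: idx
    integer :: i
    idx = 0
    do i = 1, size(counter)
      if (counter(i) /= 1) then
        idx = i
        return
      end if
    end do
  end function first_bad_index

  ! True if counter(i) == min(i, width) for every i, i.e. the
  ! ramp-then-plateau pattern described above.
  pure function matches_ramp(counter, width) result(ok)
    integer, intent(in) :: counter(:)
    integer, intent(in) :: width
    logical :: ok
    integer :: i
    ok = .true.
    do i = 1, size(counter)
      if (counter(i) /= min(i, width)) then
        ok = .false.
        return
      end if
    end do
  end function matches_ramp

end module counter_check
```

Calling `first_bad_index(debug_counter)` after the kernel would return 0 for the small example and 2 for the output above, and `matches_ramp(debug_counter, 32)` would confirm whether the whole array follows the observed pattern.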

This seems to only happen with AMD's fork, not with upstream flang.

It happens only at optimization levels -O2 and -O3. Actually, it seems to happen only when the openmp-opt-cgscc LLVM pass runs: the kernel also works at -O2 and -O3 if compiled with -mllvm -openmp-opt-disable.

Tested with ROCm versions 6.3.4 and 7.1, on MI250X and RX 7900 GRE GPUs.

Tested with amd-staging at 7920fe9

Upstream at a73bdba
