[flang][openmp] loop body is executed too many times #1340

In ORB5 we have a large kernel whose loop body is executed too many times when compiled with AMD's flang at optimization levels `-O2` or `-O3`.

Unfortunately I do not have a small reproducer for this one, but I'll try to describe the issue and maybe we can figure something out.

If I have a loop from 1 to `n` that is parallelized using an `!$omp target teams distribute parallel do` directive, I expect the loop body to be executed `n` times. If I compile the subroutine below and call it from a program, that is indeed what happens: `1` is printed `n` times.

```
subroutine s(n)
integer :: n
integer :: i
integer, dimension(:), allocatable :: debug_counter

allocate(debug_counter(n))

debug_counter = 0
!$omp target teams distribute parallel do
do i=1, n
  debug_counter(i) = debug_counter(i) + 1
end do
!$omp end target teams distribute parallel do

print *, debug_counter

end subroutine s
```
Output:
```
 1 1 1 1 1 1 1 1 1 1 1 1 1...
```

The ORB5 kernel is also just one loop with a `target teams distribute parallel do`, but it is quite a large one, with many thread-private variables and function calls. It is actually the largest kernel in the program. Atomics are also used in it, but I'm not aware of any language features used here that aren't also used in other kernels that work fine.

The issue is that the kernel takes too long to run, and any results it computes by adding to a previous value, like the counter in the example above, are wrong. If I add a `debug_counter` to the kernel, instead of `1 1 1 1 1...`, I get:

```
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
 30 31 32 32 32 32 32 32 32 32 32 32 32...
``` 

The counter values match the factor by which the results are wrong. The loop body seems to be executed once for i=1, twice for i=2, and so on up to i=32; for all remaining values of i it is executed 32 times.
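
Put differently, the observed counters follow `debug_counter(i) == min(i, 32)`. Since printing the whole array gets unwieldy for large `n`, a host-side check along the lines of the sketch below could be called after the kernel instead. This is only an illustration: `check_counter` and the program `check_demo` are hypothetical names, not part of ORB5, and the array here is filled with synthetic data that mimics the reported pattern rather than with output from the real kernel.

```
program check_demo
  implicit none
  integer, parameter :: n = 40   ! assumed size, only for this demo
  integer :: counter(n)
  integer :: i

  ! Synthetic data mimicking the reported pattern (not real kernel output)
  counter = [(min(i, 32), i = 1, n)]

  call check_counter(counter, n)

contains

  ! Hypothetical helper: summarizes how often the loop body ran per index
  subroutine check_counter(debug_counter, n)
    integer, intent(in) :: n
    integer, intent(in) :: debug_counter(n)
    integer :: i, n_bad, n_match

    n_bad = 0     ! entries where the body did not run exactly once
    n_match = 0   ! entries that follow the observed min(i, 32) pattern
    do i = 1, n
      if (debug_counter(i) /= 1) n_bad = n_bad + 1
      if (debug_counter(i) == min(i, 32)) n_match = n_match + 1
    end do

    print *, 'entries /= 1:          ', n_bad, ' of ', n
    print *, 'entries == min(i, 32): ', n_match, ' of ', n
  end subroutine check_counter

end program check_demo
```

In a correct run the first count should be zero; in the failing case described here, every entry would be expected to match the `min(i, 32)` pattern.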

This seems to only happen with AMD's fork, not with upstream flang.

It happens only at optimization levels `-O2` and `-O3`. More precisely, it seems to happen only when the `openmp-opt-cgscc` LLVM pass runs: the kernel also works at `-O2` and `-O3` if compiled with `-mllvm -openmp-opt-disable`.

Tested with ROCm versions 6.3.4 and 7.1 on MI250X and RX 7900 GRE GPUs.

Tested with amd-staging at 7920fe9a3d750863155389bc5ec6fccb0b066f21

Upstream at a73bdba2e80c6cff91ec135b5502909b14934d68

