
Broadcasts to Shared Memory on GPU Runs in Serial #1284

@computablee


Describe the bug
When running code on a GPU, if you broadcast a value to a block of shared memory, the generated CUDA performs the assignment serially in every thread. In this reproducer the performance and behavior happen to be identical either way, but I've encountered a case where this bug led to code that was both slower and incorrect.

To Reproduce
Steps to reproduce the behavior:

import dace
from dace.sdfg import nodes

# Symbolic sizes; the generated CUDA below corresponds to I = J = 1024, I1 = J1 = 32
I, J, I1, J1 = (dace.symbol(s) for s in ('I', 'J', 'I1', 'J1'))

# Helper as used in the DaCe samples: returns the first map entry using a parameter
def find_map_by_param(sdfg, pname):
    return next(n for n, _ in sdfg.all_nodes_recursive()
                if isinstance(n, nodes.MapEntry) and pname in n.map.params)

@dace.program
def reproduce():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]

    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg(I=1024, J=1024, I1=32, J1=32))

This code outputs the following CUDA for smem[:] = 0.0:

__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;

                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////

                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
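For what it's worth, the scheduling bug does not change the result in this particular reproducer; a plain NumPy sketch of the program's semantics (sizes assumed from the generated CUDA: smem[1024], 32x32 tiles) shows why the output is always 0.0:

```python
import numpy as np

# Sizes assumed from the generated CUDA (smem[1024] = 32 * 32)
I = J = 1024
I1 = J1 = 32

value = np.float32(0)
for i in range(0, I, I1):
    for j in range(0, J, J1):
        smem = np.zeros((I1, J1), dtype=np.float32)  # the broadcast smem[:] = 0.0
        value += smem.sum(dtype=np.float32)          # the inner i1/j1 reduction map
# value == 0.0 regardless of how the broadcast is scheduled
```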

Expected behavior
Since the outermost map is a GPU_Device map, the broadcast should be lowered to a GPU_ThreadBlock map rather than a serial loop. Using the following Python code:

@dace.program
def correct():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        
        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    
    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg(I=1024, J=1024, I1=32, J1=32))

The correct CUDA is generated:

 __shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;

                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////

                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}
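Beyond the correctness risk, the serial version also wastes work: the broadcast loop sits outside the thread-block map, so (assuming a full 32x32 thread block, matching the schedule above) every thread executes all 1024 stores instead of one. A rough count:

```python
block_threads = 32 * 32   # threads per block (GPU_ThreadBlock map over 32x32)
tile_elems = 32 * 32      # elements in smem[1024]

serial_stores = block_threads * tile_elems  # each thread runs the full __i0/__i1 loop
parallel_stores = tile_elems                # one store per thread in the correct version
redundancy = serial_stores // parallel_stores
# redundancy == 1024: the serial lowering issues 1024x the necessary shared-memory stores
```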


Desktop (please complete the following information):

  • OS: RHEL 7.9
  • DaCe Version 0.14.2
