
Broadcasts to Shared Memory on GPU Runs in Serial #1284

@computablee


Describe the bug
When running code on a GPU, if you broadcast a value to a block of shared memory, the generated CUDA performs the assignment serially in every thread. In this reproducer the performance and behavior happen to be identical either way, but I've encountered a case where this bug led to code that was both slower and incorrect.

To Reproduce
Steps to reproduce the behavior:

import dace
from dace.sdfg import nodes

# Symbolic sizes; the generated CUDA below corresponds to I = J = 1024, I1 = J1 = 32
I, J, I1, J1 = (dace.symbol(s) for s in ('I', 'J', 'I1', 'J1'))

# Helper as used in the DaCe samples: returns the first map entry using a parameter
def find_map_by_param(sdfg, pname):
    return next(n for n, _ in sdfg.all_nodes_recursive()
                if isinstance(n, nodes.MapEntry) and pname in n.map.params)

@dace.program
def reproduce():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]

    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg(I=1024, J=1024, I1=32, J1=32))

This code outputs the following CUDA for smem[:] = 0.0:

__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;

                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////

                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
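For what it's worth, the scheduling bug does not change the result in this particular reproducer; a plain NumPy sketch of the program's semantics (sizes assumed from the generated CUDA: smem[1024], 32x32 tiles) shows why the output is always 0.0:

```python
import numpy as np

# Sizes assumed from the generated CUDA (smem[1024] = 32 * 32)
I = J = 1024
I1 = J1 = 32

value = np.float32(0)
for i in range(0, I, I1):
    for j in range(0, J, J1):
        smem = np.zeros((I1, J1), dtype=np.float32)  # the broadcast smem[:] = 0.0
        value += smem.sum(dtype=np.float32)          # the inner i1/j1 reduction map
# value == 0.0 regardless of how the broadcast is scheduled
```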

Expected behavior
Since the outermost map is a GPU_Device map, the broadcast should be lowered to a GPU_ThreadBlock map rather than a serial loop. Using the following Python code:

@dace.program
def correct():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        
        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    
    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg(I=1024, J=1024, I1=32, J1=32))

The correct CUDA is generated:

 __shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;

                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////

                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}
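Beyond the correctness risk, the serial version also wastes work: the broadcast loop sits outside the thread-block map, so (assuming a full 32x32 thread block, matching the schedule above) every thread executes all 1024 stores instead of one. A rough count:

```python
block_threads = 32 * 32   # threads per block (GPU_ThreadBlock map over 32x32)
tile_elems = 32 * 32      # elements in smem[1024]

serial_stores = block_threads * tile_elems  # each thread runs the full __i0/__i1 loop
parallel_stores = tile_elems                # one store per thread in the correct version
redundancy = serial_stores // parallel_stores
# redundancy == 1024: the serial lowering issues 1024x the necessary shared-memory stores
```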


Desktop (please complete the following information):

  • OS: RHEL 7.9
  • DaCe Version 0.14.2
