Describe the bug
When running code on a GPU, if you have a block of shared memory and broadcast a value to it, the generated CUDA performs the assignment serially. In this reproducer the performance and behavior happen to be identical, but I have encountered a case where this bug led to slower code that was also incorrect.
To Reproduce
Steps to reproduce the behavior:
import dace

# Sizes are not shown in the original report; assumed from the 32x32
# (1024-element) shared tile in the generated CUDA below.
I, J, I1, J1 = 1024, 1024, 32, 32

def find_map_by_param(sdfg, pname):
    # Helper (as in the DaCe tutorials): first map entry using this parameter.
    return next(n for n, _ in sdfg.all_nodes_recursive()
                if isinstance(n, dace.nodes.MapEntry) and pname in n.map.params)

@dace.program
def reproduce():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
This code outputs the following CUDA for the broadcast smem[:] = 0.0:
__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;
                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////
                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
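For reference, the kernel source above can be dumped directly instead of digging through the .dacecache build folder. This is only a sketch: the generate_code()/clean_code/language attributes are taken from DaCe's code-generation API and may differ between versions.

# Sketch: inspect the generated CUDA and look for the serial initialization loop.
code_objects = sdfg.generate_code()
cuda_src = next(c.clean_code for c in code_objects if c.language == 'cu')
print('serial init loop present:', 'for (auto __i0' in cuda_src)  # prints True here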
Expected behavior
Since the outermost loop is a GPU_Device loop, the broadcast should be emitted as a GPU_ThreadBlock loop. Using the following Python code:
@dace.program
def correct():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
The correct CUDA is generated:
__shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;
                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////
                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}
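Running the same inspection on this SDFG (again assuming the generate_code()/clean_code attributes mentioned above) can serve as a quick regression check that the serial __i0/__i1 loops around the shared-memory assignment are gone:

cuda_src = next(c.clean_code for c in sdfg.generate_code() if c.language == 'cu')
# Expected: False, matching the generated kernel shown above.
print('serial init loop present:', 'for (auto __i0' in cuda_src)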
Desktop:
- OS: RHEL 7.9
- DaCe version: 0.14.2