Add ThreadBlock Maps as Preprocessing #2048

Draft · wants to merge 8 commits into main
Conversation

ThrudPrimrose (Collaborator)

I am testing how many existing test cases would break if we added a "thread block" map to every GPU kernel that does not already have one, in order to simplify code generation by removing case distinctions. [WIP]
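As a rough illustration of the idea (a sketch only, not the code in this PR; the hard-coded default block size, the use of MapTiling, and the assumption that MapTiling leaves the original map on the inside are mine):

import dace
from dace import dtypes
from dace.sdfg import nodes
from dace.transformation.dataflow import MapTiling

DEFAULT_BLOCK = (32, 1, 1)  # assumed here; DaCe normally reads this from the config

def add_missing_threadblock_maps(sdfg: dace.SDFG):
    """Wrap every GPU_Device map that has no nested thread-block map (sketch)."""
    targets = []
    for sd in sdfg.all_sdfgs_recursive():
        for state in sd.states():
            for node in state.nodes():
                if not (isinstance(node, nodes.MapEntry)
                        and node.map.schedule == dtypes.ScheduleType.GPU_Device):
                    continue
                scope = state.scope_subgraph(node, include_entry=False, include_exit=False)
                # NOTE: a proper check should also recurse into nested SDFGs.
                has_tb = any(isinstance(n, nodes.MapEntry)
                             and n.map.schedule == dtypes.ScheduleType.GPU_ThreadBlock
                             for n in scope.nodes())
                if not has_tb:
                    targets.append((sd, state, node))

    for sd, state, node in targets:
        # Tile the device map with the default block size; the original map becomes
        # the inner map, which is then scheduled as the thread-block map.
        MapTiling.apply_to(sd, options=dict(tile_sizes=DEFAULT_BLOCK[:len(node.map.range)]),
                           map_entry=node)
        state.entry_node(node).map.schedule = dtypes.ScheduleType.GPU_Device
        node.map.schedule = dtypes.ScheduleType.GPU_ThreadBlock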

@tbennun (Collaborator) commented Jun 16, 2025

Excellent PR, thank you for doing it! Please add a comment in #2036 to mention that TB maps need to be added as a pass.
Also, I would use the Pass/Pipeline API (specifically passes) to apply the transformation effectively, but this one should be fine too.

Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. Please check cuda.py (or infer_types.py) for how to use the existing functionality.
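For illustration, a recursive check could look roughly like this (a sketch only; the existing helpers in cuda.py / infer_types.py may differ in name and behaviour):

from dace import dtypes
from dace.sdfg import nodes

TB_SCHEDULES = (dtypes.ScheduleType.GPU_ThreadBlock,
                dtypes.ScheduleType.GPU_ThreadBlock_Dynamic)

def has_nested_tb_map(state, map_entry) -> bool:
    """True if the scope of map_entry, including nested SDFGs, contains a thread-block map."""
    scope = state.scope_subgraph(map_entry, include_entry=False, include_exit=False)
    for node in scope.nodes():
        if isinstance(node, nodes.MapEntry) and node.map.schedule in TB_SCHEDULES:
            return True
        if isinstance(node, nodes.NestedSDFG):
            # scope_subgraph does not descend into nested SDFGs, so recurse manually.
            for nstate in node.sdfg.states():
                for n in nstate.nodes():
                    if isinstance(n, nodes.MapEntry) and n.map.schedule in TB_SCHEDULES:
                        return True
    return False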

@tbennun tbennun changed the title [WIP] Add ThreadBlock Maps as Preprocessing Add ThreadBlock Maps as Preprocessing Jun 16, 2025
@tbennun tbennun marked this pull request as draft June 16, 2025 00:46
@ThrudPrimrose (Collaborator, Author)

> Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. […]

Yes, that is probably it. I used a variation of this transformation for the auto-tiling passes, so the transformation should already exist. We could also add a pass version of the same transformation so that it aligns in style with the other preprocessing passes; then we would have both.
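A rough sketch of what such a pass version could look like (the class name and structure are assumptions; only the Pass API from dace.transformation.pass_pipeline is used, and the actual rewriting logic is elided):

from typing import Any, Dict, Optional
import dace
from dace.transformation import pass_pipeline as ppl

class AddThreadBlockMaps(ppl.Pass):
    """Adds a thread-block map to GPU_Device maps that lack one (sketch)."""

    def modifies(self) -> ppl.Modifies:
        return ppl.Modifies.Scopes

    def should_reapply(self, modified: ppl.Modifies) -> bool:
        return bool(modified & ppl.Modifies.Scopes)

    def apply_pass(self, sdfg: dace.SDFG, pipeline_results: Dict[str, Any]) -> Optional[int]:
        applied = 0
        # ...find GPU_Device maps without a nested thread-block map (see the
        # recursive check above) and wrap them, e.g. via MapTiling...
        return applied or None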

@ThrudPrimrose (Collaborator, Author) commented Jun 16, 2025


I have a new question. I updated the code to check the gpu_block_size variable to see whether the user set the block size through that feature. Now the question is:

    def test_highdim_default_block_size():
    
        @dace.program
        def tester(a: dace.float64[1024, 1024] @ dace.StorageType.GPU_Global):
            for i, j in dace.map[0:1024, 0:1024] @ dace.ScheduleType.GPU_Device:
                a[i, j] = 1
    
        with dace.config.set_temporary('compiler', 'cuda', 'default_block_size', value='32, 8, 2'):
            with pytest.warns(UserWarning, match='has more dimensions'):
                sdfg = tester.to_sdfg()
                gpu_code = sdfg.generate_code()[1]
>               assert 'dim3(32, 16, 1)' in gpu_code.code
E               assert 'dim3(32, 16, 1)' in '\n#include <cuda_runtime.h>\n#include <dace/dace.h>\n\n                                                              ...ms[0]);    ////__DACE:0:0:4\n    DACE_KERNEL_LAUNCH_CHECK(__err, "KernelEntryMap_0_0_4", 32, 128, 1, 32, 8, 1);\n}\n\n'
E                +  where '\n#include <cuda_runtime.h>\n#include <dace/dace.h>\n\n                                                              ...ms[0]);    ////__DACE:0:0:4\n    DACE_KERNEL_LAUNCH_CHECK(__err, "KernelEntryMap_0_0_4", 32, 128, 1, 32, 8, 1);\n}\n\n' = <dace.codegen.codeobject.CodeObject object at 0x7a897435c680>.code

If we have a 2D map [0:N, 0:N] and pass the block size [32, 8, 2], I would expect the generated kernel to be launched with [32, 8]. The previous behaviour is to collapse the extra dimensions; should that be kept? I would actually raise an error here because, IMHO, if a user passes this, they may have made a mistake somewhere, and implicitly hiding errors is probably not a good idea.

But of course, this is the default value. I think the correct behaviour for this pass is to collapse when the default value from the config is used, but if the value was explicitly set by the user, we should check that its dimensions match the map dimensions (if the map is 1D, the thread block should be at most 1D as well).
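Roughly what I have in mind (a sketch; the function, its arguments, and the user_specified flag are placeholders rather than the actual codegen code):

import warnings

def resolve_block_size(block_size, map_ndim, user_specified):
    """Collapse or reject extra block dimensions for a map with map_ndim dimensions."""
    block = list(block_size) + [1] * (3 - len(block_size))
    extra_dims = [d for d in range(map_ndim, 3) if block[d] != 1]
    if extra_dims and user_specified:
        # An explicit gpu_block_size with more dimensions than the map is likely
        # a user mistake, so fail loudly instead of silently collapsing.
        raise ValueError(f'gpu_block_size {block_size} has more dimensions than '
                         f'the {map_ndim}-dimensional map')
    if extra_dims:
        # Default block size from the config: collapse trailing dimensions into
        # the last map dimension and warn.
        for d in range(2, map_ndim - 1, -1):
            block[d - 1] *= block[d]
            block[d] = 1
        warnings.warn(f'Default block size {tuple(block_size)} has more dimensions '
                      f'than the map; collapsed to {tuple(block)}')
    return tuple(block)

For example, resolve_block_size((32, 8, 2), 2, user_specified=False) warns and returns (32, 16, 1), while user_specified=True raises instead.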

@alexnick83 (Contributor) left a review comment:

LGTM

print_warning = True
self.thread_block_size_y = 1
self.thread_block_size_z = 1
new_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
Comment on the snippet above (Contributor):

I am thinking that it would be simpler to write something similar to:

old_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
new_block = list(old_block)
# Fold block dimensions beyond the map's dimensionality into the last map dimension
for d in range(2, num_dims_in_map - 1, -1):
    new_block[d - 1] *= new_block[d]
    new_block[d] = 1
new_block = tuple(new_block)
if new_block != old_block:
    warnings.warn(...)

It may not be immediately as readable as the current code, though, so this is just a suggestion.
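With the failing test above as an example, old_block = (32, 8, 2) and a 2-dimensional map would fold to (32, 16, 1), which matches the dim3(32, 16, 1) the test expects.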

@alexnick83 (Contributor)

> If we have a 2D map [0:N, 0:N] and pass the block size [32, 8, 2], I would expect the generated kernel to be launched with [32, 8]. The previous behaviour is to collapse the extra dimensions; should that be kept? […]

I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. I also think that we should consider revisiting how thread-blocks are matched to map dimensions in light of new CUDA developments.

@ThrudPrimrose (Collaborator, Author)

> I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. […]

I think collapsing is also fine. I think the remaining issues are related to GPU_Persistent maps (and the GPU_Device maps within them).
