Add ThreadBlock Maps as Preprocessing #2048

Draft · wants to merge 8 commits into main
Conversation

ThrudPrimrose (Collaborator)

I am testing how many existing test cases would break if we added a "thread block" map to every GPU kernel that does not already have one, in order to simplify code generation by removing case distinctions. [WIP]
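As a rough illustration of the idea (a sketch only, not the code in this PR; the hard-coded default block size, the use of MapTiling, and the assumption that MapTiling leaves the original map on the inside are mine):

import dace
from dace import dtypes
from dace.sdfg import nodes
from dace.transformation.dataflow import MapTiling

DEFAULT_BLOCK = (32, 1, 1)  # assumed here; DaCe normally reads this from the config

def add_missing_threadblock_maps(sdfg: dace.SDFG):
    """Wrap every GPU_Device map that has no nested thread-block map (sketch)."""
    targets = []
    for sd in sdfg.all_sdfgs_recursive():
        for state in sd.states():
            for node in state.nodes():
                if not (isinstance(node, nodes.MapEntry)
                        and node.map.schedule == dtypes.ScheduleType.GPU_Device):
                    continue
                scope = state.scope_subgraph(node, include_entry=False, include_exit=False)
                # NOTE: a proper check should also recurse into nested SDFGs.
                has_tb = any(isinstance(n, nodes.MapEntry)
                             and n.map.schedule == dtypes.ScheduleType.GPU_ThreadBlock
                             for n in scope.nodes())
                if not has_tb:
                    targets.append((sd, state, node))

    for sd, state, node in targets:
        # Tile the device map with the default block size; the original map becomes
        # the inner map, which is then scheduled as the thread-block map.
        MapTiling.apply_to(sd, options=dict(tile_sizes=DEFAULT_BLOCK[:len(node.map.range)]),
                           map_entry=node)
        state.entry_node(node).map.schedule = dtypes.ScheduleType.GPU_Device
        node.map.schedule = dtypes.ScheduleType.GPU_ThreadBlock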

@tbennun (Collaborator) commented Jun 16, 2025

Excellent PR, thank you for doing it! Please add a comment in #2036 to mention that TB maps need to be added as a pass.
Also, I would use the Pass/Pipeline API (specifically passes) to apply the transformation effectively, but this one should be fine too.

Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. Please check cuda.py (or infer_types.py) for how to use the existing functionality.
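For illustration, a recursive check could look roughly like this (a sketch only; the existing helpers in cuda.py / infer_types.py may differ in name and behaviour):

from dace import dtypes
from dace.sdfg import nodes

TB_SCHEDULES = (dtypes.ScheduleType.GPU_ThreadBlock,
                dtypes.ScheduleType.GPU_ThreadBlock_Dynamic)

def has_nested_tb_map(state, map_entry) -> bool:
    """True if the scope of map_entry, including nested SDFGs, contains a thread-block map."""
    scope = state.scope_subgraph(map_entry, include_entry=False, include_exit=False)
    for node in scope.nodes():
        if isinstance(node, nodes.MapEntry) and node.map.schedule in TB_SCHEDULES:
            return True
        if isinstance(node, nodes.NestedSDFG):
            # scope_subgraph does not descend into nested SDFGs, so recurse manually.
            for nstate in node.sdfg.states():
                for n in nstate.nodes():
                    if isinstance(n, nodes.MapEntry) and n.map.schedule in TB_SCHEDULES:
                        return True
    return False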

@tbennun tbennun changed the title [WIP] Add ThreadBlock Maps as Preprocessing Add ThreadBlock Maps as Preprocessing Jun 16, 2025
@tbennun tbennun marked this pull request as draft June 16, 2025 00:46
@ThrudPrimrose (Collaborator, Author)

> Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. […]

Yes, that is probably it. I used a variation of this transformation for the auto-tiling passes, so the transformation should already exist. We could also add a pass version of the same transformation so that it aligns in style with the other preprocessing passes; then we would have both.
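A rough sketch of what such a pass version could look like (the class name and structure are assumptions; only the Pass API from dace.transformation.pass_pipeline is used, and the actual rewriting logic is elided):

from typing import Any, Dict, Optional
import dace
from dace.transformation import pass_pipeline as ppl

class AddThreadBlockMaps(ppl.Pass):
    """Adds a thread-block map to GPU_Device maps that lack one (sketch)."""

    def modifies(self) -> ppl.Modifies:
        return ppl.Modifies.Scopes

    def should_reapply(self, modified: ppl.Modifies) -> bool:
        return bool(modified & ppl.Modifies.Scopes)

    def apply_pass(self, sdfg: dace.SDFG, pipeline_results: Dict[str, Any]) -> Optional[int]:
        applied = 0
        # ...find GPU_Device maps without a nested thread-block map (see the
        # recursive check above) and wrap them, e.g. via MapTiling...
        return applied or None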

@ThrudPrimrose (Collaborator, Author) commented Jun 16, 2025


I have a new question. I updated the code to check the gpu_block_size variable to see whether the user set the block size through that feature. Now the question is:

    def test_highdim_default_block_size():
    
        @dace.program
        def tester(a: dace.float64[1024, 1024] @ dace.StorageType.GPU_Global):
            for i, j in dace.map[0:1024, 0:1024] @ dace.ScheduleType.GPU_Device:
                a[i, j] = 1
    
        with dace.config.set_temporary('compiler', 'cuda', 'default_block_size', value='32, 8, 2'):
            with pytest.warns(UserWarning, match='has more dimensions'):
                sdfg = tester.to_sdfg()
                gpu_code = sdfg.generate_code()[1]
>               assert 'dim3(32, 16, 1)' in gpu_code.code
E               assert 'dim3(32, 16, 1)' in '\n#include <cuda_runtime.h>\n#include <dace/dace.h>\n\n                                                              ...ms[0]);    ////__DACE:0:0:4\n    DACE_KERNEL_LAUNCH_CHECK(__err, "KernelEntryMap_0_0_4", 32, 128, 1, 32, 8, 1);\n}\n\n'
E                +  where '\n#include <cuda_runtime.h>\n#include <dace/dace.h>\n\n                                                              ...ms[0]);    ////__DACE:0:0:4\n    DACE_KERNEL_LAUNCH_CHECK(__err, "KernelEntryMap_0_0_4", 32, 128, 1, 32, 8, 1);\n}\n\n' = <dace.codegen.codeobject.CodeObject object at 0x7a897435c680>.code

If we have a 2D map [0:N, 0:N] and pass the block size [32, 8, 2], I would expect the generated kernel to be launched with [32, 8]. The previous behaviour is to collapse the extra dimensions; should that be kept? I would actually raise an error here because, IMHO, if a user passes this, they may have made a mistake somewhere, and implicitly hiding errors is probably not a good idea.

But of course, this is the default value. I think the correct behaviour for this pass is to collapse when the default value from the config is used, but if the value was explicitly set by the user, we should check that its dimensions match the map dimensions (if the map is 1D, the thread block should be at most 1D as well).
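Roughly what I have in mind (a sketch; the function, its arguments, and the user_specified flag are placeholders rather than the actual codegen code):

import warnings

def resolve_block_size(block_size, map_ndim, user_specified):
    """Collapse or reject extra block dimensions for a map with map_ndim dimensions."""
    block = list(block_size) + [1] * (3 - len(block_size))
    extra_dims = [d for d in range(map_ndim, 3) if block[d] != 1]
    if extra_dims and user_specified:
        # An explicit gpu_block_size with more dimensions than the map is likely
        # a user mistake, so fail loudly instead of silently collapsing.
        raise ValueError(f'gpu_block_size {block_size} has more dimensions than '
                         f'the {map_ndim}-dimensional map')
    if extra_dims:
        # Default block size from the config: collapse trailing dimensions into
        # the last map dimension and warn.
        for d in range(2, map_ndim - 1, -1):
            block[d - 1] *= block[d]
            block[d] = 1
        warnings.warn(f'Default block size {tuple(block_size)} has more dimensions '
                      f'than the map; collapsed to {tuple(block)}')
    return tuple(block)

For example, resolve_block_size((32, 8, 2), 2, user_specified=False) warns and returns (32, 16, 1), while user_specified=True raises instead.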

@alexnick83 (Contributor) left a review comment:

LGTM

print_warning = True
self.thread_block_size_y = 1
self.thread_block_size_z = 1
new_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
Comment on the snippet above (Contributor):

I am thinking that it would be simpler to write something similar to:

old_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
new_block = list(old_block)
# Fold block dimensions beyond the map's dimensionality into the last map dimension
for d in range(2, num_dims_in_map - 1, -1):
    new_block[d - 1] *= new_block[d]
    new_block[d] = 1
new_block = tuple(new_block)
if new_block != old_block:
    warnings.warn(...)

It may not be immediately as readable as the current code, though, so this is just a suggestion.
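With the failing test above as an example, old_block = (32, 8, 2) and a 2-dimensional map would fold to (32, 16, 1), which matches the dim3(32, 16, 1) the test expects.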

@alexnick83 (Contributor)

> If we have a 2D map [0:N, 0:N] and pass the block size [32, 8, 2], I would expect the generated kernel to be launched with [32, 8]. The previous behaviour is to collapse the extra dimensions; should that be kept? […]

I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. I also think that we should consider revisiting how thread-blocks are matched to map dimensions in light of new CUDA developments.

@ThrudPrimrose (Collaborator, Author)

> I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. […]

I think collapsing is also fine. I think the remaining issues are related to GPU_Persistent maps (and the GPU_Device maps within them).
