Add ThreadBlock Maps as Preprocessing #2048
base: main
Conversation
Excellent PR, thank you for doing it! Please add a comment in #2036 to mention that TB maps need to be added as a pass. Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. Please check cuda.py (or infer_types.py) for how to use the existing functionality.
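A minimal sketch of such a recursive check, assuming DaCe's all_nodes_recursive helper and the GPU_ThreadBlock schedule type (the existing helpers in cuda.py / infer_types.py may differ):

    import dace
    from dace.sdfg import nodes

    def has_thread_block_map(sdfg: dace.SDFG) -> bool:
        # Walk all states, including nested SDFGs, and look for a map
        # entry scheduled as a GPU thread-block map.
        for node, _ in sdfg.all_nodes_recursive():
            if (isinstance(node, nodes.MapEntry)
                    and node.map.schedule == dace.ScheduleType.GPU_ThreadBlock):
                return True
        return False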
Yes, it is probably that. I used a variation of this transformation for the auto-tiling passes, so the transformation should already exist. We could add a pass version of the same transformation so that it aligns in style with the other preprocessing passes; then we can have both.
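A pass version could follow the existing pass-pipeline skeleton. A rough sketch, assuming DaCe's pass_pipeline module, with the class name and body purely illustrative:

    from dace import SDFG
    from dace.transformation import pass_pipeline as ppl

    class AddThreadBlockMaps(ppl.Pass):
        # Hypothetical preprocessing pass that wraps GPU device maps
        # lacking an explicit thread-block map.

        def modifies(self) -> ppl.Modifies:
            return ppl.Modifies.Scopes

        def should_reapply(self, modified: ppl.Modifies) -> bool:
            return bool(modified & ppl.Modifies.Scopes)

        def apply_pass(self, sdfg: SDFG, pipeline_results: dict) -> None:
            # Apply the existing transformation to every GPU device map
            # that has no nested thread-block map.
            ...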
I have a new question; I updated the code to check this.
If we have a 2D map [0:N, 0:N] and pass the block size [32, 8, 2], I would expect the generated code to be launched with [32, 8]. But of course, this is the default value. I think the correct behavior is for this pass to collapse the block size when the default value from the config is used; if the value was set by the user, we should instead check that its dimensions match the map dimensions (if the map is 1D, then the thread block should also be at most 1-dimensional).
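A minimal sketch of that user-input check, with all names hypothetical:

    def validate_user_block_size(block, map_ndim):
        # Reject a user-set thread block size that uses more dimensions
        # than the map itself (trailing 1s are fine); collapsing remains
        # the behavior for the config default.
        extra = [b for b in block[map_ndim:] if b != 1]
        if extra:
            raise ValueError(
                f'Thread block size {tuple(block)} has more dimensions '
                f'than the {map_ndim}-dimensional map')

    validate_user_block_size((32, 8, 1), 2)  # OK
    validate_user_block_size((32, 8, 2), 2)  # raises ValueError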
LGTM
print_warning = True
self.thread_block_size_y = 1
self.thread_block_size_z = 1
new_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
I am thinking that it would be simpler to write something similar to:
old_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
new_block = list(old_block)
# Fold block dimensions beyond the map's dimensionality into the last usable one
for d in range(2, num_dims_in_map - 1, -1):
    new_block[d - 1] *= new_block[d]
    new_block[d] = 1
new_block = tuple(new_block)
if new_block != old_block:
    warnings.warn(...)
It may not be as immediately readable as the current code though, so this is just a suggestion.
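For instance, a standalone version of this suggestion (names illustrative) would collapse a 3D block onto a 2D map as follows:

    import warnings

    def collapse_block_size(block, num_dims_in_map):
        # Fold thread-block dimensions beyond the map's dimensionality
        # into the last usable dimension, warning if anything changed.
        new_block = list(block)
        for d in range(2, num_dims_in_map - 1, -1):
            new_block[d - 1] *= new_block[d]
            new_block[d] = 1
        new_block = tuple(new_block)
        if new_block != tuple(block):
            warnings.warn(f'Collapsed thread block size {tuple(block)} to {new_block}')
        return new_block

    print(collapse_block_size((32, 8, 2), 2))  # (32, 16, 1)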
I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. I also think we should revisit how thread blocks are matched to map dimensions in light of new CUDA developments.
I think collapsing is also fine. The current issues seem related to GPU_Persistent maps (and the GPU_Device maps within them).
I am testing how many existing test cases would break if we were to add a "thread block" map to every GPU kernel that does not already have one, to simplify codegen by removing case distinctions. [WIP]