Fix CUDA GlobalToGlobal codegen error by raising NotImplementedError #2041
base: main
Conversation
Co-authored-by: tbennun <[email protected]>
Code looks good, test needs improvement.
tests/codegen/gpu_memcpy_test.py
Outdated
B_node = state.add_access("B")

# Create a simple edge that should trigger GlobalToGlobal copy
state.add_nedge(A_node, B_node, dace.Memlet("A[0:100] -> B[0:100]"))
@copilot I don't think this test works. You should explicitly put it inside of a GPU_Device map.
Maybe try using the NumPy frontend with a for i in dace.map[0:N] @ dace.ScheduleType.GPU_Device: loop, as in the sketch below.
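A minimal sketch of that suggestion, assuming the arrays are declared with GPU_Global storage via the descriptor @-annotation; the names copy_kernel, N, A, and B are illustrative and not taken from the PR:

import dace

N = dace.symbol('N')

@dace.program
def copy_kernel(A: dace.float64[N] @ dace.StorageType.GPU_Global,
                B: dace.float64[N] @ dace.StorageType.GPU_Global):
    # Element-wise copy inside an explicit GPU_Device map
    for i in dace.map[0:N] @ dace.ScheduleType.GPU_Device:
        B[i] = A[i]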
Fixed the test to use the proper approach with @dace.program and explicit GPU_Device map scheduling. The test now uses slice assignment B[:] = A[:]
within a GPU_Device map to create a direct copy between GPU_Global arrays, which correctly triggers the NotImplementedError during compilation. Commit 70e8a3b
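A hedged sketch of a test in that shape; the program and test function names are placeholders, and the actual test in commit 70e8a3b may differ in details (for example, it may carry a GPU pytest marker):

import pytest
import dace

N = 100

@dace.program
def gpu_global_to_global(A: dace.float64[N] @ dace.StorageType.GPU_Global,
                         B: dace.float64[N] @ dace.StorageType.GPU_Global):
    for i in dace.map[0:1] @ dace.ScheduleType.GPU_Device:
        # Direct GPU_Global -> GPU_Global copy inside a GPU_Device map
        B[:] = A[:]

def test_global_to_global_copy_raises():
    sdfg = gpu_global_to_global.to_sdfg()
    # Code generation should now fail with the descriptive error
    with pytest.raises(NotImplementedError):
        sdfg.compile()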
LGTM, waiting for @ThrudPrimrose's review
raise NotImplementedError(
    "GPU global memory to global memory copies need to be more explicitly specified in the code. "
    "Consider using shared memory, different memory scopes, or explicit synchronization patterns.")
Doesn't the NotImplementedError message sound a bit arbitrary?
Does "being more explicitly specified in the code" for a Glb -> Glb copy mean that it needs to be copied within a map (e.g. by applying CopyToMap as a transformation)?
If a program absolutely needs a global-to-global memory copy (or the user wants one), asking for different memory locations seems irrelevant: different memory scopes won't help, and we don't have explicit synchronization patterns in DaCe (and what would adding a __syncthreads() do here anyway)?
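For reference, a hedged sketch of the "copy it within a map" route, assuming the CopyToMap transformation is importable from dace.transformation.dataflow as in recent DaCe versions; the expand_global_copies helper is illustrative only:

import dace
from dace.transformation.dataflow import CopyToMap

def expand_global_copies(sdfg: dace.SDFG) -> int:
    # Turn eligible copy edges (e.g. GPU_Global -> GPU_Global) into explicit
    # maps instead of relying on the runtime GlobalToGlobal helpers.
    # Returns the number of transformations applied.
    return sdfg.apply_transformations_repeated(CopyToMap)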
What about:
NotImplementedError(
    "GPU global memory to global memory copies need to be more explicitly specified in the code. Consider moving the copy to a map using the CopyToMap transformation or manually."
)
Looks good to me except the error message; please see the suggestion I commented on the error message.
The issue is not that global-to-global copy is unimplemented; the issue is that the only function available assumes the source is const.
Btw, in the example I showed, the error occurs not because glb-to-glb copy is not implemented, but because the function always assumes the src is const. In this case it is not, since src and dst are the same, from
alongside the original:
We could also use enable_if:
to decrease code cluttering.
This PR addresses a CUDA code generation error that occurs when trying to copy between GPU_Global storage locations within GPU_Device maps. The issue manifested as compilation errors due to non-existent dace::GlobalToGlobal function calls being generated.

Problem
When the CUDA code generator encounters a copy operation between two GPU_Global arrays within a GPU_Device map context, it attempts to generate function calls like:
dace::GlobalToGlobal1D
dace::GlobalToGlobal2D
dace::GlobalToGlobal3D
While some of these functions exist in the runtime library (GlobalToGlobal1D and GlobalToGlobal1DDynamic), higher-dimensional versions and certain edge cases are not implemented, leading to compilation failures with cryptic "cannot be found" errors.

Solution
Instead of generating potentially invalid function calls, the code now detects this scenario and raises a clear NotImplementedError with guidance (the message shown in the diff above).

Changes Made
dace/codegen/targets/cuda.py: Added a 6-line check in the _emit_copy() method to detect GlobalToGlobal cases and raise a descriptive error (a sketch of such a check is shown below)
tests/codegen/gpu_memcpy_test.py: Added a test case to validate the new behavior

The fix is minimal and surgical, affecting only the specific problematic case while leaving all other GPU copy operations unchanged.
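For illustration, a hedged sketch of the kind of guard described above, written as a standalone helper rather than the actual inline check in cuda.py; the _check_global_to_global name and the storage-type parameters are assumptions, and the message text follows the error quoted earlier in this thread:

from dace import dtypes

def _check_global_to_global(src_storage: dtypes.StorageType,
                            dst_storage: dtypes.StorageType) -> None:
    # Hypothetical helper mirroring the guard added to _emit_copy():
    # refuse to emit a dace::GlobalToGlobal* call and fail early instead.
    if (src_storage == dtypes.StorageType.GPU_Global
            and dst_storage == dtypes.StorageType.GPU_Global):
        raise NotImplementedError(
            "GPU global memory to global memory copies need to be more explicitly specified in the code. "
            "Consider using shared memory, different memory scopes, or explicit synchronization patterns.")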
Benefits
Fixes #335.