Might be a bug with ReductionKernel and stream capture #8318
Comments
It is rare that we see adventurous users trying out stream capture (it was sort of in an "experimental" state, so we didn't advertise it; see #6290), so thanks for reaching out and raising the question! I would think, at least from the CUDA perspective, that Cases 5 & 6 are expected. The key point is: during stream capture there is no actual kernel launch. So all CUDA sees with the line `a = CudaMin(A)` during capture is:

1. a new output array is allocated (from CuPy's memory pool), and
2. a kernel launch is recorded, with the input/output pointer addresses baked into the graph.

By the time the graph is launched, the recorded pointer addresses are reused for the actual kernel launch. Note that for step 2 we rely on the fact that there's a mempool; if we were to disable the pool and only use bare […]
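To make this concrete, here is a minimal sketch (assuming the experimental `begin_capture()`/`end_capture()` API from #6290; the array and values are made up):

```python
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)
A = cp.arange(4, dtype=cp.float32)

with stream:
    stream.begin_capture()
    b = A * 2  # no kernel runs yet: CuPy allocates `b` from its mempool and
               # only the launch (with b's pointer address) is recorded
    graph = stream.end_capture()

print(b)  # nothing has executed yet; contents are whatever was in that memory
graph.launch(stream=stream)
stream.synchronize()
print(b)  # [0. 2. 4. 6.]: the replay wrote through the recorded pointer
```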
Thanks for the answer! To be honest, I am not very familiar with CUDA and am just an amateur programmer, so I could not fully understand the stream-capture behavior from your answer. For example, I still don't understand why the value at address […]
Sorry I dropped the ball. @Sa1ntPr0, these are all legitimate questions. Let me focus on Case 5, since the confusion comes from the same root cause (the interplay between Python, CuPy, and CUDA).
In Case 5, it's because you originally had `a = cp.asarray(10, dtype=cp.float32)` at the beginning, but later, during the capture of graph 2, you bind a new array instance to the name:

```python
with stream:
    ...
    a = CudaMin(A)
    ...
```

and so at later times, when […]
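In other words, a sketch of the Case 5 mechanics (with `CudaMin` reconstructed here as a generic min `ReductionKernel`, since the original definition isn't shown):

```python
import cupy as cp

# A stand-in for the reporter's CudaMin; any min reduction behaves the same way.
CudaMin = cp.ReductionKernel(
    'float32 x',     # input
    'float32 y',     # output
    'x',             # map expression
    'fminf(a, b)',   # reduce expression
    'y = a',         # post-reduction map
    '3.4028235e38',  # identity: float32 max, the neutral element for min
    'cuda_min')

A = cp.array([3.0, 1.0, 2.0], dtype=cp.float32)
a = cp.asarray(10, dtype=cp.float32)  # the pre-capture array bound to the name `a`
old_a = a                             # keep a handle on the original buffer

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    stream.begin_capture()
    a = CudaMin(A)  # rebinds `a` to a NEW array allocated during capture;
                    # the graph records the new pointer, never old_a's
    graph = stream.end_capture()

graph.launch(stream=stream)
stream.synchronize()
print(old_a)  # still 10.0: the original buffer was never written
print(a)      # 1.0: the replay wrote into the newly allocated buffer
# CudaMin(A, out=a) inside the capture would instead record old_a's pointer.
```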
Description
Perhaps this behavior is intentional, but I did not find information about it and it really confused me when I encountered it. I am sorry if I am missing something.
I recorded two graphs via stream capture and used `cupy.ReductionKernel` in both graphs: I needed to write the minimum value of some array into a 0-dimensional CuPy variable. I found that one of the graphs didn't work as intended, yet it didn't raise any errors either. It turned out that this behavior was caused by the line `a = CudaMin(A)` in both graphs; replacing it with `CudaMin(A, out=a)` solved the problem. (`CudaMin` is my `ReductionKernel`.)
When `a = CudaMin(A)` was used, the operation was skipped in one of the graphs and the variable `a` remained unchanged.
I hope this code shows what I'm talking about. The most confusing cases are 5 and 6.
To Reproduce
Just imports; this part is the same for all code blocks below.
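Presumably something like (assuming only CuPy is needed):

```python
import cupy as cp
```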
Case 1:
Output 1:
Case 2:
Output 2:
Case 3:
Output 3:
Case 4:
Output 4:
Case 5:
Output 5:
Case 6:
Output 6:
Installation
Conda-Forge (`conda install ...`)

Environment
Additional Information
No response