Remove unnecessary device and stream syncs#129
Remove unnecessary device and stream syncs#129Edenzzzz wants to merge 2 commits intoHazyResearch:mainfrom
Conversation
| cudaStreamSynchronize(stream); | ||
| cudaDeviceSynchronize(); | ||
| // cudaStreamSynchronize(stream); | ||
| // cudaDeviceSynchronize(); |
There was a problem hiding this comment.
No need to call these when there isn't a immediate error checking
| } | ||
|
|
||
| CHECK_CUDA_ERROR(cudaGetLastError()); | ||
| cudaStreamSynchronize(stream); |
There was a problem hiding this comment.
Usually we don't always do this, torch uses a flag CUDA_LAUNCH_BLOCKING
|
cc @DanFu09 |
|
This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream and management in the dispatch. There's another set of changes I'll work on upstreaming this week that should address this. |
|
Could you elaborate on the correctness issue? Precision for layernorm is the same |
|
Here's flashinfer's stream dispatch (without any sync) |
|
Yep these two lines are the things we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is it makes compile more expensive, so ideally we gate it behind a compiler flag. I have a solution sitting around on a branch, I just need to move it to main :) |
|
Oh I see, thank you! |
|
Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn Any other optimizations you see there? It's pretty old code at this point :) |
|
I think the device syncs in lin_attn.cu and layer norm can be removed :) |
When reading lots of TK kernels, I see a bunch of
cudaDeviceSynchronizeandcudaStreamSynchronize(), even sometimes consecutively. However, they are very expensive and should not be used inside general kernel dispatch except for testing and debugging purposes.I see the following use cases for them:
cudaDeviceSynchronizecan be used to time kernels in benchmarks (though cuda events will likely be more accurate)cudaDeviceSynchronizefor debugging, and we can add a debug flag to trigger stream sync.Removing them can speed up 10%-20% when porting code to general use cases.
See hao-ai-lab/FastVideo#517
Thanks.