performance of odlcuda #29
Hard to tell from this. I assume you have a CUDA-adapted version of |
This is a great tool! I will play around with it a bit to see what I can improve here. Interestingly, it has a problem with ODL's |
Okay, it seems to patch the function in place, changing its signature, which we don't allow. You can avoid that issue by temporarily adding an indirection, like so:

Before:

```python
class MyOperator(...):
    ...

    @profile  # <-- not allowed, raises error
    def _call(self, x, out):
        # do stuff
        return out
```

After:

```python
class MyOperator(...):
    ...

    def _call(self, x, out):
        return self._call_temp(x, out)

    @profile  # <-- all good
    def _call_temp(self, x, out):
        # do stuff
        return out
```
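For anyone hitting the same error, here is a minimal sketch of why the indirection helps. The `profile` stand-in below is my own illustration, not the actual profiler's code; the assumption is that the real decorator wraps the method in a generic `(*args, **kwargs)` function without preserving its signature, which a signature check like the one on `_call` would then reject:

```python
import inspect

# Hypothetical stand-in for the profiler's decorator: it replaces the
# method with a generic wrapper, hiding the original (self, x, out)
# signature.
def profile(func):
    def wrapper(*args, **kwargs):
        # timing/bookkeeping would happen here in the real profiler
        return func(*args, **kwargs)
    return wrapper

class MyOperator:
    @profile
    def _call(self, x, out):
        # do stuff
        return out

# The introspected signature is now the wrapper's, not the method's:
print(inspect.signature(MyOperator._call))  # prints "(*args, **kwargs)"
```

Decorating a private `_call_temp` instead leaves `_call` itself with the signature ODL expects, while still profiling the body.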
I was running some experiments, and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or whether this is just a general CUDA phenomenon. Below is the code that I ran, with its output. I don't think it is very important to know what this code is doing, but if need be, I am happy to explain it.
With CUDA, the runtime scales linearly with the number of subsets, even though each run executes approximately the same number of flops; the numpy timings are essentially constant. The question is whether this is an issue with odlcuda or a general CUDA phenomenon. None of these timings involve copying to or from the GPU. I am aware that launching kernels incurs some overhead, but I would not have expected it to be this dramatic, particularly since even the smallest subset still contains 262 x 65000 = 17 million elements.
Do you think that this is related to the way ODL is written or do you think this is a general CUDA thing?