performance of odlcuda #29

Open
mehrhardt opened this issue Nov 9, 2017 · 3 comments

@mehrhardt

I was running some experiments and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or whether this is just a general CUDA phenomenon. Below is the code I ran, together with its output. I don't think it is very important to know what this code is doing, but if need be, I am happy to post it.

With CUDA, the runtime scales roughly linearly with the number of subsets, even though each run executes approximately the same number of flops; the numpy timings are essentially constant. The question is whether this is an issue with odlcuda or a general CUDA phenomenon. In all of these timings there is no copying to or from the GPU. I am aware that launching kernels incurs an overhead, but I would not have expected it to be so dramatic, in particular since the smallest subset still contains 262 x 65000 ≈ 17 million elements.

Do you think this is related to the way ODL is written, or is it a general CUDA thing?

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)

impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms

impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms

impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s

impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s

impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s

impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s

impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0        
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)

impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276

impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915

impl:cuda, shape:[262, 65000], nsubsets:16
time:2.36511826515, average:0.147819891572

impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772

impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905

impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634

impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s
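
One way to separate per-call overhead from per-element work would be to time a single reduction on a subset-sized array versus the full-sized array, using only the odl calls above. A minimal sketch (assuming x.inner(x) as the timed operation, since its scalar result forces the GPU to finish before the timer stops):

import timeit
import odl

# Compare a subset-sized array against the full-sized array, both with impl='cuda'.
for shape in [[262, 65000], [4200, 65000]]:
    space = odl.uniform_discr([0, 0], shape, shape, impl='cuda')
    x = space.one()
    # inner() returns a scalar, so the device must finish before the timer
    # stops; average over several calls to smooth out launch jitter.
    t = timeit.timeit(lambda: x.inner(x), number=20) / 20
    print('shape:{}, time per call:{:.2e}s'.format(shape, t))

If the time per call is nearly independent of the shape, the cost is dominated by per-call overhead rather than by the number of elements, which would point to a general CUDA launch-overhead effect rather than to odlcuda specifically.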

@kohr-h
Member

kohr-h commented Nov 10, 2017

Hard to tell from this. I assume you have a CUDA-adapted version of KullbackLeibler, otherwise the CUDA times would be worse.
To get a more precise picture of the time spent in different parts of the code, I recommend using line_profiler. Just follow the instructions there; in your case, decorate the _call of KullbackLeibler with @profile and run each of the splittings separately. You will then get a nice breakdown of the relative time spent in each line of the function. It's extremely helpful and should help you find the bottlenecks.
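
For reference, the basic workflow is roughly as follows (the script name is just a placeholder):

pip install line_profiler      # provides the kernprof command
kernprof -l -v my_script.py    # -l: line-by-line profiling, -v: print the report when done

When the script is run through kernprof -l, the @profile decorator is injected as a builtin, so it does not need to be imported in the script.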

@mehrhardt
Author

This is a great tool! I will play around with that a bit to see what I can improve here.

Interestingly, it has a problem with ODL's _call function:

Traceback (most recent call last):
  File "/home/me404/.conda/envs/spdhg-py27/bin/kernprof", line 6, in <module>
    sys.exit(kernprof.main())
  File "/home/me404/.local/lib/python2.7/site-packages/kernprof.py", line 222, in main
    execfile(script_file, ns, ns)
  File "test_times_KL_profile.py", line 52, in <module>
    f_prox = f[i].convex_conj.proximal(x[i])
  File "/store/DAMTP/me404/repositories/github_myODL/odl/operator/operator.py", line 422, in __new__
    call_has_out, call_out_optional, _ = _dispatch_call_args(cls)
  File "/store/DAMTP/me404/repositories/github_myODL/odl/operator/operator.py", line 248, in _dispatch_call_args
    "".format(signature) + spec_msg)
ValueError: bad signature '_call(*args, **kwds)': variable arguments not allowed
Possible signatures are ('[, **kwargs]' means optional):

    _call(self, x[, **kwargs])
    _call(self, x, out[, **kwargs])
    _call(self, x, out=None[, **kwargs])

@kohr-h
Member

kohr-h commented Nov 10, 2017

Okay, it seems @profile patches the function in place, replacing its signature with (*args, **kwds), which we don't allow. You can avoid that issue by temporarily adding an indirection, like so:

Before:

class MyOperator(...):
    ...
    @profile  # <-- not allowed, raises error
    def _call(self, x, out):
        # do stuff
        return out

After:

class MyOperator(...):
    ...
    def _call(self, x, out):
        return self._call_temp(x, out)

    @profile  # <-- all good
    def _call_temp(self, x, out):
        # do stuff
        return out
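
This works because @profile replaces the decorated function with a wrapper whose signature is (*args, **kwds), which is exactly what _dispatch_call_args rejects in the traceback above. With the indirection, the _call that ODL inspects keeps its explicit signature, and line_profiler reports the per-line timings for _call_temp instead.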
