I am having a severe problem with training AlexNet (see alexnet.jl) in Julia (0.5.2) on my GPU (12 GB memory). I am training on the Caltech256 dataset (see main.jl).
Julia variant: I am out of memory when starting. See the following log:
julia> include("main.jl")
[15:36:03] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2:../../../data/caltech256-train.rec, use 4 threads for decoding..
[15:36:03] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2:../../../data/caltech256-val.rec, use 4 threads for decoding..
INFO: Start training on MXNet.mx.Context[GPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
[15:36:09] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO: TempSpace: Total 1345 MB allocated on GPU0
INFO: Start training...
[15:36:12] /home/antholzer/mxnet/dmlc-core/include/dmlc/./logging.h:304: [15:36:12] src/storage/./pooled_storage_manager.h:84: cudaMalloc failed: out of memory
Now if I run the same Python variant (alexnet.py, main.py), I have no problems with memory. With a batch size of 128 I am running at about 3 GB of GPU memory, and with a batch size of 256 at around 4 GB.
Note: At least I was able to train the Julia variant with a batch size of 16.
I wonder why the Julia variant blows up its memory... ❓ Does anyone have an idea about this or experience similar issues?
One potential reason I can think of immediately is the difference in memory/resource handling between Julia and Python. Python uses a reference counter to aggressively release resources as soon as the count drops to zero, while Julia uses a GC. The problem is that the GC potentially runs only once in a while or when memory gets low, and GPU memory is invisible to the Julia GC, so many of the NDArrays created on the GPU might not be released in time.
Currently I do not know of a better way to handle external resources in Julia. Maybe you can try explicitly calling the Julia GC after each batch to see if it helps? (Albeit it will probably slow down training a bit.)
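Something along these lines, for example (a minimal, untested sketch: it assumes MXNet.jl's mx.every_n_batch callback helper and the usual mx.fit arguments, with model, optimizer, train_provider and val_provider standing for whatever you set up in main.jl; gc() is the Julia 0.5/0.6 spelling, GC.gc() on later versions):

# Force a full collection after every batch so that finalizers of unreachable
# GPU-backed NDArrays run promptly and their device memory is returned.
force_gc(state) = gc()

mx.fit(model, optimizer, train_provider,
       n_epoch   = 20,
       eval_data = val_provider,
       callbacks = [mx.every_n_batch(force_gc, 1)])

Running the collection every batch is the most aggressive option; if it slows training too much, you could raise the interval (e.g. mx.every_n_batch(force_gc, 10)).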
I hope there will be a better solution to this problem in the future. This is actually kind of a deal-breaker for new people who want to try out Julia and MXNet...
@hesseltuinhof Thanks! Glad to hear that it works. Unfortunately, this is a limitation of Julia. Python has reference counting, which is great for managing not only memory but also other resources. Julia, however, relies on the GC for memory management, which is not very good for managing other kinds of resources. :(
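To illustrate the point (a rough sketch of the general pattern, not MXNet.jl's actual code): a GC-managed wrapper for an external resource in Julia typically frees its handle from a finalizer, so the release happens at some unspecified time after the object becomes unreachable rather than when it goes out of scope. free_device_memory below is a made-up placeholder for whatever C call actually frees the handle.

type GPUBuffer                        # Julia 0.5 syntax; `mutable struct` on 0.6+
    handle::Ptr{Void}
    function GPUBuffer(handle)
        obj = new(handle)
        # The handle is released only when the GC runs this finalizer,
        # not when `obj` goes out of scope. (Argument order is flipped
        # to finalizer(f, obj) on Julia >= 1.0.)
        finalizer(obj, b -> free_device_memory(b.handle))
        return obj
    end
end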