Hello,
I recently started using nnDetection and have noticed that my training epoch time increases significantly with the size of my training dataset.
To be more specific, I ran the nnDetection preprocessing on a large dataset of ~2k CT volumes, then trained a model using the generated splits_final.pkl file. One epoch with this configuration took 3 hours.
However, with the exact same preprocessing and training configuration, having only modified the splits_final.pkl file to include a random subset (~200 CT volumes) of the original training dataset (~2k CT volumes), the epoch time dropped to 12 minutes!
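For reference, the subset split was produced roughly as follows (a minimal sketch, assuming the usual nnDetection/nnU-Net splits format of a pickled list of folds, each a dict with "train" and "val" case-ID lists; path and subset size are placeholders):

```python
import pickle
import random

SPLITS_PATH = "splits_final.pkl"  # placeholder path to the preprocessed splits file
SUBSET_SIZE = 200                 # number of training cases to keep per fold

with open(SPLITS_PATH, "rb") as f:
    splits = pickle.load(f)       # assumed: list of folds, each {"train": [...], "val": [...]}

random.seed(0)                    # reproducible subset selection
for fold in splits:
    train_cases = list(fold["train"])
    fold["train"] = random.sample(train_cases, min(SUBSET_SIZE, len(train_cases)))

with open(SPLITS_PATH, "wb") as f:
    pickle.dump(splits, f)        # overwrite with the reduced training split
```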
Is there an explanation for this behavior?
Many thanks in advance.
That sounds rather surprising. Thank you for reporting the issue and sorry for getting back to you rather late due to my vacation. Is it possible to reproduce the issue with the toy dataset so I can have a look locally as well?
Theoretically, training time should remain independent of the dataset size, since the same number of batches/samples is drawn in each epoch. 12 minutes per epoch also sounds extremely fast; epoch times usually range somewhere between 20-40 minutes (sometimes slightly longer), depending on the configured strides of the network and the available GPU (assuming no other bottlenecks are present).
Best,
Michael
Edit: the only case I can think of is an IO bottleneck: by reducing the number of samples, the OS can cache the inputs, which alleviates the bottleneck. Even then, 12 minutes per epoch sounds quite quick and would depend heavily on the input to the network (e.g. 3D data with a rather small resolution).
Indeed, the bottleneck was the data IO; as you said, by reducing the number of samples the OS was able to cache the inputs.
However, we managed to alleviate the problem by changing the saved arrays from numpy memmap objects to zarr arrays (a modern alternative to HDF5). Loading zarr arrays makes the code run approximately 3 times faster on large datasets (~2k CT scans) compared to the original numpy configuration, and the bottleneck is again the computation on the GPU.
I would highly suggest looking into this for the data IO: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
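For illustration, a minimal sketch of the kind of conversion we did (paths, chunk sizes, and array shapes are placeholders, not nnDetection's actual preprocessing code):

```python
import numpy as np
import zarr

def convert_case_to_zarr(npy_path: str, zarr_path: str, chunks=(1, 64, 64, 64)):
    """Convert a memmapped .npy volume to a chunked zarr array (illustrative only)."""
    src = np.load(npy_path, mmap_mode="r")           # memmap: avoids reading the full volume into RAM
    dst = zarr.open(zarr_path, mode="w",
                    shape=src.shape, chunks=chunks, dtype=src.dtype)
    dst[:] = src                                      # reads from the memmap and writes into the chunked store
    return dst

# During training, only the chunks overlapping the requested patch are read from disk:
# vol = zarr.open("case_000.zarr", mode="r")
# patch = vol[:, 0:64, 0:64, 0:64]
```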