Hello,
I recently started using nnDetection and have noticed that my training epoch time increases significantly with the size of my training dataset.
To be more specific, I ran the nnDetection preprocessing on a large dataset of ~2k CT volumes, then trained a model using the generated splits_final.pkl file. One epoch with this configuration took 3 hours.
However, with the exact same preprocessing and training configuration, having only modified the splits_final.pkl file to include a random subset (~200 CT volumes) of the original training dataset (~2k CT volumes), the epoch time dropped to 12 minutes!
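For reference, the subset split was produced roughly as follows (a minimal sketch, assuming the usual nnDetection/nnU-Net splits format of a pickled list of folds, each a dict with "train" and "val" case-ID lists; path and subset size are placeholders):

```python
import pickle
import random

SPLITS_PATH = "splits_final.pkl"  # placeholder path to the preprocessed splits file
SUBSET_SIZE = 200                 # number of training cases to keep per fold

with open(SPLITS_PATH, "rb") as f:
    splits = pickle.load(f)       # assumed: list of folds, each {"train": [...], "val": [...]}

random.seed(0)                    # reproducible subset selection
for fold in splits:
    train_cases = list(fold["train"])
    fold["train"] = random.sample(train_cases, min(SUBSET_SIZE, len(train_cases)))

with open(SPLITS_PATH, "wb") as f:
    pickle.dump(splits, f)        # overwrite with the reduced training split
```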
Is there an explanation for this behavior?
Many thanks in advance.
That sounds rather surprising. Thank you for reporting the issue and sorry for getting back to you rather late due to my vacation. Is it possible to reproduce the issue with the toy dataset so I can have a look locally as well?
Theoretically, training time should remain independent of the dataset size, since the same number of batches/samples is drawn in each epoch. 12 minutes per epoch also sounds extremely fast; epoch times usually range somewhere between 20-40 minutes (sometimes slightly longer), depending on the configured strides of the network and the available GPU (assuming no other bottlenecks are present).
Best,
Michael
Edit: the only case I can think of is an IO bottleneck: by reducing the number of samples, the OS can cache the inputs, which alleviates the bottleneck. Even then, 12 minutes per epoch sounds quite quick and would depend heavily on the input to the network (e.g. 3D data with a rather small resolution).
Indeed, the bottleneck was the data IO; as you said, by reducing the number of samples the OS was able to cache the inputs.
However, we managed to alleviate the problem by changing the saved arrays from numpy memmap objects to zarr arrays (a modern alternative to HDF5). Loading zarr arrays makes the code run approximately 3 times faster on large datasets (~2k CT scans) compared to the original numpy configuration, and the bottleneck is again the computation on the GPU.
I would highly suggest looking into this for the data IO: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
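For illustration, a minimal sketch of the kind of conversion we did (paths, chunk sizes, and array shapes are placeholders, not nnDetection's actual preprocessing code):

```python
import numpy as np
import zarr

def convert_case_to_zarr(npy_path: str, zarr_path: str, chunks=(1, 64, 64, 64)):
    """Convert a memmapped .npy volume to a chunked zarr array (illustrative only)."""
    src = np.load(npy_path, mmap_mode="r")           # memmap: avoids reading the full volume into RAM
    dst = zarr.open(zarr_path, mode="w",
                    shape=src.shape, chunks=chunks, dtype=src.dtype)
    dst[:] = src                                      # reads from the memmap and writes into the chunked store
    return dst

# During training, only the chunks overlapping the requested patch are read from disk:
# vol = zarr.open("case_000.zarr", mode="r")
# patch = vol[:, 0:64, 0:64, 0:64]
```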