
New feature: adding a parameter to control the number of processes used by the validation dataloader #2053

Open
wants to merge 1 commit into master
Conversation

ancestor-mithril
Contributor

Motivation

Many nnUNet users run it on systems with limited hardware (there are many reports of the Some background worker is 6 feet under error).

Currently, the validation dataloader uses half as many processes as the training dataloader. A default nnUNet training uses 1 main process + 12 training dataloader processes + 6 validation dataloader processes, and each worker copies the main process's allocated memory.
On systems with modest RAM, once RAM utilization during training is high enough, the memory of the validation dataloaders (including validation data that has already been loaded) is moved to the swap partition, because these workers sleep until the end of the training epoch; this makes the start of validation very slow.
Reducing the number of dataloader processes (nnUNet_n_proc_DA) avoids OOM errors and slowdowns on systems that do not have enough RAM and start using the swap partition.
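
For context, the existing override works roughly like this; a minimal sketch, assuming the variable is read once at startup (the actual helper in nnunetv2 may differ in name and default):

import os

def get_n_proc_DA(default: int = 12) -> int:
    # Respect nnUNet_n_proc_DA if it is set, otherwise use the built-in default.
    value = os.environ.get("nnUNet_n_proc_DA")
    return int(value) if value is not None else default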

New feature

Reducing the number of validation dataloader processes via the new environment variable nnUNet_n_proc_DA_val enables better resource management: fewer processes for validation leaves room to allocate more processes to training. This is especially useful when the default value of num_val_iterations_per_epoch is lowered from 50 to 10.
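
A minimal sketch of the proposed behaviour (the function name and fallback shown here are illustrative; only nnUNet_n_proc_DA_val itself is part of this PR):

import os

def get_n_proc_DA_val(n_proc_train: int) -> int:
    # Use nnUNet_n_proc_DA_val for the validation dataloader if it is set;
    # otherwise keep the current behaviour of half the training workers.
    value = os.environ.get("nnUNet_n_proc_DA_val")
    return int(value) if value is not None else max(1, n_proc_train // 2)

Assuming the variable is read at trainer startup, it can be set like the other nnUNet environment variables, e.g. nnUNet_n_proc_DA_val=2 before launching nnUNetv2_train.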

@FabianIsensee
Member

Hey, is this really such a big issue? When searching for Some background worker is 6 feet under, I only find this issue.
The error people should see when DA background workers die is 'One or more background workers are no longer alive. Exiting. Please check [...]', and even for that I don't find any RAM-related issues in the repo. Is there something I am missing?
The default nnU-Net takes ~10 GB of RAM (Task002_Heart), all background processes included. That shouldn't cause any issues, really.

@ancestor-mithril
Contributor Author

This is strange: when I search for One or more background workers are no longer alive on GitHub, I also find next to nothing, but I am positive I have seen multiple comments and issues with this error.

Regardless, here is my example. This is a training on Amos22 with the pip-installed nnUNet and 10 data augmentation workers. The last 5 processes are the validation workers, each occupying roughly 0.5 GB of RAM.

PID     %CPU %MEM VSZ    RSS  START TIME    COMMAND
2083448 72.7 1.4  25416M 954M Apr09 1200:54 /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084441 50.6 1.0  9886M  646M Apr09 833:35  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084442 50.5 0.9  9829M  637M Apr09 832:29  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084443 50.5 0.7  9687M  492M Apr09 831:26  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084444 50.5 0.7  10005M 451M Apr09 832:09  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084445 50.7 0.8  9820M  519M Apr09 834:21  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084450 50.6 0.7  9687M  502M Apr09 833:59  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084452 50.6 0.4  9907M  258M Apr09 832:38  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084455 50.5 0.5  10093M 330M Apr09 832:06  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084456 50.5 0.9  9814M  578M Apr09 831:42  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2084458 50.5 0.8  9711M  525M Apr09 832:22  /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2085306 2.5  0.9  24933M 593M Apr09 41:25   /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2085307 2.4  0.9  24965M 626M Apr09 40:23   /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2085308 2.5  1.1  25090M 712M Apr09 42:12   /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2085309 2.6  0.9  24926M 593M Apr09 42:42   /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all
2085310 2.5  0.6  24754M 443M Apr09 42:24   /opt/conda/bin/python3.11 /opt/conda/bin/nnUNetv2_train 1813 3d_fullres all 

Starting another training in parallel increases RAM usage and the OS starts swapping (I have enough cores and VRAM, but not enough RAM). As a consequence, objects owned by the sleeping validation workers are moved to the swap partition, and with more DA processes the training becomes slower due to swap usage.
But if I remove the validation workers and decrease the validation steps to 1, I can start 2 trainings in parallel and allocate more workers to each training while maintaining considerable speed.

I don't think this is an isolated case: training nnUNet does not require a lot of VRAM, so modest workstations can also be used for training with some adjustments to the number of allocated processes.
