Most CPU threads are idle with multi-GPU and multi-dataloader training #10963
Unanswered
fuy34 asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
-
Dear @fuy34, Yes, something looks wrong. Would you mind sharing a reproducible script we could use to debug this behavior? All processes should create 6 workers to load the data.
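For reference, a minimal reproducible script of the kind requested above might look like the sketch below. It is not the asker's actual code: the dataset, model, and sizes are placeholders, and it assumes a PyTorch Lightning version that still accepts distributed_backend='ddp' in the Trainer (newer releases use strategy='ddp' instead).

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Stand-in dataset; the real one reads ~1 TB of small image files."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), torch.randint(0, 10, (1,)).item()


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        # With DDP, each of the 4 GPU processes builds its own DataLoader,
        # so 4 x 6 = 24 worker processes should exist in total.
        return DataLoader(RandomDataset(), batch_size=32, num_workers=6)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=4, distributed_backend="ddp", max_epochs=1)
    trainer.fit(BoringModel())
```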
-
I am training a multi-task model with 4 GPUs (distributed_backend='ddp'), and I set num_workers=6 for the DataLoader. Based on my understanding, each GPU process should spawn 6 dataloader workers, and they are supposed to perform data loading and augmentation in parallel. However, I found that most of the time, most CPU threads are idle and only one main thread seems to be working heavily (shown in the screenshot below).
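One way to check whether the 6 workers per process are actually being spawned is a small worker_init_fn that logs each worker as it starts. This is an illustrative sketch rather than code from the post, and the LOCAL_RANK environment variable is an assumption about how the DDP launcher identifies each GPU process; adjust it to whatever your setup provides.

```python
import os
from torch.utils.data import DataLoader


def log_worker(worker_id: int) -> None:
    # Runs once inside every spawned dataloader worker process.
    rank = os.environ.get("LOCAL_RANK", "0")
    print(f"[rank {rank}] dataloader worker {worker_id} started, pid={os.getpid()}")


# loader = DataLoader(dataset, batch_size=32, num_workers=6, worker_init_fn=log_worker)
# With 4 GPUs and num_workers=6, you would expect 24 such lines, six per rank.
```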
My dataset is fairly large (~1 TB) and contains lots of small files (images, depth maps, instance segmentation, etc.). I also have a customized data augmentation that performs random scaling, flipping, and cropping. The training is fast for the first 100-200 steps, then becomes 4-5 times slower, and the GPU usage frequently stays at 0% while waiting for data. I am not sure whether this is a general PyTorch issue or something specific to PyTorch Lightning.
Is there any way to keep all CPU threads active for data loading? Thank you in advance!
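For context, a hedged sketch of DataLoader settings that are commonly checked when workers sit idle is shown below. None of these values come from the original post, the dataset is a placeholder, and persistent_workers / prefetch_factor require a sufficiently recent PyTorch (>= 1.7).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real ~1 TB small-file dataset.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=6,            # one pool of 6 worker processes per DDP process
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs (PyTorch >= 1.7.1)
    prefetch_factor=4,        # batches each worker prefetches (default 2, PyTorch >= 1.7)
)
```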