SYS.DISTRIBUTED #9

Open
marson666 opened this issue Mar 23, 2022 · 7 comments

@marson666

I'm trying to train on the xing processed_data from scratch using DDP, with:
SYS.DISTRIBUTED True
SYS.WORLD_SIZE 4 (4 GPUs)

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
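
For context, point (1) of the message refers to the flag passed when the model is wrapped with DDP. A minimal sketch (placeholder names, not this repo's actual wrapping code) would look like:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    """Hypothetical helper: wrap a model with DDP while tolerating unused parameters."""
    model = model.to(local_rank)
    return DDP(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,  # the flag suggested in point (1) of the error message
    )
```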

@marson666 changed the title from SYS. to SYS.DISTRIBUTED on Mar 23, 2022
@ShenhanQian
Owner

Thanks for reporting this problem. We have tested and fixed it in the latest commit. Please pull and check whether the problem is gone on your side.

@marson666
Author

Maybe `@torch.no_grad()` should also be added here.

@ShenhanQian
Owner

ShenhanQian commented Apr 1, 2022

Why?

@marson666
Author

I used your latest repo to run the code and encountered the error below, so I added `@torch.no_grad()` to `reduce_tensor_dict` and it ran successfully.
python=3.8.12, torch=1.10.0+cu113

```
Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main()
  File "main.py", line 64, in main
    mp.spawn(run_distributed,
  File "/data/home/sen.ma/miniconda3/envs/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/home/sen.ma/miniconda3/envs/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data/home/sen.ma/miniconda3/envs/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/data/home/sen.ma/miniconda3/envs/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/home/sen.ma/projects/SpeechDrivesTemplates/main.py", line 58, in run_distributed
    run(args, cfg)
  File "/data/home/sen.ma/projects/SpeechDrivesTemplates/main.py", line 51, in run
    pipeline.train(cfg, exp_tag, args.resume_from)
  File "/data/home/sen.ma/projects/SpeechDrivesTemplates/core/pipelines/trainer.py", line 389, in train
    self.train_step(batch, t_step+1, global_step, epoch)
  File "/data/home/sen.ma/projects/SpeechDrivesTemplates/core/pipelines/voice2pose.py", line 312, in train_step
    self.reduce_tensor_dict(losses_dict)
  File "/data/home/sen.ma/projects/SpeechDrivesTemplates/core/pipelines/trainer.py", line 329, in reduce_tensor_dict
    v /= self.get_world_size()
RuntimeError: Output 0 of _DDPSinkBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
```
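
The change looks roughly like this (a sketch only; in the repo `reduce_tensor_dict` is a method of the trainer in core/pipelines/trainer.py and uses `self.get_world_size()`, here it is written as a free function):

```python
import torch
import torch.distributed as dist

@torch.no_grad()  # the added decorator; this reduction is only used for logging
def reduce_tensor_dict(tensor_dict, world_size):
    """Average each loss tensor across processes (sketch, not the exact repo code)."""
    if world_size < 2:
        return tensor_dict
    for v in tensor_dict.values():
        dist.all_reduce(v)   # sum the tensor over all ranks
        v /= world_size      # the in-place division from the traceback above
    return tensor_dict
```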

@ShenhanQian
Owner

I don't encounter the same problem on my machine. Maybe it's caused by a PyTorch version difference (I'm using PyTorch 1.7.0).

Since `reduce_tensor_dict` is only used for logging rather than training, using `@torch.no_grad()` is OK.

As another solution (which avoids the in-place operation), I suggest substituting `v /= self.get_world_size()` with `tensor_dict[k] = v / self.get_world_size()`.
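
In other words, something like this out-of-place variant (again sketched as a free function, while the repo's version is a method that calls `self.get_world_size()`):

```python
import torch.distributed as dist

def reduce_tensor_dict(tensor_dict, world_size):
    """Average each loss tensor across processes without the in-place division (sketch)."""
    if world_size < 2:
        return tensor_dict
    for k, v in tensor_dict.items():
        dist.all_reduce(v)               # sum the tensor over all ranks
        tensor_dict[k] = v / world_size  # out-of-place replacement for v /= world_size
    return tensor_dict
```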

@ShenhanQian
Owner

I have tested with the latest PyTorch distribution (1.11.0) and have not encountered the above problem. Did you modify the code and use `reduce_tensor_dict` before the backward pass?

@marson666
Author

I just changed the 2D landmarks to 3D and commented out the `lip_sync_error`.
