
Multi-GPU training #72

Open
Auroralyxa opened this issue May 12, 2022 · 0 comments

Comments

Auroralyxa commented May 12, 2022

Hi,
I am training the model on 8 P100 GPUs (16 GB each) and get the partial error output below:

number of params: 75833633
loading annotations into memory...
Done (t=20.13s)
creating index...
index created!
Start training
Epoch: [0]  [   0/7731]  eta: 15 days, 22:29:09  lr: 0.000100  class_error: 88.02  loss: 79.9960 (79.9960)  loss_bbox: 7.1450 (7.1450)  loss_bbox_0: 6.8492 (6.8492)  loss_bbox_1: 6.9822 (6.9822)  loss_bbox_2: 7.0482 (7.0482)  loss_bbox_3: 7.0655 (7.0655)  loss_bbox_4: 7.0992 (7.0992)  loss_ce: 3.8196 (3.8196)  loss_ce_0: 3.8680 (3.8680)  loss_ce_1: 3.7678 (3.7678)  loss_ce_2: 3.8110 (3.8110)  loss_ce_3: 3.8432 (3.8432)  loss_ce_4: 3.8701 (3.8701)  loss_dice: 0.7227 (0.7227)  loss_giou: 2.3688 (2.3688)  loss_giou_0: 2.3062 (2.3062)  loss_giou_1: 2.3316 (2.3316)  loss_giou_2: 2.3469 (2.3469)  loss_giou_3: 2.3410 (2.3410)  loss_giou_4: 2.3517 (2.3517)  loss_mask: 0.0581 (0.0581)  cardinality_error_unscaled: 306.0000 (306.0000)  cardinality_error_0_unscaled: 306.0000 (306.0000)  cardinality_error_1_unscaled: 306.0000 (306.0000)  cardinality_error_2_unscaled: 306.0000 (306.0000)  cardinality_error_3_unscaled: 306.0000 (306.0000)  cardinality_error_4_unscaled: 306.0000 (306.0000)  class_error_unscaled: 88.0208 (88.0208)  loss_bbox_unscaled: 1.4290 (1.4290)  loss_bbox_0_unscaled: 1.3698 (1.3698)  loss_bbox_1_unscaled: 1.3964 (1.3964)  loss_bbox_2_unscaled: 1.4096 (1.4096)  loss_bbox_3_unscaled: 1.4131 (1.4131)  loss_bbox_4_unscaled: 1.4198 (1.4198)  loss_ce_unscaled: 3.8196 (3.8196)  loss_ce_0_unscaled: 3.8680 (3.8680)  loss_ce_1_unscaled: 3.7678 (3.7678)  loss_ce_2_unscaled: 3.8110 (3.8110)  loss_ce_3_unscaled: 3.8432 (3.8432)  loss_ce_4_unscaled: 3.8701 (3.8701)  loss_dice_unscaled: 0.7227 (0.7227)  loss_giou_unscaled: 1.1844 (1.1844)  loss_giou_0_unscaled: 1.1531 (1.1531)  loss_giou_1_unscaled: 1.1658 (1.1658)  loss_giou_2_unscaled: 1.1734 (1.1734)  loss_giou_3_unscaled: 1.1705 (1.1705)  loss_giou_4_unscaled: 1.1759 (1.1759)  loss_mask_unscaled: 0.0581 (0.0581)  time: 178.1075  data: 138.0209  max mem: 3236
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete
(The same recv/send timeout traceback is printed by each of the remaining worker processes.)

Environment: PyTorch 1.8.0, CUDA 10.2. How can I solve this error? Do I need to adjust any parameters? Thanks in advance.
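For what it's worth, the 1800000 ms in the traceback is the default 30-minute timeout of the gloo process group, so the error fires when one rank waits more than 30 minutes for another rank's gradient exchange during `losses.backward()` (the log above already shows ~138 s of data-loading time in a ~178 s first step). One thing that may be worth trying is raising that timeout, and using the `nccl` backend where it is available, when the process group is created. Below is a minimal sketch only, assuming the group is set up directly with `torch.distributed.init_process_group` (in DETR-style codebases this usually happens in `util/misc.py`'s `init_distributed_mode`; the exact location and argument plumbing in VisTR may differ):

```python
# Not a confirmed fix -- a sketch of raising the collective timeout when the
# distributed process group is created.
import datetime
import os

import torch.distributed as dist

# Illustrative values; in practice these come from the launcher / environment.
dist_url = "env://"
world_size = int(os.environ.get("WORLD_SIZE", 8))
rank = int(os.environ.get("RANK", 0))

# NCCL is the usual backend for multi-GPU CUDA training on Linux; on Windows
# (as the paths above suggest) PyTorch 1.8 only provides gloo, so this falls back.
backend = "nccl" if dist.is_nccl_available() else "gloo"

dist.init_process_group(
    backend=backend,
    init_method=dist_url,
    world_size=world_size,
    rank=rank,
    # The default is 30 minutes (the 1800000 ms in the error); allow more
    # headroom for a slow first iteration.
    timeout=datetime.timedelta(hours=2),
)
```

A longer timeout only hides the symptom if one rank is genuinely stalled, so it may also help to check the data-loading path (e.g. `--num_workers` and disk throughput), since the first iteration above spends most of its time waiting on data.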
