Hi, I am training the model on 8 P100 GPUs (16 GB each) and hit the error below (partial log):
number of params: 75833633
loading annotations into memory... Done (t=20.13s)
creating index... index created!
Start training
Epoch: [0]  [0/7731]  eta: 15 days, 22:29:09  lr: 0.000100  class_error: 88.02  loss: 79.9960 (79.9960)
loss_bbox: 7.1450 (7.1450)  loss_bbox_0: 6.8492 (6.8492)  loss_bbox_1: 6.9822 (6.9822)  loss_bbox_2: 7.0482 (7.0482)  loss_bbox_3: 7.0655 (7.0655)  loss_bbox_4: 7.0992 (7.0992)
loss_ce: 3.8196 (3.8196)  loss_ce_0: 3.8680 (3.8680)  loss_ce_1: 3.7678 (3.7678)  loss_ce_2: 3.8110 (3.8110)  loss_ce_3: 3.8432 (3.8432)  loss_ce_4: 3.8701 (3.8701)
loss_dice: 0.7227 (0.7227)  loss_giou: 2.3688 (2.3688)  loss_giou_0: 2.3062 (2.3062)  loss_giou_1: 2.3316 (2.3316)  loss_giou_2: 2.3469 (2.3469)  loss_giou_3: 2.3410 (2.3410)  loss_giou_4: 2.3517 (2.3517)  loss_mask: 0.0581 (0.0581)
cardinality_error_unscaled: 306.0000 (306.0000)  cardinality_error_0_unscaled: 306.0000 (306.0000)  cardinality_error_1_unscaled: 306.0000 (306.0000)  cardinality_error_2_unscaled: 306.0000 (306.0000)  cardinality_error_3_unscaled: 306.0000 (306.0000)  cardinality_error_4_unscaled: 306.0000 (306.0000)
class_error_unscaled: 88.0208 (88.0208)
loss_bbox_unscaled: 1.4290 (1.4290)  loss_bbox_0_unscaled: 1.3698 (1.3698)  loss_bbox_1_unscaled: 1.3964 (1.3964)  loss_bbox_2_unscaled: 1.4096 (1.4096)  loss_bbox_3_unscaled: 1.4131 (1.4131)  loss_bbox_4_unscaled: 1.4198 (1.4198)
loss_ce_unscaled: 3.8196 (3.8196)  loss_ce_0_unscaled: 3.8680 (3.8680)  loss_ce_1_unscaled: 3.7678 (3.7678)  loss_ce_2_unscaled: 3.8110 (3.8110)  loss_ce_3_unscaled: 3.8432 (3.8432)  loss_ce_4_unscaled: 3.8701 (3.8701)
loss_dice_unscaled: 0.7227 (0.7227)  loss_giou_unscaled: 1.1844 (1.1844)  loss_giou_0_unscaled: 1.1531 (1.1531)  loss_giou_1_unscaled: 1.1658 (1.1658)  loss_giou_2_unscaled: 1.1734 (1.1734)  loss_giou_3_unscaled: 1.1705 (1.1705)  loss_giou_4_unscaled: 1.1759 (1.1759)  loss_mask_unscaled: 0.0581 (0.0581)
time: 178.1075  data: 138.0209  max mem: 3236

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete

(The same traceback is printed by the other processes, ending alternately in the send variant:)

RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete
Environment: PyTorch 1.8.0, CUDA 10.2. How can I solve this error? Do I need to adjust any parameters? Thanks in advance.
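For context, the 1800000 ms in the error is Gloo's default 30-minute collective timeout, and the first logged iteration spends 138 of its 178 seconds in data loading, so the ranks may be idling long enough during `backward()` to trip that timeout. A minimal sketch of one common workaround, raising the process-group timeout, is shown below; it assumes training is initialized through `torch.distributed.init_process_group`, and the backend, init method, and two-hour value are illustrative placeholders, not a confirmed fix from the VisTR authors.

```python
import datetime

import torch.distributed as dist

# Raise Gloo's default 30-minute collective timeout so that one rank
# stalling in the data loader does not abort the whole job.
# backend/init_method are placeholders; keep whatever main.py already uses.
dist.init_process_group(
    backend="gloo",          # PyTorch 1.8 on Windows only supports Gloo for DDP
    init_method="env://",
    timeout=datetime.timedelta(hours=2),  # default is timedelta(minutes=30) = 1800000 ms
)
```

On Linux, switching to `backend="nccl"` is usually the better option for multi-GPU training, but the `E:\lyx\...` paths suggest a Windows machine, where NCCL is not available in PyTorch 1.8; there, raising the timeout and shrinking the data-loading bottleneck (e.g. more DataLoader workers or faster storage) are the main levers.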