When running the fine-tuning code on an A100, I get torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
#178
Open
1 of 2 tasks
2279072142 opened this issue
Aug 1, 2024
· 0 comments
System Info
Error message: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
CUDA version: 12.1
Who can help?
No response
Information
Reproduction
Run deepspeed peft_lora.py --ds_config ds_config.yaml
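The error message suggests rerunning with NCCL_DEBUG=INFO to get more detail. A minimal sketch of doing that from Python, assuming the same command as in the reproduction step; the helper names (`build_debug_env`, `build_command`, `rerun_with_nccl_debug`) are hypothetical and not part of this repository:

```python
import os
import shlex
import subprocess


def build_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO is the setting named in the error message itself.
    env["NCCL_DEBUG"] = "INFO"
    return env


def build_command():
    """Return the reproduction command from this issue as an argument list."""
    return shlex.split("deepspeed peft_lora.py --ds_config ds_config.yaml")


def rerun_with_nccl_debug():
    """Rerun the failing command with NCCL debug output; returns the exit code."""
    # Assumption: the deepspeed launcher is on PATH and the working directory
    # contains peft_lora.py and ds_config.yaml.
    result = subprocess.run(build_command(), env=build_debug_env(), check=False)
    return result.returncode
```

Calling `rerun_with_nccl_debug()` from the directory that contains peft_lora.py should print the NCCL log lines that precede the ncclSystemError, which may help narrow down the failing system call.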
Expected behavior
The fine-tuning code should run normally.