[BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090 #6756

Open
MLS2021 opened this issue Nov 18, 2024 · 1 comment
Labels: bug (Something isn't working), training

Comments

MLS2021 commented Nov 18, 2024

I'm using DeepSpeed to fine-tune a large model. Because of limited GPU memory I was training with deepspeed_zero2, but I kept hitting OOM errors, so I switched to deepspeed_zero3. However, a new problem appeared:

[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800956 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800651 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[2024-11-18 11:07:07,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539827 closing signal SIGTERM
[2024-11-18 11:07:07,749] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539828 closing signal SIGTERM

[2024-11-18 11:07:13,274] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 3539829) of binary: /home/mls01/miniconda3/envs/omg-llava/bin/python
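
For context, the ZeRO-3 run uses an ordinary DeepSpeed stage-3 config; the sketch below shows its rough shape (batch sizes and the commented-out offload settings are illustrative, not my exact file):

```python
# Minimal ZeRO-3 config sketch -- illustrative values, not the exact file used.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # The *_offload variant additionally moves params/optimizer state to CPU:
        # "offload_param": {"device": "cpu", "pin_memory": True},
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# The dict is passed to deepspeed.initialize(..., config=ds_config)
# (or written out as a JSON file referenced by the launcher).
```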

I get the same problem with deepspeed_zero3_offload. The failure usually occurs during the model weight loading phase. Any replies are appreciated.
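
For reference, the 1800000 ms in the log is the default 30-minute collective timeout of torch.distributed. If the broadcast during weight loading is just slow (for example because of ZeRO-3 parameter partitioning plus CPU offload) rather than truly hung, raising the timeout at process-group creation at least separates the two cases. A minimal sketch, assuming the training script initializes torch.distributed itself rather than leaving it entirely to the launcher:

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the 30-minute default (sketch only;
# this applies where the process group is created, which may instead happen
# inside deepspeed.init_distributed() depending on the framework).
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```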

MLS2021 added the bug and training labels on Nov 18, 2024
MLS2021 commented Nov 18, 2024

To add: I have already disabled P2P communication.
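
For reference, disabling P2P is done through NCCL environment variables set before the process group is created; a sketch of that kind of setup, with NCCL_DEBUG added purely to get more diagnostic output (the NCCL_IB_DISABLE line is an assumption for a single-node box without InfiniBand):

```python
import os

# Must be set before torch.distributed / NCCL is initialized.
os.environ["NCCL_P2P_DISABLE"] = "1"  # RTX 4090 does not support GPU peer-to-peer transfers
os.environ["NCCL_IB_DISABLE"] = "1"   # assumption: single node, no InfiniBand fabric
os.environ["NCCL_DEBUG"] = "INFO"     # verbose NCCL logs to see where the broadcast stalls
```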
