[BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090 #6756

Open
MLS2021 opened this issue Nov 18, 2024 · 1 comment
Labels: bug (Something isn't working), training

Comments

MLS2021 commented Nov 18, 2024

I'm using DeepSpeed to fine-tune a large model. Because of limited GPU memory I was training with deepspeed_zero2, but I kept hitting OOM errors, so I switched to deepspeed_zero3. However, a new problem appeared:

[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800956 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800651 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[2024-11-18 11:07:07,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539827 closing signal SIGTERM
[2024-11-18 11:07:07,749] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539828 closing signal SIGTERM

[2024-11-18 11:07:13,274] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 3539829) of binary: /home/mls01/miniconda3/envs/omg-llava/bin/python
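
For context, the ZeRO-3 run uses an ordinary DeepSpeed stage-3 config; the sketch below shows its rough shape (batch sizes and the commented-out offload settings are illustrative, not my exact file):

```python
# Minimal ZeRO-3 config sketch -- illustrative values, not the exact file used.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # The *_offload variant additionally moves params/optimizer state to CPU:
        # "offload_param": {"device": "cpu", "pin_memory": True},
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# The dict is passed to deepspeed.initialize(..., config=ds_config)
# (or written out as a JSON file referenced by the launcher).
```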

I get the same problem with deepspeed_zero3_offload. The failure usually occurs during the model weight loading phase. Any replies are appreciated.
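
For reference, the 1800000 ms in the log is the default 30-minute collective timeout of torch.distributed. If the broadcast during weight loading is just slow (for example because of ZeRO-3 parameter partitioning plus CPU offload) rather than truly hung, raising the timeout at process-group creation at least separates the two cases. A minimal sketch, assuming the training script initializes torch.distributed itself rather than leaving it entirely to the launcher:

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the 30-minute default (sketch only;
# this applies where the process group is created, which may instead happen
# inside deepspeed.init_distributed() depending on the framework).
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```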

MLS2021 added the bug and training labels on Nov 18, 2024
MLS2021 commented Nov 18, 2024

To add: I have already disabled P2P communication.
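
For reference, disabling P2P is done through NCCL environment variables set before the process group is created; a sketch of that kind of setup, with NCCL_DEBUG added purely to get more diagnostic output (the NCCL_IB_DISABLE line is an assumption for a single-node box without InfiniBand):

```python
import os

# Must be set before torch.distributed / NCCL is initialized.
os.environ["NCCL_P2P_DISABLE"] = "1"  # RTX 4090 does not support GPU peer-to-peer transfers
os.environ["NCCL_IB_DISABLE"] = "1"   # assumption: single node, no InfiniBand fabric
os.environ["NCCL_DEBUG"] = "INFO"     # verbose NCCL logs to see where the broadcast stalls
```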
