
deepseek-v3 multi-node training hangs #2888

Open
elimsjxr opened this issue Jan 9, 2025 · 2 comments
elimsjxr commented Jan 9, 2025

Describe the bug
Running DeepSeek-V3 LoRA fine-tuning on Ascend 910B2 with the swift v3.0.1 branch, sharding strategy zero3_offload, weights converted to bf16, 8 nodes × 64 NPUs.
The following warnings are printed, the job hangs for a long time, and the process is eventually killed automatically:
Train: 0%| | 0/2415 [00:00<?, ?it/s][W compiler_depend.ts:26] Warning: Warning: kernel [ArgSort] can not support dtype int32 or int64 on AiCore, Now this kernel is running on AiCpu.If you are more concerned about high-performance execution,please cast dtype to float32. (function operator())
[E compiler_depend.ts:421] [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
what(): [Rank 20] HCCL watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800962 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 54] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800066 milliseconds before timing out.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
(the FutureWarning above is repeated once per process; duplicates omitted)
[INFO:swift] The logging file will be saved in: /work/share/tyy/jxr/deepseek_output/v5-20250108-170417/logging.jsonl
[ERROR:modelscope] The request model: unknown does not exist!
(the modelscope error above is repeated once per process; duplicates omitted)
Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
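As an aside, the [ArgSort] warning at the top of the log is a performance note, not the cause of the hang; following its own suggestion, an integer sort can be kept on AiCore by casting to float32 first. A minimal sketch (the tensor here is hypothetical; the actual argsort happens somewhere inside the training stack and is not shown in the log):

import torch

# Hypothetical example: the [ArgSort] warning says int32/int64 sorts fall
# back to AiCpu on Ascend. Casting to float32 keeps the sort on AiCore.
# Note: float32 only represents integers exactly up to 2**24.
indices = torch.randint(0, 1000, (4096,))      # int64 tensor (assumed)
order = indices.to(torch.float32).argsort()    # sort runs as float32
sorted_indices = indices[order]                # gather back in int64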
[two screenshots attached]
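Likewise, the repeated accelerate FutureWarning spells out its own migration: construct a DataLoaderConfiguration and hand it to Accelerator. The Accelerator here is created inside transformers/swift rather than user code, so this is only a sketch of what the suggested call looks like:

from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# The replacement the FutureWarning asks for: pass dispatch_batches /
# split_batches via DataLoaderConfiguration instead of as direct kwargs.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
accelerator = Accelerator(dataloader_config=dataloader_config)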

Your hardware and system info

ubuntu20.04
transformers 4.37.2
deepspeed 0.14.4
accelerate 0.28.0

Additional context
Launch script:
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export GPU_NUM_PER_NODE=8
output_dir="/deepseek_output/"
Algorithm_LOG="/ms-swift-3.0.1/deepseek.log"

# DISTRIBUTED_ARGS is assumed to be defined elsewhere with the usual torchrun
# flags, e.g. (values below are hypothetical):
# DISTRIBUTED_ARGS="--nproc_per_node $GPU_NUM_PER_NODE --nnodes 8 --node_rank $NODE_RANK \
#                   --master_addr $MASTER_ADDR --master_port 29500"
torchrun $DISTRIBUTED_ARGS ./ms-swift-3.0.1/swift/cli/sft.py \
    --model .../weights/DeepSeek-V3-BF16-metadata/ \
    --model_type deepseek_v2_5 \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --train_type lora \
    --output_dir $output_dir \
    --deepspeed zero3_offload \
    2>&1 | tee $Algorithm_LOG
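Worth noting: the script sets HCCL_CONNECT_TIMEOUT and HCCL_EXEC_TIMEOUT to 7200 s, yet the watchdog log still reports Timeout(ms)=1800000, which looks like the default 30-minute process-group timeout rather than the HCCL env values. If the collectives are merely slow (e.g. due to zero3_offload on a model this size) rather than deadlocked, raising the process-group timeout may avoid the kill. A minimal sketch, assuming the process-group initialization can be patched (the 4-hour value is illustrative):

from datetime import timedelta

import torch.distributed as dist

# Illustrative only: raise the process-group watchdog timeout from the
# 30-minute default (the 1800000 ms seen in the log) to 4 hours.
# "hccl" is the backend torch_npu registers for Ascend NPUs.
dist.init_process_group(backend="hccl", timeout=timedelta(hours=4))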

elimsjxr (Author) commented Jan 9, 2025

@DarkLight1337

What does this have to do with vLLM?
