
deepseek-v3 multi-node training hangs #2888

Open
elimsjxr opened this issue Jan 9, 2025 · 2 comments
elimsjxr commented Jan 9, 2025

Describe the bug
Running DeepSeek-V3 LoRA fine-tuning on Ascend 910B2 with the swift v3.0.1 branch, sharding strategy zero3_offload, weights converted to bf16, 8 nodes × 64 NPUs.
The following warnings are printed, the job hangs for a long time, and the process is eventually killed automatically:
Train: 0%| | 0/2415 [00:00<?, ?it/s][W compiler_depend.ts:26] Warning: Warning: kernel [ArgSort] can not support dtype int32 or int64 on AiCore, Now this kernel is running on AiCpu.If you are more concerned about high-performance execution,please cast dtype to float32. (function operator())
[E compiler_depend.ts:421] [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
what(): [Rank 20] HCCL watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800962 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 54] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800066 milliseconds before timing out.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
(the FutureWarning above is repeated once per process; duplicates omitted)
[INFO:swift] The logging file will be saved in: /work/share/tyy/jxr/deepseek_output/v5-20250108-170417/logging.jsonl
[ERROR:modelscope] The request model: unknown does not exist!
(the modelscope error above is repeated once per process; duplicates omitted)
Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
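As an aside, the [ArgSort] warning at the top of the log is a performance note, not the cause of the hang; following its own suggestion, an integer sort can be kept on AiCore by casting to float32 first. A minimal sketch (the tensor here is hypothetical; the actual argsort happens somewhere inside the training stack and is not shown in the log):

import torch

# Hypothetical example: the [ArgSort] warning says int32/int64 sorts fall
# back to AiCpu on Ascend. Casting to float32 keeps the sort on AiCore.
# Note: float32 only represents integers exactly up to 2**24.
indices = torch.randint(0, 1000, (4096,))      # int64 tensor (assumed)
order = indices.to(torch.float32).argsort()    # sort runs as float32
sorted_indices = indices[order]                # gather back in int64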
[two screenshots attached]
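Likewise, the repeated accelerate FutureWarning spells out its own migration: construct a DataLoaderConfiguration and hand it to Accelerator. The Accelerator here is created inside transformers/swift rather than user code, so this is only a sketch of what the suggested call looks like:

from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# The replacement the FutureWarning asks for: pass dispatch_batches /
# split_batches via DataLoaderConfiguration instead of as direct kwargs.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
accelerator = Accelerator(dataloader_config=dataloader_config)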

Your hardware and system info

ubuntu20.04
transformers 4.37.2
deepspeed 0.14.4
accelerate 0.28.0

Additional context
Launch script:
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export GPU_NUM_PER_NODE=8
output_dir="/deepseek_output/"
Algorithm_LOG="/ms-swift-3.0.1/deepseek.log"

# DISTRIBUTED_ARGS is assumed to be defined elsewhere with the usual torchrun
# flags, e.g. (values below are hypothetical):
# DISTRIBUTED_ARGS="--nproc_per_node $GPU_NUM_PER_NODE --nnodes 8 --node_rank $NODE_RANK \
#                   --master_addr $MASTER_ADDR --master_port 29500"
torchrun $DISTRIBUTED_ARGS ./ms-swift-3.0.1/swift/cli/sft.py \
    --model .../weights/DeepSeek-V3-BF16-metadata/ \
    --model_type deepseek_v2_5 \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --train_type lora \
    --output_dir $output_dir \
    --deepspeed zero3_offload \
    2>&1 | tee $Algorithm_LOG
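Worth noting: the script sets HCCL_CONNECT_TIMEOUT and HCCL_EXEC_TIMEOUT to 7200 s, yet the watchdog log still reports Timeout(ms)=1800000, which looks like the default 30-minute process-group timeout rather than the HCCL env values. If the collectives are merely slow (e.g. due to zero3_offload on a model this size) rather than deadlocked, raising the process-group timeout may avoid the kill. A minimal sketch, assuming the process-group initialization can be patched (the 4-hour value is illustrative):

from datetime import timedelta

import torch.distributed as dist

# Illustrative only: raise the process-group watchdog timeout from the
# 30-minute default (the 1800000 ms seen in the log) to 4 hours.
# "hccl" is the backend torch_npu registers for Ascend NPUs.
dist.init_process_group(backend="hccl", timeout=timedelta(hours=4))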

elimsjxr (Author) commented Jan 9, 2025

@DarkLight1337

What does this have to do with vLLM?
