Describe the bug
What the bug is and how to reproduce it, ideally with screenshots
Running DeepSeek-V3 LoRA fine-tuning on Ascend 910B2 with the swift v3.0.1 branch: the sharding strategy is zero3_offload, the weights were converted to bf16, across 8 machines with 64 NPUs.
The following warning is reported, and after hanging for a long time the process is automatically killed:
Train: 0%| | 0/2415 [00:00<?, ?it/s][W compiler_depend.ts:26] Warning: Warning: kernel [ArgSort] can not support dtype int32 or int64 on AiCore, Now this kernel is running on AiCpu.If you are more concerned about high-performance execution,please cast dtype to float32. (function operator())
[E compiler_depend.ts:421] [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 19] HCCL watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800278 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 17] HCCL watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362248, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
[E compiler_depend.ts:475] Some HCCL operations have failed or timed out. Due to the asynchronous nature of ASCEND kernels, subsequent NPU operations might run on corrupted/incomplete data.
[E compiler_depend.ts:481] To avoid data inconsistency, we are taking the entire process down.
[E compiler_depend.ts:805] [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 21] HCCL watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800318 milliseconds before timing out.
what(): [Rank 20] HCCL watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800962 milliseconds before timing out.
[E compiler_depend.ts:421] [Rank 54] Watchdog caught collective operation timeout: WorkHCCL(SeqNum=362249, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800066 milliseconds before timing out.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
[INFO:swift] The logging file will be saved in: /work/share/tyy/jxr/deepseek_output/v5-20250108-170417/logging.jsonl
[ERROR:modelscope] The request model: unknown does not exist!
[ERROR:modelscope] The request model: unknown does not exist!
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
[ERROR:modelscope] The request model: unknown does not exist!
[ERROR:modelscope] The request model: unknown does not exist!
[ERROR:modelscope] The request model: unknown does not exist!
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
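Note: the 1800000 ms (30 min) limit in the watchdog messages is the process-group timeout, which is separate from the HCCL_CONNECT_TIMEOUT/HCCL_EXEC_TIMEOUT environment variables already exported in the launch script under "Additional context". It matches the transformers TrainingArguments ddp_timeout default of 1800 seconds. A minimal sketch of raising it, assuming swift's CLI forwards --ddp_timeout to TrainingArguments (an unverified assumption):

# assumption: swift/cli/sft.py passes --ddp_timeout through to transformers'
# TrainingArguments, which feeds it (in seconds) to init_process_group
torchrun $DISTRIBUTED_ARGS ./ms-swift-3.0.1/swift/cli/sft.py \
    --ddp_timeout 7200 \
    ...   # remaining arguments as in the launch script below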
Your hardware and system info
Write your system info (CUDA version, OS, GPU model, torch version, etc.) here
Ubuntu 20.04
transformers 4.37.2
deepspeed 0.14.4
accelerate 0.28.0
Additional context
Launch script:
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export GPU_NUM_PER_NODE=8
output_dir="/deepseek_output/"
Algorithm_LOG="/ms-swift-3.0.1/deepseek.log"
torchrun $DISTRIBUTED_ARGS ./ms-swift-3.0.1/swift/cli/sft.py \
    --model .../weights/DeepSeek-V3-BF16-metadata/ \
    --model_type deepseek_v2_5 \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --train_type lora \
    --output_dir $output_dir \
    --deepspeed zero3_offload \
    2>&1 | tee $Algorithm_LOG
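For reproducibility, $DISTRIBUTED_ARGS is not defined in the snippet above; a typical multi-node torchrun definition for this 8-machine setup would look like the sketch below (all addresses, ports, and ranks are hypothetical placeholders, set per node):

# hypothetical values; NODE_RANK is 0..7 across the eight machines
export MASTER_ADDR=<rank0-host-ip>
export MASTER_PORT=29500
DISTRIBUTED_ARGS="--nproc_per_node $GPU_NUM_PER_NODE \
    --nnodes 8 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"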