Description
I am currently running GRPO on the Ovis-9B model, and the run fails right at the start with the following error:
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0cdc16c446 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0cdc1166e4 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0cdc611a18 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1f92e (0x7f0cdc5d892e in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x20a57 (0x7f0cdc5d9a57 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x20c5f (0x7f0cdc5d9c5f in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x5faf70 (0x7f0cdaf9af70 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x6f69f (0x7f0cdc14d69f in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x40 (0x7f0cdc146100 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f0cdc1465b4 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: + 0x56109f8 (0x7f0ccb37d9f8 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: THPVariable_set_data(THPVariable*, _object*, void*) + 0x8b (0x7f0cdb25e9bb in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x228b07 (0x56361e3b2b07 in /root/miniconda3/envs/swift/bin/python3.10)
frame #13: PyObject_SetAttr + 0x836 (0x56361e35a4e6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #14: _PyEval_EvalFrameDefault + 0xd56 (0x56361e36e7c6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #15: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #16: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #17: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #18: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #19: _PyEval_EvalFrameDefault + 0x304 (0x56361e36dd74 in /root/miniconda3/envs/swift/bin/python3.10)
frame #20: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #21: PyObject_Call + 0xb8 (0x56361e2d3028 in /root/miniconda3/envs/swift/bin/python3.10)
frame #22: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #23: + 0x1c7a01 (0x56361e351a01 in /root/miniconda3/envs/swift/bin/python3.10)
frame #24: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #25: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #26: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #27: _PyFunction_Vectorcall + 0x9eb (0x56361e35094b in /root/miniconda3/envs/swift/bin/python3.10)
frame #28: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #29: + 0x1c7dec (0x56361e351dec in /root/miniconda3/envs/swift/bin/python3.10)
frame #30: + 0x1ab505 (0x56361e335505 in /root/miniconda3/envs/swift/bin/python3.10)
frame #31: _PyEval_EvalFrameDefault + 0x37f5 (0x56361e371265 in /root/miniconda3/envs/swift/bin/python3.10)
frame #32: + 0x26a30c (0x56361e3f430c in /root/miniconda3/envs/swift/bin/python3.10)
frame #33: + 0x26a589 (0x56361e3f4589 in /root/miniconda3/envs/swift/bin/python3.10)
frame #34: + 0x26c8a6 (0x56361e3f68a6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #35: + 0x26caca (0x56361e3f6aca in /root/miniconda3/envs/swift/bin/python3.10)
frame #36: + 0x19c14d (0x56361e32614d in /root/miniconda3/envs/swift/bin/python3.10)
frame #37: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #38: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #39: + 0x1ab505 (0x56361e335505 in /root/miniconda3/envs/swift/bin/python3.10)
frame #40: _PyEval_EvalFrameDefault + 0x37f5 (0x56361e371265 in /root/miniconda3/envs/swift/bin/python3.10)
frame #41: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #42: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #43: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #44: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #45: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #46: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #47: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #48: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #49: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #50: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #51: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #52: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #53: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #54: _PyEval_EvalFrameDefault + 0x49c9 (0x56361e372439 in /root/miniconda3/envs/swift/bin/python3.10)
frame #55: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #56: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #57: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #58: PyObject_Call + 0xb8 (0x56361e2d3028 in /root/miniconda3/envs/swift/bin/python3.10)
frame #59: + 0x272c38 (0x56361e3fcc38 in /root/miniconda3/envs/swift/bin/python3.10)
frame #60: _PyObject_MakeTpCall + 0x1ea (0x56361e2ccafa in /root/miniconda3/envs/swift/bin/python3.10)
frame #61: + 0x269e1a (0x56361e3f3e1a in /root/miniconda3/envs/swift/bin/python3.10)
frame #62: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
terminate called after throwing an instance of 'c10::Error'
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The data itself should be fine: earlier GRPO fine-tuning runs on Qwen2.5VL and GLM4.1 did not raise any error.
The training script is as follows:
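For reference, the assertion in the log names three conditions on the probability tensor passed to sampling: `inf`, `nan`, or an element below 0. The snippet below is only a minimal plain-Python sketch of that same check, not code from the training framework; `count_invalid_probs` is a hypothetical helper name, and on real tensors the equivalent checks would presumably use `torch.isnan` and `torch.isinf` on the logits or probabilities just before sampling.

```python
import math


def count_invalid_probs(probs):
    """Count entries that violate the conditions named in the assertion.

    `probs` is any iterable of floats (hypothetical stand-in for one row
    of the probability tensor). Each entry is counted at most once.
    """
    counts = {"nan": 0, "inf": 0, "negative": 0}
    for p in probs:
        if math.isnan(p):
            counts["nan"] += 1
        elif math.isinf(p):
            # Covers both +inf and -inf.
            counts["inf"] += 1
        elif p < 0:
            counts["negative"] += 1
    return counts


if __name__ == "__main__":
    # Illustrative values only, not taken from the failing run.
    row = [0.2, float("nan"), float("inf"), -0.1, 0.7]
    print(count_invalid_probs(row))
```

If any of the three counts is non-zero for the values fed to sampling, that would be consistent with the assertion above; it does not by itself say where the invalid values come from.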
