Numerical overflow when fine-tuning the Ovis-9B model with GRPO #6152

@ShaochengShen

Description

I am currently running GRPO on the Ovis-9B model, and it fails right at the start of the run with the following error:
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0cdc16c446 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0cdc1166e4 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0cdc611a18 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1f92e (0x7f0cdc5d892e in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x20a57 (0x7f0cdc5d9a57 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x20c5f (0x7f0cdc5d9c5f in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x5faf70 (0x7f0cdaf9af70 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x6f69f (0x7f0cdc14d69f in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x40 (0x7f0cdc146100 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f0cdc1465b4 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: + 0x56109f8 (0x7f0ccb37d9f8 in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: THPVariable_set_data(THPVariable*, _object*, void*) + 0x8b (0x7f0cdb25e9bb in /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x228b07 (0x56361e3b2b07 in /root/miniconda3/envs/swift/bin/python3.10)
frame #13: PyObject_SetAttr + 0x836 (0x56361e35a4e6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #14: _PyEval_EvalFrameDefault + 0xd56 (0x56361e36e7c6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #15: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #16: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #17: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #18: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #19: _PyEval_EvalFrameDefault + 0x304 (0x56361e36dd74 in /root/miniconda3/envs/swift/bin/python3.10)
frame #20: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #21: PyObject_Call + 0xb8 (0x56361e2d3028 in /root/miniconda3/envs/swift/bin/python3.10)
frame #22: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #23: + 0x1c7a01 (0x56361e351a01 in /root/miniconda3/envs/swift/bin/python3.10)
frame #24: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #25: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #26: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #27: _PyFunction_Vectorcall + 0x9eb (0x56361e35094b in /root/miniconda3/envs/swift/bin/python3.10)
frame #28: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)
frame #29: + 0x1c7dec (0x56361e351dec in /root/miniconda3/envs/swift/bin/python3.10)
frame #30: + 0x1ab505 (0x56361e335505 in /root/miniconda3/envs/swift/bin/python3.10)
frame #31: _PyEval_EvalFrameDefault + 0x37f5 (0x56361e371265 in /root/miniconda3/envs/swift/bin/python3.10)
frame #32: + 0x26a30c (0x56361e3f430c in /root/miniconda3/envs/swift/bin/python3.10)
frame #33: + 0x26a589 (0x56361e3f4589 in /root/miniconda3/envs/swift/bin/python3.10)
frame #34: + 0x26c8a6 (0x56361e3f68a6 in /root/miniconda3/envs/swift/bin/python3.10)
frame #35: + 0x26caca (0x56361e3f6aca in /root/miniconda3/envs/swift/bin/python3.10)
frame #36: + 0x19c14d (0x56361e32614d in /root/miniconda3/envs/swift/bin/python3.10)
frame #37: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #38: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #39: + 0x1ab505 (0x56361e335505 in /root/miniconda3/envs/swift/bin/python3.10)
frame #40: _PyEval_EvalFrameDefault + 0x37f5 (0x56361e371265 in /root/miniconda3/envs/swift/bin/python3.10)
frame #41: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #42: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #43: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #44: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #45: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #46: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #47: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #48: _PyFunction_Vectorcall + 0x25d (0x56361e3501bd in /root/miniconda3/envs/swift/bin/python3.10)
frame #49: PyObject_Call + 0x1aa (0x56361e2d311a in /root/miniconda3/envs/swift/bin/python3.10)
frame #50: _PyEval_EvalFrameDefault + 0x2c0b (0x56361e37067b in /root/miniconda3/envs/swift/bin/python3.10)
frame #51: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #52: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #53: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #54: _PyEval_EvalFrameDefault + 0x49c9 (0x56361e372439 in /root/miniconda3/envs/swift/bin/python3.10)
frame #55: _PyFunction_Vectorcall + 0x798 (0x56361e3506f8 in /root/miniconda3/envs/swift/bin/python3.10)
frame #56: _PyEval_EvalFrameDefault + 0x60b (0x56361e36e07b in /root/miniconda3/envs/swift/bin/python3.10)
frame #57: + 0x1c73a5 (0x56361e3513a5 in /root/miniconda3/envs/swift/bin/python3.10)
frame #58: PyObject_Call + 0xb8 (0x56361e2d3028 in /root/miniconda3/envs/swift/bin/python3.10)
frame #59: + 0x272c38 (0x56361e3fcc38 in /root/miniconda3/envs/swift/bin/python3.10)
frame #60: _PyObject_MakeTpCall + 0x1ea (0x56361e2ccafa in /root/miniconda3/envs/swift/bin/python3.10)
frame #61: + 0x269e1a (0x56361e3f3e1a in /root/miniconda3/envs/swift/bin/python3.10)
frame #62: _PyEval_EvalFrameDefault + 0x125a (0x56361e36ecca in /root/miniconda3/envs/swift/bin/python3.10)

../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
terminate called after throwing an instance of 'c10::Error'
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

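The data itself should not be the problem: GRPO fine-tuning on Qwen2.5VL and GLM4.1 previously ran without this error.
The training script is as follows: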

[Image: screenshot of the training script]
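
To narrow down where the invalid values come from, the condition in the assertion message (a probability tensor containing `inf`, `nan`, or an element below 0) can be reproduced in isolation. The following is a minimal plain-Python sketch, not the actual sampling code from the trainer: the helper names `softmax` and `invalid_entries` are hypothetical, and the example logits are made up. It only illustrates how a single logit that has overflowed to `inf` turns the whole probability row into `nan`, which is the kind of input the assertion rejects.

```python
import math


def softmax(logits):
    """Max-subtracted softmax over a list of floats (hypothetical helper)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invalid_entries(probs):
    """Return (index, value) pairs that are inf, nan, or negative.

    This mirrors the condition named in the assertion message; it is an
    illustration, not the check PyTorch itself runs.
    """
    return [
        (i, p) for i, p in enumerate(probs)
        if not math.isfinite(p) or p < 0
    ]


# A well-behaved row of logits (made-up values).
healthy = softmax([2.0, 1.0, 0.5])

# The same row after one logit has overflowed to inf:
# inf - inf is nan, so every resulting probability becomes nan.
overflowed = softmax([2.0, float("inf"), 0.5])

print("healthy:", invalid_entries(healthy))
print("overflowed:", invalid_entries(overflowed))
```

Running the same kind of check on the real logits just before sampling, and rerunning with `CUDA_LAUNCH_BLOCKING=1` as the log suggests, should make the reported stack trace point closer to the operation that first produces the non-finite values.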
