-
Notifications
You must be signed in to change notification settings - Fork 257
Open
Description
Hi,
this is the error I'm getting trying to run h100.cu with external/ThunderKittens/kernels/attn/h100/h100_check.py
no matter old commits or latest, for sm_90a on H200 - I get the same error, tests pass for Grid(1,.. but fail for >1
Test 7/15: 32 heads, 384 seq
Parameters: b=1 heads=32 seq=384 head_dim=128 causal=False
Grid: (2, 32, 1)
✗ ERROR: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
that comes down to (this obtained from custom .py wrapper but error is same)
========= COMPUTE-SANITIZER
========= Unknown Error
========= at kittens::tma::expect_bytes(kittens::semaphore &, unsigned int)+0x1a0 in tma.cuh:32
========= by thread (384,0,0) in block (0,7,1)
========= Device Frame: void fwd_attend_ker<(int)64, (bool)0>(fwd_globals<T1>)+0x37e0 in h100.cu:160
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: void fwd_attend_ker<64, false>(fwd_globals<64>) [0x75151] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: attention_forward(at::Tensor, at::Tensor, at::Tensor, bool) in h100.cu:807 [0x63c9b] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > std::__invoke_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::__invoke_other, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:61 [0xac268] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: std::__invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::__invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:97 [0xa77c7] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: std::invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in functional:114 [0xa2096] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}::operator()(at::Tensor, at::Tensor, at::Tensor, bool) const in Exceptions.h:366 [0x9d04d] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && in cast.h:1631 [0xb4bd4] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: std::enable_if<!std::is_void<std::vector<at::Tensor, std::allocator<at::Tensor> > >::value, std::vector<at::Tensor, std::allocator<at::Tensor> > >::type pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call<std::vector<at::Tensor, std::allocator<at::Tensor> >, pybind11::detail::void_type, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&) && in cast.h:1600 [0xb1696] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const in pybind11.h:278 [0xac444] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) in pybind11.h:249 [0xac7b1] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) in pybind11.h:971 [0x9adda] in thunderkittens.cpython-313-x86_64-linux-gnu.so
========= Host Frame: cfunction_call in methodobject.c:539 [0x197a4c] in python3
========= Host Frame: _PyObject_MakeTpCall in call.c:242 [0x15af83] in python3
========= Host Frame: _PyEval_EvalFrameDefault in generated_cases.c.h:813 [0x172615] in python3
========= Host Frame: PyEval_EvalCode in ceval.c:604 [0x28e4d7] in python3
========= Host Frame: run_mod in pythonrun.c:1489 [0x2281f7] in python3
========= Host Frame: pyrun_file in pythonrun.c:1295 [0x2c366f] in python3
========= Host Frame: _PyRun_SimpleFileObject in pythonrun.c:517 [0x2c332a] in python3
========= Host Frame: _PyRun_AnyFileObject in pythonrun.c:77 [0x2c315a] in python3
========= Host Frame: Py_RunMain in main.c:775 [0x2c1766] in python3
========= Host Frame: Py_BytesMain in main.c:829 [0x281d78] in python3
========= Host Frame: [0x2a1c9] in libc.so.6
========= Host Frame: __libc_start_main [0x2a28a] in libc.so.6
========= Host Frame: _start [0x280fbd] in python3
========= Host Frame: h100_fwd_kernel_test in h100_check.py:122
========= Host Frame: check_correctness in h100_check.py:153
========= Host Frame: generate_error_graphs in h100_check.py:193
========= Host Frame: <module> in h100_check.py:243
with debug and line info it comes down to
tma::expect_bytes(...
I'm sure it somehow works for you but, can't make it work no matter changes to h100.cu are minimal - just warp:: namespace preffix added to missing symbols. Thanx
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels