Skip to content

pipeline/barrier error on sm90a #164

@vhnatyk

Description

@vhnatyk

Hi,

this is the error I'm getting trying to run h100.cu with external/ThunderKittens/kernels/attn/h100/h100_check.py
no matter old commits or latest, for sm_90a on H200 - I get the same error, tests pass for Grid(1,.. but fail for >1

Test 7/15: 32 heads, 384 seq
  Parameters: b=1 heads=32 seq=384 head_dim=128 causal=False
  Grid: (2, 32, 1)
  ✗ ERROR: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

that comes down to (this obtained from custom .py wrapper but error is same)

========= COMPUTE-SANITIZER
========= Unknown Error
=========     at kittens::tma::expect_bytes(kittens::semaphore &, unsigned int)+0x1a0 in tma.cuh:32
=========     by thread (384,0,0) in block (0,7,1)
=========         Device Frame: void fwd_attend_ker<(int)64, (bool)0>(fwd_globals<T1>)+0x37e0 in h100.cu:160
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: void fwd_attend_ker<64, false>(fwd_globals<64>) [0x75151] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: attention_forward(at::Tensor, at::Tensor, at::Tensor, bool) in h100.cu:807 [0x63c9b] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > std::__invoke_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::__invoke_other, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:61 [0xac268] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::__invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::__invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:97 [0xa77c7] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in functional:114 [0xa2096] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}::operator()(at::Tensor, at::Tensor, at::Tensor, bool) const in Exceptions.h:366 [0x9d04d] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && in cast.h:1631 [0xb4bd4] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::enable_if<!std::is_void<std::vector<at::Tensor, std::allocator<at::Tensor> > >::value, std::vector<at::Tensor, std::allocator<at::Tensor> > >::type pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call<std::vector<at::Tensor, std::allocator<at::Tensor> >, pybind11::detail::void_type, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&) && in cast.h:1600 [0xb1696] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const in pybind11.h:278 [0xac444] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) in pybind11.h:249 [0xac7b1] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) in pybind11.h:971 [0x9adda] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: cfunction_call in methodobject.c:539 [0x197a4c] in python3
=========         Host Frame: _PyObject_MakeTpCall in call.c:242 [0x15af83] in python3
=========         Host Frame: _PyEval_EvalFrameDefault in generated_cases.c.h:813 [0x172615] in python3
=========         Host Frame: PyEval_EvalCode in ceval.c:604 [0x28e4d7] in python3
=========         Host Frame: run_mod in pythonrun.c:1489 [0x2281f7] in python3
=========         Host Frame: pyrun_file in pythonrun.c:1295 [0x2c366f] in python3
=========         Host Frame: _PyRun_SimpleFileObject in pythonrun.c:517 [0x2c332a] in python3
=========         Host Frame: _PyRun_AnyFileObject in pythonrun.c:77 [0x2c315a] in python3
=========         Host Frame: Py_RunMain in main.c:775 [0x2c1766] in python3
=========         Host Frame: Py_BytesMain in main.c:829 [0x281d78] in python3
=========         Host Frame:  [0x2a1c9] in libc.so.6
=========         Host Frame: __libc_start_main [0x2a28a] in libc.so.6
=========         Host Frame: _start [0x280fbd] in python3
=========         Host Frame: h100_fwd_kernel_test in h100_check.py:122
=========         Host Frame: check_correctness in h100_check.py:153
=========         Host Frame: generate_error_graphs in h100_check.py:193
=========         Host Frame: <module> in h100_check.py:243

with debug and line info it comes down to
tma::expect_bytes(...

I'm sure it somehow works for you but, can't make it work no matter changes to h100.cu are minimal - just warp:: namespace preffix added to missing symbols. Thanx

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions