pipeline/barrier error on sm90a

Hi, 

this is the error I'm getting trying to run `h100.cu` with [ external/ThunderKittens/kernels/attn/h100/h100_check.py](https://github.com/HazyResearch/ThunderKittens/blob/main/kernels/attn/h100/h100_check.py)
no matter old commits or latest, for `sm_90a` on H200 - I get the same error, tests pass for` Grid(1,..` but fail for >1
```
Test 7/15: 32 heads, 384 seq
  Parameters: b=1 heads=32 seq=384 head_dim=128 causal=False
  Grid: (2, 32, 1)
  ✗ ERROR: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
that comes down to (this obtained from custom .py wrapper but error is same)
```
========= COMPUTE-SANITIZER
========= Unknown Error
=========     at kittens::tma::expect_bytes(kittens::semaphore &, unsigned int)+0x1a0 in tma.cuh:32
=========     by thread (384,0,0) in block (0,7,1)
=========         Device Frame: void fwd_attend_ker<(int)64, (bool)0>(fwd_globals<T1>)+0x37e0 in h100.cu:160
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: void fwd_attend_ker<64, false>(fwd_globals<64>) [0x75151] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: attention_forward(at::Tensor, at::Tensor, at::Tensor, bool) in h100.cu:807 [0x63c9b] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > std::__invoke_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::__invoke_other, std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:61 [0xac268] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::__invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::__invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in invoke.h:97 [0xa77c7] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::invoke_result<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>::type std::invoke<std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor, at::Tensor, at::Tensor, bool>(std::vector<at::Tensor, std::allocator<at::Tensor> > (* const&)(at::Tensor, at::Tensor, at::Tensor, bool), at::Tensor&&, at::Tensor&&, at::Tensor&&, bool&&) in functional:114 [0xa2096] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}::operator()(at::Tensor, at::Tensor, at::Tensor, bool) const in Exceptions.h:366 [0x9d04d] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::vector<at::Tensor, std::allocator<at::Tensor> > pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && in cast.h:1631 [0xb4bd4] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: std::enable_if<!std::is_void<std::vector<at::Tensor, std::allocator<at::Tensor> > >::value, std::vector<at::Tensor, std::allocator<at::Tensor> > >::type pybind11::detail::argument_loader<at::Tensor, at::Tensor, at::Tensor, bool>::call<std::vector<at::Tensor, std::allocator<at::Tensor> >, pybind11::detail::void_type, torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&>(torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}&) && in cast.h:1600 [0xb1696] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const in pybind11.h:278 [0xac444] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::initialize<torch::detail::wrap_pybind_function_impl_<std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), 0ul, 1ul, 2ul, 3ul, false>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, std::integral_constant<bool, false>)::{lambda(at::Tensor, at::Tensor, at::Tensor, bool)#1}, std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor, at::Tensor, at::Tensor, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [192]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (&)(at::Tensor, at::Tensor, at::Tensor, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, at::Tensor, at::Tensor, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [192])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) in pybind11.h:249 [0xac7b1] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) in pybind11.h:971 [0x9adda] in thunderkittens.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: cfunction_call in methodobject.c:539 [0x197a4c] in python3
=========         Host Frame: _PyObject_MakeTpCall in call.c:242 [0x15af83] in python3
=========         Host Frame: _PyEval_EvalFrameDefault in generated_cases.c.h:813 [0x172615] in python3
=========         Host Frame: PyEval_EvalCode in ceval.c:604 [0x28e4d7] in python3
=========         Host Frame: run_mod in pythonrun.c:1489 [0x2281f7] in python3
=========         Host Frame: pyrun_file in pythonrun.c:1295 [0x2c366f] in python3
=========         Host Frame: _PyRun_SimpleFileObject in pythonrun.c:517 [0x2c332a] in python3
=========         Host Frame: _PyRun_AnyFileObject in pythonrun.c:77 [0x2c315a] in python3
=========         Host Frame: Py_RunMain in main.c:775 [0x2c1766] in python3
=========         Host Frame: Py_BytesMain in main.c:829 [0x281d78] in python3
=========         Host Frame:  [0x2a1c9] in libc.so.6
=========         Host Frame: __libc_start_main [0x2a28a] in libc.so.6
=========         Host Frame: _start [0x280fbd] in python3
=========         Host Frame: h100_fwd_kernel_test in h100_check.py:122
=========         Host Frame: check_correctness in h100_check.py:153
=========         Host Frame: generate_error_graphs in h100_check.py:193
=========         Host Frame: <module> in h100_check.py:243
```

with debug and line info it comes down to
[tma::expect_bytes(...](https://github.com/HazyResearch/ThunderKittens/blob/50ee1f0aac02768c859b4b5f0a1cf84b311a922d/kernels/attn/h100/h100.cu#L118)

I'm sure it somehow works for you but, can't make it work no matter changes to` h100.cu` are minimal - just `warp:: `namespace preffix added to missing symbols. Thanx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pipeline/barrier error on sm90a #164

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

pipeline/barrier error on sm90a #164

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions