[Refactor] Enhance `T.alloc_barrier` with new features and deprecate legacy mbarrier related intrinsics #1733

Rachmanino · 2026-01-26T13:53:22Z

Features

T.alloc_barrier creates and returns one or more mbarriers allocated in one smem buffer
Plan to deprecate T.create_list_of_barrier and T.get_mbarrier
Other related operators (e.g. T.mbarrier_wait_parity) only accept explicit mbarriers (e.g. mbars[k]) as inputs

Summary by CodeRabbit

New Features
- alloc_barrier now accepts an int or list to allocate multiple barrier slots with per-slot arrive counts and stores per-barrier metadata.
API Changes
- Barrier-oriented operations now accept buffer/handle or indexed barrier arrays; older list/get helpers removed or internalized; some mbar-related signatures widened.
Documentation
- Guides and inline docs updated to use per-barrier allocation/initialization terminology.
Tests
- Added/updated tests validating shared-barrier allocation, indexed usages, gated initialization, and emitted init calls.
Refactor
- Unified barrier handling and guarded initialization flow across transforms and backends.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

github-actions · 2026-01-26T13:53:32Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-01-26T13:53:47Z

📝 Walkthrough

Walkthrough

Barrier handling moved from list-based helpers to buffer-backed allocations: T.alloc_barrier records per-barrier arrival counts in a barrier_init block attribute; lowering consumes barrier_init to emit per-element ptx_init_barrier_thread_count calls, inserts a fence, and wraps the block with barrier Allocate; codegen prints barriers via generic expression printers.

Changes

Cohort / File(s)	Summary
API & Language `docs/programming_guides/instructions.md`, `tilelang/language/allocate.py`, `tilelang/language/builtin.py`, `tilelang/language/gemm_op.py`, `tilelang/tileop/gemm/gemm_base.py`	`T.alloc_barrier` now accepts `int
Lowering / Transform `src/transform/lower_shared_barrier.cc`, `src/transform/lower_hopper_intrin.cc`, `src/transform/common/mbarrier.h`	Added `kBarrierInit` attr and `CreateMBarrierBuffer` helper; lowering reads `barrier_init`, validates counts vs buffer length, emits per-element `ptx_init_barrier_thread_count`, emits `ptx_fence_barrier_init`, and wraps statements with `DeclBuffer`/`Allocate` for barrier storage when present.
Codegen `src/target/codegen_cuda.cc`, `src/target/codegen_cutedsl.cc`, `src/target/codegen_cuda.h`	Removed local `print_mbarrier_obj` helpers; unified barrier printing via `PrintExpr`/`PrintExpr_`; added `__align__(barrier_alignment_bytes_)` for shared barrier storage and a clarifying comment in the get_mbarrier path.
Intrinsics / Ops / Headers `src/op/builtin.h`, `src/transform/common/mbarrier.h`	Updated inline docs for barrier intrinsics/get_mbarrier; introduced `CreateMBarrierBuffer` and `injected_mbarrier_name_`.
Tests `testing/python/transform/test_tilelang_transform_lower_shared_barrier.py`, `testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py`, `testing/python/transform/test_tilelang_transform_lower_hopper_intrin.py`, `testing/python/transform/test_tilelang_transform_warp_specialized.py`	Added test for `lower_shared_barrier`; updated tests to use `mbars = T.alloc_barrier(...)` and index `mbars[...]`; adjusted assertions and expected emitted init/fence sequences.
Examples `examples/minference/...`, `examples/warp_specialize/...`, `examples/warp_specialize/example_warp_specialize_gemm_barrierpipe_stage2.py`	Replaced list-based barrier creation with `mbars = T.alloc_barrier(...)` and updated barrier references to indexed `mbars[...]`.

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant AllocAPI as Alloc API (T.alloc_barrier)
    participant BlockMeta as Block Metadata (barrier_init)
    participant Transform as lower_shared_barrier
    participant Codegen as CodeGen (CUDA/CUTEDSL)
    participant PTX as Emitted PTX

    User->>AllocAPI: call alloc_barrier([arrive_counts...])
    AllocAPI->>AllocAPI: normalize counts -> list
    AllocAPI->>AllocAPI: allocate barrier buffer (shared.barrier)
    AllocAPI->>BlockMeta: attach barrier_init (buffer.data -> IntImm counts)
    AllocAPI->>User: return barrier buffer

    Transform->>BlockMeta: read barrier_init attribute
    Transform->>Transform: validate counts == buffer length
    loop per barrier element
        Transform->>PTX: emit ptx_init_barrier_thread_count(buffer[i], arrive_count[i])
    end
    Transform->>PTX: emit ptx_fence_barrier_init()
    Codegen->>Codegen: render barrier buffer/index via PrintExpr
    Codegen->>PTX: output barrier calls/identifiers

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

[Refactor] Refactor barrier management #744: Related refactor of barrier management to buffer-backed allocations and API updates.
[Enhancement] Add missing fence_barrier_init primitive after mbarrier init #1121: Related addition/insertion of ptx_fence_barrier_init in lowering sequences.
[Example] Add vertical slash sparse attention pattern #762: Overlapping example updates replacing create_list_of_mbarrier with alloc_barrier and indexing.

Suggested labels

experimental

Suggested reviewers

LeiWang1999

Poem

🐰 I stack the counts in tidy rows,

Shared memory burrows hold the gates,
Per-barrier calls set threads to close,
A gentle fence ensures no straggling waits,
Hopping kernels sync — and off it goes.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main refactoring effort: enhancing T.alloc_barrier with new features while deprecating legacy mbarrier APIs.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/programming_guides/instructions.md (1)
103-109: Fix alloc_barrier syntax typo and clarify list form.

There’s a missing bracket and the signature should reflect the int-or-list form.
✏️ Proposed doc fix
-- `T.alloc_barrier([arrive_count)`: Allocate and initialize one or more mbarriers.
+- `T.alloc_barrier(arrive_count | [arrive_count])`: Allocate and initialize one or more mbarriers.

🤖 Fix all issues with AI agents

In `@testing/python/transform/test_tilelang_transform_lower_shared_barrier.py`:
- Around line 19-21: The local variable barriers from the T.alloc_barrier(mbars)
call is unused and triggers Ruff F841; silence it by either renaming the
variable to a conventional unused name (e.g., _barriers) where T.alloc_barrier
is called or append a per-line ignore (e.g., "# noqa: F841") to that allocation
line so the linter no longer flags the unused binding; update the allocation
site where barriers is created to apply one of these fixes.

In `@tilelang/language/builtin.py`:
- Around line 31-46: The docstring for _mbar_to_buffer_load incorrectly
annotates the parameter as Optional[Buffer | BufferLoad] while the function
signature declares mbar: Buffer | BufferLoad; update the docstring to match the
signature (remove "Optional" and use "mbar: Buffer | BufferLoad") or
alternatively change the function signature to accept Optional[Buffer |
BufferLoad] and handle None accordingly; adjust the Args and Returns text to
reference the same types and ensure the raised TypeError message in
_mbar_to_buffer_load remains accurate for the chosen option.

🧹 Nitpick comments (3)

testing/python/transform/test_tilelang_transform_lower_shared_barrier.py (1)
40-44: Prefer structural assertions over literal barrier counts.

The test currently asserts exact numeric literals in generated code. To make it robust, assert the structure (presence of barrier init calls/count) instead of specific values.
♻️ Proposed refactor
+import re
@@
 def test_lower_shared_barrier():
     mbars = (1, 1, 128, 128)  # list is unhashable so we use tuple here
     kernel = matmul(1024, 1024, 1024, 128, 128, 32, mbars=mbars)
-   
-    assert f"__shared__ uint64_t barriers_mem[{len(mbars)}];" in kernel.get_kernel_source()
-    assert "if (tl::tl_shuffle_elect<0>()) {" in kernel.get_kernel_source()
-    for i in range(len(mbars)):
-        assert f"barriers[{i}].init({mbars[i]});" in kernel.get_kernel_source()
-    print(kernel.get_kernel_source())
+    src = kernel.get_kernel_source()
+    assert re.search(r"__shared__ uint64_t .*_mem\[\d+\];", src)
+    assert "if (tl::tl_shuffle_elect<0>()) {" in src
+    assert len(re.findall(r"\.init\(", src)) == len(mbars)
+    print(src)
Based on learnings, prefer structural source checks over literal numeric values in transform tests.
tilelang/language/allocate.py (1)
152-177: Add input validation to alloc_barrier for better error messages.

While all current call sites use valid inputs, adding validation for empty lists and non-positive/non-integer entries would catch errors earlier with clearer messages, rather than failing cryptically in downstream codegen.
♻️ Proposed validation
     if isinstance(arrive_count, int):
         arrive_count = [arrive_count]
     else:
         arrive_count = list(arrive_count)
+    if not arrive_count:
+        raise ValueError("alloc_barrier requires at least one arrive_count")
+    if any((not isinstance(c, int)) or c <= 0 for c in arrive_count):
+        raise TypeError("alloc_barrier arrive_count entries must be positive ints")
src/transform/lower_shared_barrier.cc (1)
87-103: Defensive hardening: guard IntImmNode cast for clarity (optional).

The code assumes buffer->shape[0] is always a constant IntImm, which is guaranteed by the current API design (all barriers come from alloc_barrier(), which hardcodes shape as a Python list length). However, adding an explicit null check makes the intent clearer and prevents crashes if shape access patterns change in the future. The barrier_init annotation is reliably set by alloc_barrier() for all shared.barrier allocations, so coverage is complete.
♻️ Proposed defensive check
-      ICHECK(arrive_counts.size() == static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
+      const auto* size_imm = buffer->shape[0].as<IntImmNode>();
+      ICHECK(size_imm) << "Barrier buffer size must be a constant IntImm.";
+      ICHECK(arrive_counts.size() == static_cast<size_t>(size_imm->value))
           << "The number of arrive counts (" << arrive_counts.size() 
           << ") must match the barrier buffer size (" << buffer->shape[0] << ") for buffer " << buffer->name;

testing/python/transform/test_tilelang_transform_lower_shared_barrier.py

tilelang/language/builtin.py

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/transform/lower_shared_barrier.cc`:
- Around line 103-113: Add a precondition check before dereferencing
buffer->shape[0].as<IntImmNode>(): verify that buffer->shape is non-empty and
that buffer->shape[0].as<IntImmNode>() is non-null using ICHECK with a clear
message mentioning buffer->name (e.g., "barrier buffer shape must be a constant
IntImm for buffer <name>"). Then use the checked IntImmNode->value for the size
comparison with arrive_counts; keep existing barrier_init_map and arrive_counts
logic unchanged.

In `@tilelang/language/builtin.py`:
- Around line 41-45: The helper currently converts a tir.Buffer to
tir.BufferLoad by defaulting to index 0 (when mbar is a tir.Buffer), which
silently misroutes multi-element buffers; update the logic around mbar handling
(where you check isinstance(mbar, tir.BufferLoad) / isinstance(mbar,
tir.Buffer)) to detect the buffer's shape/length and only auto-create
tir.BufferLoad(mbar, [0]) when the buffer is statically size 1; otherwise raise
an explicit error (or assert) requiring the caller to pass an indexed BufferLoad
(e.g., mbars[i]) so multi-element buffers are not implicitly mapped to element
0. Ensure you reference mbar, tir.Buffer, and tir.BufferLoad in the check and
error message.

♻️ Duplicate comments (1)

tilelang/language/builtin.py (1)

31-39: Docstring still marks mbar as Optional.
The signature is non-optional; please align docstring with the actual type.

🧹 Nitpick comments (1)

tilelang/language/allocate.py (1)

152-176: Validate arrive_count values before emitting barrier_init.
Right now any list content (including 0/negative or non-int) will be encoded into IntImm and can produce invalid mbarrier init or confusing errors later. Add basic validation up front.

♻️ Suggested guardrails

 def alloc_barrier(arrive_count: int | list[int]):
@@
     if isinstance(arrive_count, int):
         arrive_count = [arrive_count]
     else:
         arrive_count = list(arrive_count)
+    if not arrive_count:
+        raise ValueError("alloc_barrier expects at least one barrier.")
+    for c in arrive_count:
+        if not isinstance(c, int):
+            raise TypeError(f"arrive_count entries must be int, got {type(c)}.")
+        if c <= 0:
+            raise ValueError(f"arrive_count must be > 0, got {c}.")
     buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")

coderabbitai · 2026-01-26T14:20:26Z

src/transform/lower_shared_barrier.cc

    for (auto buffer : barrier_buffers) {
      auto data = buffer->data;
-      auto old_buffer = buffer_data_to_buffer_.at(data);
-      auto new_buffer = buffer_remap_.at(old_buffer);
-      auto count = old_buffer->shape[0];
-
-      auto call =
-          Call(DataType::Handle(), builtin::ptx_init_barrier_thread_count(),
-               {BufferLoad(new_buffer, {0}), PrimExpr(count)});
-      init_mbarrier_calls_.push_back(Evaluate(call));
+      ICHECK(barrier_init_map.count(data))
+          << "Barrier buffer " << buffer->name
+          << " not found in barrier_init annotation";
+      auto arrive_counts = barrier_init_map.at(data);
+      ICHECK(arrive_counts.size() ==
+             static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
+          << "The number of arrive counts (" << arrive_counts.size()
+          << ") must match the barrier buffer size (" << buffer->shape[0]
+          << ") for buffer " << buffer->name;


⚠️ Potential issue | 🟠 Major

Guard against non-constant barrier buffer sizes (null deref).
buffer->shape[0].as<IntImmNode>()->value will crash if the shape isn’t an IntImm. Add an ICHECK and a clear error message before dereferencing.

🐛 Safer size check

auto arrive_counts = barrier_init_map.at(data); - ICHECK(arrive_counts.size() == - static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value)) + auto* shape0 = buffer->shape[0].as<IntImmNode>(); + ICHECK(shape0) << "Barrier buffer size must be a constant IntImm."; + ICHECK(arrive_counts.size() == static_cast<size_t>(shape0->value)) << "The number of arrive counts (" << arrive_counts.size()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for (auto buffer : barrier_buffers) {

auto data = buffer->data;

auto old_buffer = buffer_data_to_buffer_.at(data);

auto new_buffer = buffer_remap_.at(old_buffer);

auto count = old_buffer->shape[0];

auto call =

Call(DataType::Handle(), builtin::ptx_init_barrier_thread_count(),

{BufferLoad(new_buffer, {0}), PrimExpr(count)});

init_mbarrier_calls_.push_back(Evaluate(call));

ICHECK(barrier_init_map.count(data))

<< "Barrier buffer " << buffer->name

<< " not found in barrier_init annotation";

auto arrive_counts = barrier_init_map.at(data);

ICHECK(arrive_counts.size() ==

static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))

<< "The number of arrive counts (" << arrive_counts.size()

<< ") must match the barrier buffer size (" << buffer->shape[0]

<< ") for buffer " << buffer->name;

for (auto buffer : barrier_buffers) {

auto data = buffer->data;

ICHECK(barrier_init_map.count(data))

<< "Barrier buffer " << buffer->name

<< " not found in barrier_init annotation";

auto arrive_counts = barrier_init_map.at(data);

auto* shape0 = buffer->shape[0].as<IntImmNode>();

ICHECK(shape0) << "Barrier buffer size must be a constant IntImm.";

ICHECK(arrive_counts.size() == static_cast<size_t>(shape0->value))

<< "The number of arrive counts (" << arrive_counts.size()

<< ") must match the barrier buffer size (" << buffer->shape[0]

<< ") for buffer " << buffer->name;

}

🤖 Prompt for AI Agents

In `@src/transform/lower_shared_barrier.cc` around lines 103 - 113, Add a precondition check before dereferencing buffer->shape[0].as<IntImmNode>(): verify that buffer->shape is non-empty and that buffer->shape[0].as<IntImmNode>() is non-null using ICHECK with a clear message mentioning buffer->name (e.g., "barrier buffer shape must be a constant IntImm for buffer <name>"). Then use the checked IntImmNode->value for the size comparison with arrive_counts; keep existing barrier_init_map and arrive_counts logic unchanged.

tilelang/language/builtin.py

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1)
137-139: Second test not executed when running directly.

The __main__ block only calls test_inject_set_max_nreg() but not test_inject_set_max_nreg_no_set_max_nreg(). Consider adding the second test call or uncommenting tilelang.testing.main() to run all tests.
🔧 Proposed fix
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_inject_set_max_nreg()
+    tilelang.testing.main()
Or if manual calls are preferred:
 if __name__ == "__main__":
     # tilelang.testing.main()
     test_inject_set_max_nreg()
+    test_inject_set_max_nreg_no_set_max_nreg()

🧹 Nitpick comments (2)

docs/programming_guides/instructions.md (1)
157-160: Add a short note about reserved barrier IDs for ID-based helpers.
Since the list still mentions T.barrier_wait(id, ...), consider warning users that IDs 1–2 are reserved for internal use when they rely on ID-based helpers. Based on learnings, consider adding a brief note here.
💡 Suggested doc tweak
- - Barriers: `T.alloc_barrier(arrive_count)`.
- - Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
- - Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
+ - Barriers: `T.alloc_barrier(arrive_count)`.
+ - Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
+ - Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
+   Note: IDs 1 and 2 are reserved for internal barriers; for ID-based helpers, start from 3.
src/transform/common/mbarrier.h (1)

30-31: Resolve TODO: avoid injected mbarrier name collisions.
Using a fixed "mbarrier" name can clash with user-defined buffers; consider routing this through a unique prefix or a name-supply helper. Happy to help wire this up if you want.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tilelang/language/gemm_op.py (1)

48-55: Remove unnecessary legalize_arguments call on mbar or update its type annotation.

The mbar parameter has type tir.Buffer | tir.BufferLoad | None (line 32), which explicitly excludes tir.Var. However, line 55 calls legalize_arguments(mbar), a function designed to extract .buffer from let-bound variables. This is inconsistent with how A, B, and C are handled—those parameters accept tir.Var and justify the legalize_arguments call, but mbar does not.

If mbar should support let-bound variables, update its type annotation to match A, B, C: tir.Buffer | tir.BufferLoad | tir.Var | None. Otherwise, remove the legalize_arguments call since it serves no purpose and may obscure intent. The subsequent to_buffer_region(mbar) call at line 102 already has its own let-binding handling, so the early extraction via legalize_arguments is redundant in either case.

🤖 Fix all issues with AI agents

In `@testing/python/transform/test_tilelang_transform_warp_specialized.py`:
- Line 79: The test is assigning reserved barrier IDs (0-2) to user barriers;
locate the tl.get_mbarrier calls in test_tilelang_transform_warp_specialized.py
(the same pattern as the working example at tl.get_mbarrier(...) near the other
call) and change the expression k % 3 to k % 3 + 3 so user-defined barriers use
IDs starting at kFirstUsedBarrier (3); update both tl.get_mbarrier usages that
currently compute k % 3 to use k % 3 + 3.

In `@tilelang/language/gemm_op.py`:
- Line 207: The mbar property in gemm_base.py should be updated to return
"tir.Buffer | tir.BufferLoad | None" instead of only "tir.Buffer"; change the
type annotation on the mbar property getter to that union, ensure tir.BufferLoad
is imported/available in that module, and keep the existing
getattr(self.gemm_node, "mbar", None) usage so None is correctly accepted;
update any related docstring/type comments on the mbar property to match the new
annotation.

🧹 Nitpick comments (3)

testing/python/transform/test_tilelang_transform_lower_hopper_intrin.py (2)

36-37: Unused barrier variable in expected output.

The mbarrier variable is allocated but never referenced in the after function. It has # noqa: F841 to suppress the linter warning, but this may indicate that the expected transformed output is incomplete or should show the barrier buffer being used for subsequent operations.

Is this intentional for this specific test case, or should the expected output demonstrate barrier usage?

22-23: Disabled structural equality assertion reduces test coverage.

The tvm.ir.assert_structural_equal check is commented out with a TODO. This means the test currently only verifies that the transform runs without error, not that the output matches the expected after function.

Consider tracking this TODO with an issue or re-enabling the check once the underlying issue is resolved.

Would you like me to open an issue to track re-enabling this structural equality check?

testing/python/transform/test_tilelang_transform_warp_specialized.py (1)

21-22: Disabled structural equality assertion.

Similar to test_tilelang_transform_lower_hopper_intrin.py, the structural equality check is disabled. This reduces test effectiveness to only verifying that transforms execute without errors.

testing/python/transform/test_tilelang_transform_warp_specialized.py

tilelang/language/gemm_op.py

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@tilelang/language/allocate.py`:
- Around line 151-176: The alloc_barrier function must validate arrive_count
before allocating: ensure the normalized arrive_count list is non-empty and that
every element is an integer > 0 (reject negatives, zeros, or non-int values),
raising a clear ValueError when the checks fail; perform these validations
before creating buffer and before building arrive_count_exprs, then proceed to
convert items to IntImm and call block_attr as before (refer to symbols
alloc_barrier, arrive_count, arrive_count_exprs, buffer, block_attr).

In `@tilelang/language/builtin.py`:
- Around line 798-800: The cp_async_barrier_noinc function fails to normalize
its input like the other barrier helpers; update cp_async_barrier_noinc to call
_mbar_to_buffer_load on the incoming barrier before passing it to
tir.call_intrin so the intrinsic always receives a BufferLoad (mirror how
mbarrier_wait_parity, mbarrier_arrive, and mbarrier_expect_tx do it); locate
cp_async_barrier_noinc and replace the direct use of barrier with the result of
_mbar_to_buffer_load(barrier) when calling tir.call_intrin.

coderabbitai · 2026-01-27T08:04:51Z

tilelang/language/allocate.py

+def alloc_barrier(arrive_count: int | list[int]):
    """Allocate a barrier buffer.

    Args:
-        arrive_count (int): The number of threads that need to arrive at the barrier
+        arrive_count (int | list[int]): The number of threads that need to arrive at each barrier

    Returns:
        T.Buffer: A TVM buffer object allocated as a barrier
+
+    Examples
+    --------
+    >>> mbar = alloc_barrier(128)  # allocate a barrier with arrive count 128
+    >>> mbars = alloc_barrier([128] * n)  # allocate n barriers with the same arrive count 128
    """
-    return T.alloc_buffer([arrive_count], _dtypes.uint64, scope="shared.barrier")
+    # Normalize to list
+    if isinstance(arrive_count, int):
+        arrive_count = [arrive_count]
+    else:
+        arrive_count = list(arrive_count)
+    buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")
+    # Convert to TIR IntImm expressions for C++ pass to consume as Map<Var, Array<PrimExpr>>
+    # Use buffer.data as key to support multiple barrier buffer allocations
+    arrive_count_exprs = [IntImm("int32", c) for c in arrive_count]
+    block_attr({"barrier_init": {buffer.data: arrive_count_exprs}})
+
+    return buffer


⚠️ Potential issue | 🟡 Minor

Missing validation for edge cases in alloc_barrier.

The function doesn't validate that the arrive count values are positive or that the list is non-empty. Passing an empty list would create a zero-length buffer, and negative arrive counts would silently propagate invalid values.

🛡️ Proposed validation

def alloc_barrier(arrive_count: int | list[int]): ... # Normalize to list if isinstance(arrive_count, int): arrive_count = [arrive_count] else: arrive_count = list(arrive_count) + if not arrive_count: + raise ValueError("alloc_barrier requires at least one barrier (arrive_count cannot be empty)") + if any(c <= 0 for c in arrive_count): + raise ValueError("arrive_count values must be positive integers") buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")

🤖 Prompt for AI Agents

In `@tilelang/language/allocate.py` around lines 151 - 176, The alloc_barrier function must validate arrive_count before allocating: ensure the normalized arrive_count list is non-empty and that every element is an integer > 0 (reject negatives, zeros, or non-int values), raising a clear ValueError when the checks fail; perform these validations before creating buffer and before building arrive_count_exprs, then proceed to convert items to IntImm and call block_attr as before (refer to symbols alloc_barrier, arrive_count, arrive_count_exprs, buffer, block_attr).

tilelang/language/builtin.py

- Updated the `T.alloc_barrier` function to accept an array of arrive counts instead of individual thread counts. - Modified the `get_mbarrier` function documentation to clarify its purpose and usage. - Enhanced the `CreateMBarrierBuffer` function to create mbarrier buffers with shared storage scope. - Updated tests to reflect changes in mbarrier allocation and usage. - Adjusted comments in various files to improve clarity regarding mbarrier functionality.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/transform/lower_hopper_intrin.cc`:
- Around line 156-164: The pass LowerHopperIntrin is creating shared.barrier
allocations (via CreateMBarrierBuffer and mbarrier_buffer_ using
injected_mbarrier_name_ and init_mbarrier_calls_) without backend gating, which
causes codegen_cutedsl.cc to hit ICHECK(false) for "shared.barrier"; either gate
the LowerHopperIntrin application to CUDA-only at the Python pass registration
(so the pass only runs for CUDA targets) or modify CuTeDSL codegen to accept
"shared.barrier" allocations (add handling for the allocation scope in
codegen_cutedsl.cc instead of ICHECK(false)); locate the LowerHopperIntrin
invocation/registration in tilelang/transform/__init__.py and the
CreateMBarrierBuffer/mbarrier_buffer_ allocation block in lower_hopper_intrin.cc
to implement the gating, or update codegen_cutedsl.cc to implement
shared.barrier lowering.

🧹 Nitpick comments (1)

src/transform/common/mbarrier.h (1)

23-31: Avoid hard-coded injected barrier names.
injected_mbarrier_name_ = "mbarrier" can collide with user-defined buffers; consider generating a unique prefix/suffix (or exposing a helper to do so). Also consider explicitly including <string> or using constexpr std::string_view to avoid transitive-include reliance.

src/transform/lower_hopper_intrin.cc

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/minference/example_vertical_slash_sparse_attn.py (1)

138-179: Avoid reserved barrier IDs 1 and 2 in user code.

This example uses mbars[1] and mbars[2] (via indices 0..8). In Tilelang’s CUDA/CuTeDSL backends, barrier IDs 1 and 2 are reserved for internal use, which can cause synchronization conflicts. Please offset user barrier indices to start at 3 (e.g., allocate extra barriers and add a base offset).

Based on learnings, user-defined barriers should start from ID 3 to avoid internal conflicts.

🤖 Fix all issues with AI agents

In `@tilelang/language/gemm_op.py`:
- Around line 98-105: The current code converts mbar to a BufferRegion via
to_buffer_region (mbar = to_buffer_region(...)) which yields a BufferRegion that
cannot be used in tir.call_intrin; instead call the helper _mbar_to_buffer_load
to produce a BufferLoad PrimExpr (when mbar is not None) and assign that to
mbar, and keep the else branch producing tir.const(0, T.uint32); update the
place where tir.call_intrin is invoked to pass this BufferLoad (same variable
mbar) just like buffer_region_to_tile_region is used for A/B/C and similar to
mbarrier_wait_parity’s handling. Ensure you reference and import
_mbar_to_buffer_load if needed.

🧹 Nitpick comments (2)

src/transform/lower_hopper_intrin.cc (1)

156-164: Consider unique naming for injected barrier buffer.

Using a fixed "mbarrier" name risks collisions with user-defined buffers (not just shared.barrier). Consider minting a unique name or checking for name conflicts in the PrimFunc buffer map before injecting.
tilelang/language/builtin.py (1)
43-45: Shape check only validates dimensionality, not single-element constraint.

The assertion len(mbar.shape) == 1 verifies the buffer is 1D but doesn't check if shape[0] == 1. A past review suggested guarding against multi-element buffers being silently converted to barrier[0]. The current check allows a multi-element 1D buffer to pass through, potentially masking user errors.

Consider enhancing the validation:
🛡️ Suggested enhancement
     elif isinstance(mbar, tir.Buffer):
-        assert len(mbar.shape) == 1, f"mbarrier must be a single element buffer, but got {mbar.shape}"
+        if len(mbar.shape) != 1:
+            raise TypeError(f"mbarrier Buffer must be 1D, but got shape {mbar.shape}")
+        shape0 = mbar.shape[0]
+        if isinstance(shape0, tir.IntImm) and int(shape0) != 1:
+            raise TypeError(f"mbarrier Buffer has {int(shape0)} elements; use mbars[idx] for multi-element buffers")
         return tir.BufferLoad(mbar, [0])

coderabbitai · 2026-01-27T09:30:32Z

tilelang/language/gemm_op.py

+    if mbar is not None:
+        assert isinstance(mbar, (tir.Buffer, tir.BufferLoad)), (
+            f"mbar for tcgen5mma must be a tir.Buffer or tir.BufferLoad, but got {type(mbar)}"
+        )
+        mbar = to_buffer_region(mbar, access_type="rw")
+    else:
+        mbar = tir.const(0, T.uint32)
    C_coords = [r.min for r in C_region.region]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the gemm_op.py file and understand to_buffer_region fd gemm_op.py -type f

Repository: tile-ai/tilelang

Length of output: 230

🏁 Script executed:

# Search for to_buffer_region definition and usage rg -n "def to_buffer_region" -A 10

Repository: tile-ai/tilelang

Length of output: 853

🏁 Script executed:

# Search for how mbar is used in gemm_op.py context rg -n "call_intrin" -g '*.py' -C 3

Repository: tile-ai/tilelang

Length of output: 50372

🏁 Script executed:

# Search for mbar usage patterns in the codebase rg -n "mbar" -g '*.py' -C 2

Repository: tile-ai/tilelang

Length of output: 50372

🏁 Script executed:

cat tilelang/utils/language.py | sed -n '208,280p'

Repository: tile-ai/tilelang

Length of output: 3327

🏁 Script executed:

# Check if BufferRegion is a valid PrimExpr or has conversion rg -n "class BufferRegion" tilelang/ --type py

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Check reduce_op.py line 56 to see if similar pattern works sed -n '50,90p' tilelang/language/reduce_op.py

Repository: tile-ai/tilelang

Length of output: 1574

🏁 Script executed:

# Check if BufferRegion or similar patterns cause issues in practice rg -n "BufferRegion" tilelang/language/gemm_op.py -A 2 -B 2

Repository: tile-ai/tilelang

Length of output: 607

🏁 Script executed:

# Look at how the mbar is actually used in the call_intrin at line 110-131 sed -n '106,135p' tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 775

🏁 Script executed:

# Check what buffer_region_to_tile_region does rg -n "def buffer_region_to_tile_region" tilelang/ -A 15

Repository: tile-ai/tilelang

Length of output: 1755

🏁 Script executed:

# Check the region function signature sed -n '7,12p' tilelang/language/utils.py

Repository: tile-ai/tilelang

Length of output: 343

🏁 Script executed:

# Check how mbar is supposed to be normalized - search for gemm implementation details rg -n "mbar" tilelang/tileop/gemm/gemm_base.py -A 2 -B 2 | head -40

Repository: tile-ai/tilelang

Length of output: 361

🏁 Script executed:

# Check how mbar is used in tcgen05 - does it expect BufferLoad or BufferRegion? sed -n '93,130p' tilelang/tileop/gemm/gemm_tcgen05.py

Repository: tile-ai/tilelang

Length of output: 1559

🏁 Script executed:

# Check if there are tests for gemm with mbar fd test.*gemm tilelang/ --type f | head -5 | xargs grep -l "mbar" 2>/dev/null | head -3

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Check what the actual gemm operation expects for mbar by looking at the op definition rg -n "tl.tileop.gemm" tilelang/ --type py -A 5 -B 5 | head -60

Repository: tile-ai/tilelang

Length of output: 3003

🏁 Script executed:

# Check if there's documentation on gemm op signature rg -n "tl.gemm" tilelang/language/gemm_op.py -C 3

Repository: tile-ai/tilelang

Length of output: 470

🏁 Script executed:

# Check examples that use mbar to see if they work or if there are issues grep -r "gemm.*mbar" examples/ tilelang/ --include="*.py" | head -5

Repository: tile-ai/tilelang

Length of output: 408

🏁 Script executed:

# Check the actual test for gemm_tcgen5mma to see if it passes with current code find . -name "*test*gemm*" -type f | grep -E "\.(py|cc)$" | head -5

Repository: tile-ai/tilelang

Length of output: 325

🏁 Script executed:

# Check if examples that use mbar actually run without errors cat examples/gemm_sm100/gemm_tcgen5mma.py | head -50

Repository: tile-ai/tilelang

Length of output: 1699

🏁 Script executed:

# Check what type call_intrin actually accepts - look at the C++ or Python side rg -n "def call_intrin" tilelang/language/tir/op.py -A 25

Repository: tile-ai/tilelang

Length of output: 838

🏁 Script executed:

# Check if there's a helper already in the codebase for converting mbar rg -n "_mbar_to_buffer_load" tilelang/ --type py -A 15 | head -40

Repository: tile-ai/tilelang

Length of output: 2597

🏁 Script executed:

# Verify the current state - does the code at line 102 actually cause issues? # Check if there are any error reports or if it's tested grep -r "to_buffer_region(mbar" tilelang/ --include="*.py"

Repository: tile-ai/tilelang

Length of output: 144

🏁 Script executed:

# Verify what the actual expected type is by checking how mbar flows through the system # Look at the intrinsic handling code rg -n "tl.tileop.gemm" tilelang/language/tir/op.py -C 5

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Check if there are any existing tests that would catch this find . -path ./build -prune -o -name "test*.py" -type f -exec grep -l "gemm.*mbar" {} \; | head -3

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Check if BufferRegion is actually accepted as a valid argument type by looking at type hints rg -n "call_intrin.*handle" tilelang/language/gemm_op.py -A 20 | head -60

Repository: tile-ai/tilelang

Length of output: 779

🏁 Script executed:

# Search for how similar operations in other places convert their arguments rg -n "buffer_region_to_tile_region" tilelang/language/ --type py | grep -v ".pyc"

Repository: tile-ai/tilelang

Length of output: 1179

🏁 Script executed:

# Final verification: check what the actual expected type is for mbar in the intrinsic # by looking at where it's documented or how it's used downstream rg -n "mbar" tilelang/intrinsics/tcgen05_macro_generator.py -C 3 | head -50

Repository: tile-ai/tilelang

Length of output: 1109

🏁 Script executed:

# Check line 55 more carefully - what type is mbar after legalize_arguments? sed -n '39,60p' tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 828

🏁 Script executed:

# Check if there's a transformation pass that converts BufferRegion to something else rg -n "BufferRegion" tilelang/transform/ --type py | head -20

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Final check: does the current code actually work? Look for any known issues or TODOs rg -n "TODO|FIXME|BUG|HACK" tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 42

Use _mbar_to_buffer_load() to convert mbar to a valid BufferLoad for tir.call_intrin.

Line 102 converts mbar to a BufferRegion via to_buffer_region(mbar, access_type="rw"), but BufferRegion is not a PrimExpr and cannot be passed directly to tir.call_intrin. Unlike A, B, C which are converted to tl.region calls via buffer_region_to_tile_region(), mbar is passed raw to the intrinsic at line 129.

Use the existing _mbar_to_buffer_load() helper (already in tilelang/language/builtin.py) to ensure mbar is a valid BufferLoad:

if mbar is not None: assert isinstance(mbar, (tir.Buffer, tir.BufferLoad)), ( f"mbar for tcgen5mma must be a tir.Buffer or tir.BufferLoad, but got {type(mbar)}" ) + from tilelang.language.builtin import _mbar_to_buffer_load + mbar = _mbar_to_buffer_load(mbar) else: mbar = tir.const(0, T.uint32)

This aligns with how mbarrier_wait_parity() and other barrier functions handle mbar conversion.

🤖 Prompt for AI Agents

In `@tilelang/language/gemm_op.py` around lines 98 - 105, The current code converts mbar to a BufferRegion via to_buffer_region (mbar = to_buffer_region(...)) which yields a BufferRegion that cannot be used in tir.call_intrin; instead call the helper _mbar_to_buffer_load to produce a BufferLoad PrimExpr (when mbar is not None) and assign that to mbar, and keep the else branch producing tir.const(0, T.uint32); update the place where tir.call_intrin is invoked to pass this BufferLoad (same variable mbar) just like buffer_region_to_tile_region is used for A/B/C and similar to mbarrier_wait_parity’s handling. Ensure you reference and import _mbar_to_buffer_load if needed.

Rachmanino · 2026-01-27T14:23:44Z

src/target/codegen_cutedsl.cc

      stream << ", " << constant_size << "), (" << constant_size << ",))\n";
    } else if (scope == "shared.barrier") {
-      ICHECK(false) << "Unsupported scope: " << scope;
+      stream << vid << " = tl.alloc_smem(cutlass.Uint64, size_in_elems="


plz review this, since im not very familiar with cutedsl details though ci passed

It would be better to have a simple mbarrier codegen test for this modification.

LeiWang1999

LGTM, Left some comments.

LeiWang1999 · 2026-01-28T03:48:32Z

testing/python/transform/test_tilelang_transform_warp_specialized.py

        B_shared = T.decl_buffer((3, 1, 4, 512), T.float16, scope="shared.dyn")
        C_local = T.decl_buffer((32,), scope="local")
-        T.create_list_of_mbarrier(128, 128, 128, 128, 128, 128)
+        T.call_intrin("handle", tir.op.Op.get("tl.create_list_of_mbarrier"), 128, 128, 128, 128, 128, 128)


why can‘t we keep the T.create_list_of_mbarrier(128, 128, 128, 128, 128, 128)

LeiWang1999 · 2026-01-28T03:48:39Z

testing/python/transform/test_tilelang_transform_warp_specialized.py

                    T.tma_load(
                        T.create_tma_descriptor(6, 2, A.data, 512, 512, 2, 1024, 32, 64, 1, 1, 0, 2, 2, 0),
-                        T.get_mbarrier(k % 3),
+                        T.call_intrin("handle", tir.op.Op.get("tl.get_mbarrier"), k % 3),


LeiWang1999 · 2026-01-28T03:50:50Z

src/target/codegen_cutedsl.cc

      stream << ", " << constant_size << "), (" << constant_size << ",))\n";
    } else if (scope == "shared.barrier") {
-      ICHECK(false) << "Unsupported scope: " << scope;
+      stream << vid << " = tl.alloc_smem(cutlass.Uint64, size_in_elems="


It would be better to have a simple mbarrier codegen test for this modification.

Rachmanino changed the title ~~[Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics~~ [WIP][Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics Jan 26, 2026

Rachmanino requested a review from LeiWang1999 January 26, 2026 13:56

coderabbitai bot reviewed Jan 26, 2026

View reviewed changes

testing/python/transform/test_tilelang_transform_lower_shared_barrier.py Show resolved Hide resolved

tilelang/language/builtin.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Jan 26, 2026

View reviewed changes

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

Rachmanino force-pushed the mbar-rft branch from 55960cd to 7275571 Compare January 27, 2026 06:59

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

testing/python/transform/test_tilelang_transform_warp_specialized.py Show resolved Hide resolved

tilelang/language/gemm_op.py Show resolved Hide resolved

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

Rachmanino added 9 commits January 27, 2026 16:57

enhance T.alloc_barrier and let T.wait/arrive only accept buffers

6c0f6e7

lint

d1fffa5

lint

2871eb4

fix test

bceda19

tcgen5 check input

e14cc29

lint

bc9d192

fix name conflict

551f7b8

add name check

e4bd948

Rachmanino force-pushed the mbar-rft branch from dbc988a to e4bd948 Compare January 27, 2026 09:01

apply ai suggestions

3e5254c

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

src/transform/lower_hopper_intrin.cc Show resolved Hide resolved

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

fix shared mem disalignment and missing fence for mbarrier init

8bb51e5

Rachmanino changed the title ~~[WIP][Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics~~ [Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics Jan 27, 2026

fix cutedsl codegen

7308bce

Rachmanino commented Jan 27, 2026

View reviewed changes

bugfix

6591a54

LeiWang1999 reviewed Jan 28, 2026

View reviewed changes

[Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics #1733

Are you sure you want to change the base?

[Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics #1733

Conversation

Rachmanino commented Jan 26, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Features

Summary by CodeRabbit

Uh oh!

github-actions bot commented Jan 26, 2026

Uh oh!

coderabbitai bot commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Rachmanino Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 left a comment

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

[Refactor] Enhance `T.alloc_barrier` with new features and deprecate legacy mbarrier related intrinsics #1733

[Refactor] Enhance `T.alloc_barrier` with new features and deprecate legacy mbarrier related intrinsics #1733

Rachmanino commented Jan 26, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 26, 2026 •

edited

Loading

Rachmanino Jan 27, 2026 •

edited

Loading