Skip to content

Conversation

@Rachmanino
Copy link
Collaborator

@Rachmanino Rachmanino commented Jan 26, 2026

Features

  • T.alloc_barrier creates and returns one or more mbarriers allocated in one smem buffer
  • Plan to deprecate T.create_list_of_barrier and T.get_mbarrier
  • Other related operators (e.g. T.mbarrier_wait_parity) only accept explicit mbarriers (e.g. mbars[k]) as inputs

Summary by CodeRabbit

  • New Features

    • alloc_barrier now accepts an int or list to allocate multiple barrier slots with per-slot arrive counts and stores per-barrier metadata.
  • API Changes

    • Barrier-oriented operations now accept buffer/handle or indexed barrier arrays; older list/get helpers removed or internalized; some mbar-related signatures widened.
  • Documentation

    • Guides and inline docs updated to use per-barrier allocation/initialization terminology.
  • Tests

    • Added/updated tests validating shared-barrier allocation, indexed usages, gated initialization, and emitted init calls.
  • Refactor

    • Unified barrier handling and guarded initialization flow across transforms and backends.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Walkthrough

Barrier handling moved from list-based helpers to buffer-backed allocations: T.alloc_barrier records per-barrier arrival counts in a barrier_init block attribute; lowering consumes barrier_init to emit per-element ptx_init_barrier_thread_count calls, inserts a fence, and wraps the block with barrier Allocate; codegen prints barriers via generic expression printers.

Changes

Cohort / File(s) Summary
API & Language
docs/programming_guides/instructions.md, tilelang/language/allocate.py, tilelang/language/builtin.py, tilelang/language/gemm_op.py, tilelang/tileop/gemm/gemm_base.py
T.alloc_barrier now accepts `int
Lowering / Transform
src/transform/lower_shared_barrier.cc, src/transform/lower_hopper_intrin.cc, src/transform/common/mbarrier.h
Added kBarrierInit attr and CreateMBarrierBuffer helper; lowering reads barrier_init, validates counts vs buffer length, emits per-element ptx_init_barrier_thread_count, emits ptx_fence_barrier_init, and wraps statements with DeclBuffer/Allocate for barrier storage when present.
Codegen
src/target/codegen_cuda.cc, src/target/codegen_cutedsl.cc, src/target/codegen_cuda.h
Removed local print_mbarrier_obj helpers; unified barrier printing via PrintExpr/PrintExpr_; added __align__(barrier_alignment_bytes_) for shared barrier storage and a clarifying comment in the get_mbarrier path.
Intrinsics / Ops / Headers
src/op/builtin.h, src/transform/common/mbarrier.h
Updated inline docs for barrier intrinsics/get_mbarrier; introduced CreateMBarrierBuffer and injected_mbarrier_name_.
Tests
testing/python/transform/test_tilelang_transform_lower_shared_barrier.py, testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py, testing/python/transform/test_tilelang_transform_lower_hopper_intrin.py, testing/python/transform/test_tilelang_transform_warp_specialized.py
Added test for lower_shared_barrier; updated tests to use mbars = T.alloc_barrier(...) and index mbars[...]; adjusted assertions and expected emitted init/fence sequences.
Examples
examples/minference/..., examples/warp_specialize/..., examples/warp_specialize/example_warp_specialize_gemm_barrierpipe_stage2.py
Replaced list-based barrier creation with mbars = T.alloc_barrier(...) and updated barrier references to indexed mbars[...].

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant AllocAPI as Alloc API (T.alloc_barrier)
    participant BlockMeta as Block Metadata (barrier_init)
    participant Transform as lower_shared_barrier
    participant Codegen as CodeGen (CUDA/CUTEDSL)
    participant PTX as Emitted PTX

    User->>AllocAPI: call alloc_barrier([arrive_counts...])
    AllocAPI->>AllocAPI: normalize counts -> list
    AllocAPI->>AllocAPI: allocate barrier buffer (shared.barrier)
    AllocAPI->>BlockMeta: attach barrier_init (buffer.data -> IntImm counts)
    AllocAPI->>User: return barrier buffer

    Transform->>BlockMeta: read barrier_init attribute
    Transform->>Transform: validate counts == buffer length
    loop per barrier element
        Transform->>PTX: emit ptx_init_barrier_thread_count(buffer[i], arrive_count[i])
    end
    Transform->>PTX: emit ptx_fence_barrier_init()
    Codegen->>Codegen: render barrier buffer/index via PrintExpr
    Codegen->>PTX: output barrier calls/identifiers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

Suggested labels

experimental

Suggested reviewers

  • LeiWang1999

Poem

🐰 I stack the counts in tidy rows,

Shared memory burrows hold the gates,
Per-barrier calls set threads to close,
A gentle fence ensures no straggling waits,
Hopping kernels sync — and off it goes.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main refactoring effort: enhancing T.alloc_barrier with new features while deprecating legacy mbarrier APIs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Rachmanino Rachmanino changed the title [Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics [WIP][Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics Jan 26, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/programming_guides/instructions.md (1)

103-109: Fix alloc_barrier syntax typo and clarify list form.

There’s a missing bracket and the signature should reflect the int-or-list form.

✏️ Proposed doc fix
-- `T.alloc_barrier([arrive_count)`: Allocate and initialize one or more mbarriers.
+- `T.alloc_barrier(arrive_count | [arrive_count])`: Allocate and initialize one or more mbarriers.
🤖 Fix all issues with AI agents
In `@testing/python/transform/test_tilelang_transform_lower_shared_barrier.py`:
- Around line 19-21: The local variable barriers from the T.alloc_barrier(mbars)
call is unused and triggers Ruff F841; silence it by either renaming the
variable to a conventional unused name (e.g., _barriers) where T.alloc_barrier
is called or append a per-line ignore (e.g., "# noqa: F841") to that allocation
line so the linter no longer flags the unused binding; update the allocation
site where barriers is created to apply one of these fixes.

In `@tilelang/language/builtin.py`:
- Around line 31-46: The docstring for _mbar_to_buffer_load incorrectly
annotates the parameter as Optional[Buffer | BufferLoad] while the function
signature declares mbar: Buffer | BufferLoad; update the docstring to match the
signature (remove "Optional" and use "mbar: Buffer | BufferLoad") or
alternatively change the function signature to accept Optional[Buffer |
BufferLoad] and handle None accordingly; adjust the Args and Returns text to
reference the same types and ensure the raised TypeError message in
_mbar_to_buffer_load remains accurate for the chosen option.
🧹 Nitpick comments (3)
testing/python/transform/test_tilelang_transform_lower_shared_barrier.py (1)

40-44: Prefer structural assertions over literal barrier counts.

The test currently asserts exact numeric literals in generated code. To make it robust, assert the structure (presence of barrier init calls/count) instead of specific values.

♻️ Proposed refactor
+import re
@@
 def test_lower_shared_barrier():
     mbars = (1, 1, 128, 128)  # list is unhashable so we use tuple here
     kernel = matmul(1024, 1024, 1024, 128, 128, 32, mbars=mbars)
-   
-    assert f"__shared__ uint64_t barriers_mem[{len(mbars)}];" in kernel.get_kernel_source()
-    assert "if (tl::tl_shuffle_elect<0>()) {" in kernel.get_kernel_source()
-    for i in range(len(mbars)):
-        assert f"barriers[{i}].init({mbars[i]});" in kernel.get_kernel_source()
-    print(kernel.get_kernel_source())
+    src = kernel.get_kernel_source()
+    assert re.search(r"__shared__ uint64_t .*_mem\[\d+\];", src)
+    assert "if (tl::tl_shuffle_elect<0>()) {" in src
+    assert len(re.findall(r"\.init\(", src)) == len(mbars)
+    print(src)
Based on learnings, prefer structural source checks over literal numeric values in transform tests.
tilelang/language/allocate.py (1)

152-177: Add input validation to alloc_barrier for better error messages.

While all current call sites use valid inputs, adding validation for empty lists and non-positive/non-integer entries would catch errors earlier with clearer messages, rather than failing cryptically in downstream codegen.

♻️ Proposed validation
     if isinstance(arrive_count, int):
         arrive_count = [arrive_count]
     else:
         arrive_count = list(arrive_count)
+    if not arrive_count:
+        raise ValueError("alloc_barrier requires at least one arrive_count")
+    if any((not isinstance(c, int)) or c <= 0 for c in arrive_count):
+        raise TypeError("alloc_barrier arrive_count entries must be positive ints")
src/transform/lower_shared_barrier.cc (1)

87-103: Defensive hardening: guard IntImmNode cast for clarity (optional).

The code assumes buffer->shape[0] is always a constant IntImm, which is guaranteed by the current API design (all barriers come from alloc_barrier(), which hardcodes shape as a Python list length). However, adding an explicit null check makes the intent clearer and prevents crashes if shape access patterns change in the future. The barrier_init annotation is reliably set by alloc_barrier() for all shared.barrier allocations, so coverage is complete.

♻️ Proposed defensive check
-      ICHECK(arrive_counts.size() == static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
+      const auto* size_imm = buffer->shape[0].as<IntImmNode>();
+      ICHECK(size_imm) << "Barrier buffer size must be a constant IntImm.";
+      ICHECK(arrive_counts.size() == static_cast<size_t>(size_imm->value))
           << "The number of arrive counts (" << arrive_counts.size() 
           << ") must match the barrier buffer size (" << buffer->shape[0] << ") for buffer " << buffer->name;

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/transform/lower_shared_barrier.cc`:
- Around line 103-113: Add a precondition check before dereferencing
buffer->shape[0].as<IntImmNode>(): verify that buffer->shape is non-empty and
that buffer->shape[0].as<IntImmNode>() is non-null using ICHECK with a clear
message mentioning buffer->name (e.g., "barrier buffer shape must be a constant
IntImm for buffer <name>"). Then use the checked IntImmNode->value for the size
comparison with arrive_counts; keep existing barrier_init_map and arrive_counts
logic unchanged.

In `@tilelang/language/builtin.py`:
- Around line 41-45: The helper currently converts a tir.Buffer to
tir.BufferLoad by defaulting to index 0 (when mbar is a tir.Buffer), which
silently misroutes multi-element buffers; update the logic around mbar handling
(where you check isinstance(mbar, tir.BufferLoad) / isinstance(mbar,
tir.Buffer)) to detect the buffer's shape/length and only auto-create
tir.BufferLoad(mbar, [0]) when the buffer is statically size 1; otherwise raise
an explicit error (or assert) requiring the caller to pass an indexed BufferLoad
(e.g., mbars[i]) so multi-element buffers are not implicitly mapped to element
0. Ensure you reference mbar, tir.Buffer, and tir.BufferLoad in the check and
error message.
♻️ Duplicate comments (1)
tilelang/language/builtin.py (1)

31-39: Docstring still marks mbar as Optional.
The signature is non-optional; please align docstring with the actual type.

🧹 Nitpick comments (1)
tilelang/language/allocate.py (1)

152-176: Validate arrive_count values before emitting barrier_init.
Right now any list content (including 0/negative or non-int) will be encoded into IntImm and can produce invalid mbarrier init or confusing errors later. Add basic validation up front.

♻️ Suggested guardrails
 def alloc_barrier(arrive_count: int | list[int]):
@@
     if isinstance(arrive_count, int):
         arrive_count = [arrive_count]
     else:
         arrive_count = list(arrive_count)
+    if not arrive_count:
+        raise ValueError("alloc_barrier expects at least one barrier.")
+    for c in arrive_count:
+        if not isinstance(c, int):
+            raise TypeError(f"arrive_count entries must be int, got {type(c)}.")
+        if c <= 0:
+            raise ValueError(f"arrive_count must be > 0, got {c}.")
     buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")

Comment on lines 103 to +115
for (auto buffer : barrier_buffers) {
auto data = buffer->data;
auto old_buffer = buffer_data_to_buffer_.at(data);
auto new_buffer = buffer_remap_.at(old_buffer);
auto count = old_buffer->shape[0];

auto call =
Call(DataType::Handle(), builtin::ptx_init_barrier_thread_count(),
{BufferLoad(new_buffer, {0}), PrimExpr(count)});
init_mbarrier_calls_.push_back(Evaluate(call));
ICHECK(barrier_init_map.count(data))
<< "Barrier buffer " << buffer->name
<< " not found in barrier_init annotation";
auto arrive_counts = barrier_init_map.at(data);
ICHECK(arrive_counts.size() ==
static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
<< "The number of arrive counts (" << arrive_counts.size()
<< ") must match the barrier buffer size (" << buffer->shape[0]
<< ") for buffer " << buffer->name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against non-constant barrier buffer sizes (null deref).
buffer->shape[0].as<IntImmNode>()->value will crash if the shape isn’t an IntImm. Add an ICHECK and a clear error message before dereferencing.

🐛 Safer size check
       auto arrive_counts = barrier_init_map.at(data);
-      ICHECK(arrive_counts.size() ==
-             static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
+      auto* shape0 = buffer->shape[0].as<IntImmNode>();
+      ICHECK(shape0) << "Barrier buffer size must be a constant IntImm.";
+      ICHECK(arrive_counts.size() == static_cast<size_t>(shape0->value))
           << "The number of arrive counts (" << arrive_counts.size()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (auto buffer : barrier_buffers) {
auto data = buffer->data;
auto old_buffer = buffer_data_to_buffer_.at(data);
auto new_buffer = buffer_remap_.at(old_buffer);
auto count = old_buffer->shape[0];
auto call =
Call(DataType::Handle(), builtin::ptx_init_barrier_thread_count(),
{BufferLoad(new_buffer, {0}), PrimExpr(count)});
init_mbarrier_calls_.push_back(Evaluate(call));
ICHECK(barrier_init_map.count(data))
<< "Barrier buffer " << buffer->name
<< " not found in barrier_init annotation";
auto arrive_counts = barrier_init_map.at(data);
ICHECK(arrive_counts.size() ==
static_cast<size_t>(buffer->shape[0].as<IntImmNode>()->value))
<< "The number of arrive counts (" << arrive_counts.size()
<< ") must match the barrier buffer size (" << buffer->shape[0]
<< ") for buffer " << buffer->name;
for (auto buffer : barrier_buffers) {
auto data = buffer->data;
ICHECK(barrier_init_map.count(data))
<< "Barrier buffer " << buffer->name
<< " not found in barrier_init annotation";
auto arrive_counts = barrier_init_map.at(data);
auto* shape0 = buffer->shape[0].as<IntImmNode>();
ICHECK(shape0) << "Barrier buffer size must be a constant IntImm.";
ICHECK(arrive_counts.size() == static_cast<size_t>(shape0->value))
<< "The number of arrive counts (" << arrive_counts.size()
<< ") must match the barrier buffer size (" << buffer->shape[0]
<< ") for buffer " << buffer->name;
}
🤖 Prompt for AI Agents
In `@src/transform/lower_shared_barrier.cc` around lines 103 - 113, Add a
precondition check before dereferencing buffer->shape[0].as<IntImmNode>():
verify that buffer->shape is non-empty and that
buffer->shape[0].as<IntImmNode>() is non-null using ICHECK with a clear message
mentioning buffer->name (e.g., "barrier buffer shape must be a constant IntImm
for buffer <name>"). Then use the checked IntImmNode->value for the size
comparison with arrive_counts; keep existing barrier_init_map and arrive_counts
logic unchanged.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1)

137-139: Second test not executed when running directly.

The __main__ block only calls test_inject_set_max_nreg() but not test_inject_set_max_nreg_no_set_max_nreg(). Consider adding the second test call or uncommenting tilelang.testing.main() to run all tests.

🔧 Proposed fix
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_inject_set_max_nreg()
+    tilelang.testing.main()

Or if manual calls are preferred:

 if __name__ == "__main__":
     # tilelang.testing.main()
     test_inject_set_max_nreg()
+    test_inject_set_max_nreg_no_set_max_nreg()
🧹 Nitpick comments (2)
docs/programming_guides/instructions.md (1)

157-160: Add a short note about reserved barrier IDs for ID-based helpers.
Since the list still mentions T.barrier_wait(id, ...), consider warning users that IDs 1–2 are reserved for internal use when they rely on ID-based helpers. Based on learnings, consider adding a brief note here.

💡 Suggested doc tweak
- - Barriers: `T.alloc_barrier(arrive_count)`.
- - Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
- - Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
+ - Barriers: `T.alloc_barrier(arrive_count)`.
+ - Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
+ - Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
+   Note: IDs 1 and 2 are reserved for internal barriers; for ID-based helpers, start from 3.
src/transform/common/mbarrier.h (1)

30-31: Resolve TODO: avoid injected mbarrier name collisions.
Using a fixed "mbarrier" name can clash with user-defined buffers; consider routing this through a unique prefix or a name-supply helper. Happy to help wire this up if you want.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/language/gemm_op.py (1)

48-55: Remove unnecessary legalize_arguments call on mbar or update its type annotation.

The mbar parameter has type tir.Buffer | tir.BufferLoad | None (line 32), which explicitly excludes tir.Var. However, line 55 calls legalize_arguments(mbar), a function designed to extract .buffer from let-bound variables. This is inconsistent with how A, B, and C are handled—those parameters accept tir.Var and justify the legalize_arguments call, but mbar does not.

If mbar should support let-bound variables, update its type annotation to match A, B, C: tir.Buffer | tir.BufferLoad | tir.Var | None. Otherwise, remove the legalize_arguments call since it serves no purpose and may obscure intent. The subsequent to_buffer_region(mbar) call at line 102 already has its own let-binding handling, so the early extraction via legalize_arguments is redundant in either case.

🤖 Fix all issues with AI agents
In `@testing/python/transform/test_tilelang_transform_warp_specialized.py`:
- Line 79: The test is assigning reserved barrier IDs (0-2) to user barriers;
locate the tl.get_mbarrier calls in test_tilelang_transform_warp_specialized.py
(the same pattern as the working example at tl.get_mbarrier(...) near the other
call) and change the expression k % 3 to k % 3 + 3 so user-defined barriers use
IDs starting at kFirstUsedBarrier (3); update both tl.get_mbarrier usages that
currently compute k % 3 to use k % 3 + 3.

In `@tilelang/language/gemm_op.py`:
- Line 207: The mbar property in gemm_base.py should be updated to return
"tir.Buffer | tir.BufferLoad | None" instead of only "tir.Buffer"; change the
type annotation on the mbar property getter to that union, ensure tir.BufferLoad
is imported/available in that module, and keep the existing
getattr(self.gemm_node, "mbar", None) usage so None is correctly accepted;
update any related docstring/type comments on the mbar property to match the new
annotation.
🧹 Nitpick comments (3)
testing/python/transform/test_tilelang_transform_lower_hopper_intrin.py (2)

36-37: Unused barrier variable in expected output.

The mbarrier variable is allocated but never referenced in the after function. It has # noqa: F841 to suppress the linter warning, but this may indicate that the expected transformed output is incomplete or should show the barrier buffer being used for subsequent operations.

Is this intentional for this specific test case, or should the expected output demonstrate barrier usage?


22-23: Disabled structural equality assertion reduces test coverage.

The tvm.ir.assert_structural_equal check is commented out with a TODO. This means the test currently only verifies that the transform runs without error, not that the output matches the expected after function.

Consider tracking this TODO with an issue or re-enabling the check once the underlying issue is resolved.

Would you like me to open an issue to track re-enabling this structural equality check?

testing/python/transform/test_tilelang_transform_warp_specialized.py (1)

21-22: Disabled structural equality assertion.

Similar to test_tilelang_transform_lower_hopper_intrin.py, the structural equality check is disabled. This reduces test effectiveness to only verifying that transforms execute without errors.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tilelang/language/allocate.py`:
- Around line 151-176: The alloc_barrier function must validate arrive_count
before allocating: ensure the normalized arrive_count list is non-empty and that
every element is an integer > 0 (reject negatives, zeros, or non-int values),
raising a clear ValueError when the checks fail; perform these validations
before creating buffer and before building arrive_count_exprs, then proceed to
convert items to IntImm and call block_attr as before (refer to symbols
alloc_barrier, arrive_count, arrive_count_exprs, buffer, block_attr).

In `@tilelang/language/builtin.py`:
- Around line 798-800: The cp_async_barrier_noinc function fails to normalize
its input like the other barrier helpers; update cp_async_barrier_noinc to call
_mbar_to_buffer_load on the incoming barrier before passing it to
tir.call_intrin so the intrinsic always receives a BufferLoad (mirror how
mbarrier_wait_parity, mbarrier_arrive, and mbarrier_expect_tx do it); locate
cp_async_barrier_noinc and replace the direct use of barrier with the result of
_mbar_to_buffer_load(barrier) when calling tir.call_intrin.

Comment on lines +151 to +176
def alloc_barrier(arrive_count: int | list[int]):
"""Allocate a barrier buffer.
Args:
arrive_count (int): The number of threads that need to arrive at the barrier
arrive_count (int | list[int]): The number of threads that need to arrive at each barrier
Returns:
T.Buffer: A TVM buffer object allocated as a barrier
Examples
--------
>>> mbar = alloc_barrier(128) # allocate a barrier with arrive count 128
>>> mbars = alloc_barrier([128] * n) # allocate n barriers with the same arrive count 128
"""
return T.alloc_buffer([arrive_count], _dtypes.uint64, scope="shared.barrier")
# Normalize to list
if isinstance(arrive_count, int):
arrive_count = [arrive_count]
else:
arrive_count = list(arrive_count)
buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")
# Convert to TIR IntImm expressions for C++ pass to consume as Map<Var, Array<PrimExpr>>
# Use buffer.data as key to support multiple barrier buffer allocations
arrive_count_exprs = [IntImm("int32", c) for c in arrive_count]
block_attr({"barrier_init": {buffer.data: arrive_count_exprs}})

return buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing validation for edge cases in alloc_barrier.

The function doesn't validate that the arrive count values are positive or that the list is non-empty. Passing an empty list would create a zero-length buffer, and negative arrive counts would silently propagate invalid values.

🛡️ Proposed validation
 def alloc_barrier(arrive_count: int | list[int]):
     ...
     # Normalize to list
     if isinstance(arrive_count, int):
         arrive_count = [arrive_count]
     else:
         arrive_count = list(arrive_count)
+    if not arrive_count:
+        raise ValueError("alloc_barrier requires at least one barrier (arrive_count cannot be empty)")
+    if any(c <= 0 for c in arrive_count):
+        raise ValueError("arrive_count values must be positive integers")
     buffer = T.alloc_buffer((len(arrive_count),), _dtypes.uint64, scope="shared.barrier")
🤖 Prompt for AI Agents
In `@tilelang/language/allocate.py` around lines 151 - 176, The alloc_barrier
function must validate arrive_count before allocating: ensure the normalized
arrive_count list is non-empty and that every element is an integer > 0 (reject
negatives, zeros, or non-int values), raising a clear ValueError when the checks
fail; perform these validations before creating buffer and before building
arrive_count_exprs, then proceed to convert items to IntImm and call block_attr
as before (refer to symbols alloc_barrier, arrive_count, arrive_count_exprs,
buffer, block_attr).

- Updated the `T.alloc_barrier` function to accept an array of arrive counts instead of individual thread counts.
- Modified the `get_mbarrier` function documentation to clarify its purpose and usage.
- Enhanced the `CreateMBarrierBuffer` function to create mbarrier buffers with shared storage scope.
- Updated tests to reflect changes in mbarrier allocation and usage.
- Adjusted comments in various files to improve clarity regarding mbarrier functionality.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/transform/lower_hopper_intrin.cc`:
- Around line 156-164: The pass LowerHopperIntrin is creating shared.barrier
allocations (via CreateMBarrierBuffer and mbarrier_buffer_ using
injected_mbarrier_name_ and init_mbarrier_calls_) without backend gating, which
causes codegen_cutedsl.cc to hit ICHECK(false) for "shared.barrier"; either gate
the LowerHopperIntrin application to CUDA-only at the Python pass registration
(so the pass only runs for CUDA targets) or modify CuTeDSL codegen to accept
"shared.barrier" allocations (add handling for the allocation scope in
codegen_cutedsl.cc instead of ICHECK(false)); locate the LowerHopperIntrin
invocation/registration in tilelang/transform/__init__.py and the
CreateMBarrierBuffer/mbarrier_buffer_ allocation block in lower_hopper_intrin.cc
to implement the gating, or update codegen_cutedsl.cc to implement
shared.barrier lowering.
🧹 Nitpick comments (1)
src/transform/common/mbarrier.h (1)

23-31: Avoid hard-coded injected barrier names.
injected_mbarrier_name_ = "mbarrier" can collide with user-defined buffers; consider generating a unique prefix/suffix (or exposing a helper to do so). Also consider explicitly including <string> or using constexpr std::string_view to avoid transitive-include reliance.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/minference/example_vertical_slash_sparse_attn.py (1)

138-179: Avoid reserved barrier IDs 1 and 2 in user code.

This example uses mbars[1] and mbars[2] (via indices 0..8). In Tilelang’s CUDA/CuTeDSL backends, barrier IDs 1 and 2 are reserved for internal use, which can cause synchronization conflicts. Please offset user barrier indices to start at 3 (e.g., allocate extra barriers and add a base offset).

Based on learnings, user-defined barriers should start from ID 3 to avoid internal conflicts.

🤖 Fix all issues with AI agents
In `@tilelang/language/gemm_op.py`:
- Around line 98-105: The current code converts mbar to a BufferRegion via
to_buffer_region (mbar = to_buffer_region(...)) which yields a BufferRegion that
cannot be used in tir.call_intrin; instead call the helper _mbar_to_buffer_load
to produce a BufferLoad PrimExpr (when mbar is not None) and assign that to
mbar, and keep the else branch producing tir.const(0, T.uint32); update the
place where tir.call_intrin is invoked to pass this BufferLoad (same variable
mbar) just like buffer_region_to_tile_region is used for A/B/C and similar to
mbarrier_wait_parity’s handling. Ensure you reference and import
_mbar_to_buffer_load if needed.
🧹 Nitpick comments (2)
src/transform/lower_hopper_intrin.cc (1)

156-164: Consider unique naming for injected barrier buffer.

Using a fixed "mbarrier" name risks collisions with user-defined buffers (not just shared.barrier). Consider minting a unique name or checking for name conflicts in the PrimFunc buffer map before injecting.

tilelang/language/builtin.py (1)

43-45: Shape check only validates dimensionality, not single-element constraint.

The assertion len(mbar.shape) == 1 verifies the buffer is 1D but doesn't check if shape[0] == 1. A past review suggested guarding against multi-element buffers being silently converted to barrier[0]. The current check allows a multi-element 1D buffer to pass through, potentially masking user errors.

Consider enhancing the validation:

🛡️ Suggested enhancement
     elif isinstance(mbar, tir.Buffer):
-        assert len(mbar.shape) == 1, f"mbarrier must be a single element buffer, but got {mbar.shape}"
+        if len(mbar.shape) != 1:
+            raise TypeError(f"mbarrier Buffer must be 1D, but got shape {mbar.shape}")
+        shape0 = mbar.shape[0]
+        if isinstance(shape0, tir.IntImm) and int(shape0) != 1:
+            raise TypeError(f"mbarrier Buffer has {int(shape0)} elements; use mbars[idx] for multi-element buffers")
         return tir.BufferLoad(mbar, [0])

Comment on lines +98 to 105
if mbar is not None:
assert isinstance(mbar, (tir.Buffer, tir.BufferLoad)), (
f"mbar for tcgen5mma must be a tir.Buffer or tir.BufferLoad, but got {type(mbar)}"
)
mbar = to_buffer_region(mbar, access_type="rw")
else:
mbar = tir.const(0, T.uint32)
C_coords = [r.min for r in C_region.region]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the gemm_op.py file and understand to_buffer_region
fd gemm_op.py -type f

Repository: tile-ai/tilelang

Length of output: 230


🏁 Script executed:

# Search for to_buffer_region definition and usage
rg -n "def to_buffer_region" -A 10

Repository: tile-ai/tilelang

Length of output: 853


🏁 Script executed:

# Search for how mbar is used in gemm_op.py context
rg -n "call_intrin" -g '*.py' -C 3

Repository: tile-ai/tilelang

Length of output: 50372


🏁 Script executed:

# Search for mbar usage patterns in the codebase
rg -n "mbar" -g '*.py' -C 2

Repository: tile-ai/tilelang

Length of output: 50372


🏁 Script executed:

cat tilelang/utils/language.py | sed -n '208,280p'

Repository: tile-ai/tilelang

Length of output: 3327


🏁 Script executed:

# Check if BufferRegion is a valid PrimExpr or has conversion
rg -n "class BufferRegion" tilelang/ --type py

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check reduce_op.py line 56 to see if similar pattern works
sed -n '50,90p' tilelang/language/reduce_op.py

Repository: tile-ai/tilelang

Length of output: 1574


🏁 Script executed:

# Check if BufferRegion or similar patterns cause issues in practice
rg -n "BufferRegion" tilelang/language/gemm_op.py -A 2 -B 2

Repository: tile-ai/tilelang

Length of output: 607


🏁 Script executed:

# Look at how the mbar is actually used in the call_intrin at line 110-131
sed -n '106,135p' tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 775


🏁 Script executed:

# Check what buffer_region_to_tile_region does
rg -n "def buffer_region_to_tile_region" tilelang/ -A 15

Repository: tile-ai/tilelang

Length of output: 1755


🏁 Script executed:

# Check the region function signature
sed -n '7,12p' tilelang/language/utils.py

Repository: tile-ai/tilelang

Length of output: 343


🏁 Script executed:

# Check how mbar is supposed to be normalized - search for gemm implementation details
rg -n "mbar" tilelang/tileop/gemm/gemm_base.py -A 2 -B 2 | head -40

Repository: tile-ai/tilelang

Length of output: 361


🏁 Script executed:

# Check how mbar is used in tcgen05 - does it expect BufferLoad or BufferRegion?
sed -n '93,130p' tilelang/tileop/gemm/gemm_tcgen05.py

Repository: tile-ai/tilelang

Length of output: 1559


🏁 Script executed:

# Check if there are tests for gemm with mbar
fd test.*gemm tilelang/ --type f | head -5 | xargs grep -l "mbar" 2>/dev/null | head -3

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check what the actual gemm operation expects for mbar by looking at the op definition
rg -n "tl.tileop.gemm" tilelang/ --type py -A 5 -B 5 | head -60

Repository: tile-ai/tilelang

Length of output: 3003


🏁 Script executed:

# Check if there's documentation on gemm op signature
rg -n "tl.gemm" tilelang/language/gemm_op.py -C 3

Repository: tile-ai/tilelang

Length of output: 470


🏁 Script executed:

# Check examples that use mbar to see if they work or if there are issues
grep -r "gemm.*mbar" examples/ tilelang/ --include="*.py" | head -5

Repository: tile-ai/tilelang

Length of output: 408


🏁 Script executed:

# Check the actual test for gemm_tcgen5mma to see if it passes with current code
find . -name "*test*gemm*" -type f | grep -E "\.(py|cc)$" | head -5

Repository: tile-ai/tilelang

Length of output: 325


🏁 Script executed:

# Check if examples that use mbar actually run without errors
cat examples/gemm_sm100/gemm_tcgen5mma.py | head -50

Repository: tile-ai/tilelang

Length of output: 1699


🏁 Script executed:

# Check what type call_intrin actually accepts - look at the C++ or Python side
rg -n "def call_intrin" tilelang/language/tir/op.py -A 25

Repository: tile-ai/tilelang

Length of output: 838


🏁 Script executed:

# Check if there's a helper already in the codebase for converting mbar
rg -n "_mbar_to_buffer_load" tilelang/ --type py -A 15 | head -40

Repository: tile-ai/tilelang

Length of output: 2597


🏁 Script executed:

# Verify the current state - does the code at line 102 actually cause issues?
# Check if there are any error reports or if it's tested
grep -r "to_buffer_region(mbar" tilelang/ --include="*.py"

Repository: tile-ai/tilelang

Length of output: 144


🏁 Script executed:

# Verify what the actual expected type is by checking how mbar flows through the system
# Look at the intrinsic handling code
rg -n "tl.tileop.gemm" tilelang/language/tir/op.py -C 5

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check if there are any existing tests that would catch this
find . -path ./build -prune -o -name "test*.py" -type f -exec grep -l "gemm.*mbar" {} \; | head -3

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check if BufferRegion is actually accepted as a valid argument type by looking at type hints
rg -n "call_intrin.*handle" tilelang/language/gemm_op.py -A 20 | head -60

Repository: tile-ai/tilelang

Length of output: 779


🏁 Script executed:

# Search for how similar operations in other places convert their arguments
rg -n "buffer_region_to_tile_region" tilelang/language/ --type py | grep -v ".pyc"

Repository: tile-ai/tilelang

Length of output: 1179


🏁 Script executed:

# Final verification: check what the actual expected type is for mbar in the intrinsic
# by looking at where it's documented or how it's used downstream
rg -n "mbar" tilelang/intrinsics/tcgen05_macro_generator.py -C 3 | head -50

Repository: tile-ai/tilelang

Length of output: 1109


🏁 Script executed:

# Check line 55 more carefully - what type is mbar after legalize_arguments?
sed -n '39,60p' tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 828


🏁 Script executed:

# Check if there's a transformation pass that converts BufferRegion to something else
rg -n "BufferRegion" tilelang/transform/ --type py | head -20

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Final check: does the current code actually work? Look for any known issues or TODOs
rg -n "TODO|FIXME|BUG|HACK" tilelang/language/gemm_op.py

Repository: tile-ai/tilelang

Length of output: 42


Use _mbar_to_buffer_load() to convert mbar to a valid BufferLoad for tir.call_intrin.

Line 102 converts mbar to a BufferRegion via to_buffer_region(mbar, access_type="rw"), but BufferRegion is not a PrimExpr and cannot be passed directly to tir.call_intrin. Unlike A, B, C which are converted to tl.region calls via buffer_region_to_tile_region(), mbar is passed raw to the intrinsic at line 129.

Use the existing _mbar_to_buffer_load() helper (already in tilelang/language/builtin.py) to ensure mbar is a valid BufferLoad:

if mbar is not None:
    assert isinstance(mbar, (tir.Buffer, tir.BufferLoad)), (
        f"mbar for tcgen5mma must be a tir.Buffer or tir.BufferLoad, but got {type(mbar)}"
    )
+   from tilelang.language.builtin import _mbar_to_buffer_load
+   mbar = _mbar_to_buffer_load(mbar)
else:
    mbar = tir.const(0, T.uint32)

This aligns with how mbarrier_wait_parity() and other barrier functions handle mbar conversion.

🤖 Prompt for AI Agents
In `@tilelang/language/gemm_op.py` around lines 98 - 105, The current code
converts mbar to a BufferRegion via to_buffer_region (mbar =
to_buffer_region(...)) which yields a BufferRegion that cannot be used in
tir.call_intrin; instead call the helper _mbar_to_buffer_load to produce a
BufferLoad PrimExpr (when mbar is not None) and assign that to mbar, and keep
the else branch producing tir.const(0, T.uint32); update the place where
tir.call_intrin is invoked to pass this BufferLoad (same variable mbar) just
like buffer_region_to_tile_region is used for A/B/C and similar to
mbarrier_wait_parity’s handling. Ensure you reference and import
_mbar_to_buffer_load if needed.

@Rachmanino Rachmanino changed the title [WIP][Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics [Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics Jan 27, 2026
stream << ", " << constant_size << "), (" << constant_size << ",))\n";
} else if (scope == "shared.barrier") {
ICHECK(false) << "Unsupported scope: " << scope;
stream << vid << " = tl.alloc_smem(cutlass.Uint64, size_in_elems="
Copy link
Collaborator Author

@Rachmanino Rachmanino Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz review this, since im not very familiar with cutedsl details though ci passed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have a simple mbarrier codegen test for this modification.

Copy link
Member

@LeiWang1999 LeiWang1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Left some comments.

B_shared = T.decl_buffer((3, 1, 4, 512), T.float16, scope="shared.dyn")
C_local = T.decl_buffer((32,), scope="local")
T.create_list_of_mbarrier(128, 128, 128, 128, 128, 128)
T.call_intrin("handle", tir.op.Op.get("tl.create_list_of_mbarrier"), 128, 128, 128, 128, 128, 128)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can‘t we keep the T.create_list_of_mbarrier(128, 128, 128, 128, 128, 128)

T.tma_load(
T.create_tma_descriptor(6, 2, A.data, 512, 512, 2, 1024, 32, 64, 1, 1, 0, 2, 2, 0),
T.get_mbarrier(k % 3),
T.call_intrin("handle", tir.op.Op.get("tl.get_mbarrier"), k % 3),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

stream << ", " << constant_size << "), (" << constant_size << ",))\n";
} else if (scope == "shared.barrier") {
ICHECK(false) << "Unsupported scope: " << scope;
stream << vid << " = tl.alloc_smem(cutlass.Uint64, size_in_elems="
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have a simple mbarrier codegen test for this modification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants