Skip to content

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Dec 26, 2025

  • Introduced new annotations for parallel loop layouts and predicates in layout.h to improve layout management.
  • Refactored loop lowering logic in copy.cc to utilize a new LowerParallelLoop function, consolidating partitioning and vectorization steps.
  • Updated layout inference to store layout and predicate annotations on For nodes, ensuring proper handling during subsequent transformations.
  • Added checks in lower_tile_op.cc to enforce layout annotations for parallel loops, enhancing error handling and clarity.

This update aims to streamline parallel loop processing and improve the overall efficiency of layout inference in the codebase.

Summary by CodeRabbit

  • Refactor

    • Switched to an annotation-driven model across ops and loop/layout handling, simplifying parallel loop processing and reducing internal state.
    • Layout inference now records layouts/predicates as loop annotations; lowering and passes preserve and consume these annotations.
  • New Features

    • Many ops and builders now accept and expose a unified annotations map for configuration.
    • Added a public parallel-loop lowering API and a validator to consistently handle partitioning, vectorization, and predicates.

✏️ Tip: You can customize this high-level summary in your review settings.

* Introduced new annotations for parallel loop layouts and predicates in layout.h to improve layout management.
* Refactored loop lowering logic in copy.cc to utilize a new LowerParallelLoop function, consolidating partitioning and vectorization steps.
* Updated layout inference to store layout and predicate annotations on For nodes, ensuring proper handling during subsequent transformations.
* Added checks in lower_tile_op.cc to enforce layout annotations for parallel loops, enhancing error handling and clarity.

This update aims to streamline parallel loop processing and improve the overall efficiency of layout inference in the codebase.
@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 26, 2025

📝 Walkthrough

Walkthrough

Layout inference now annotates parallel For nodes with layout and optional predicate; lowering defers partitioning/vectorization to a new LowerParallelLoop helper. Many ops switched to an annotations map API; Call nodes and several transforms now preserve/pass annotations.

Changes

Cohort / File(s) Summary
Layout attributes
src/layout/layout.h
Added public attribute keys: kParallelLoopLayout, kParallelLoopPredicate, kCoalescedWidth.
Layout inference (annotation-only)
src/transform/layout_inference.cc
Removed in-place partition/vectorize transforms; now writes parallel_loop_layout and optional parallel_loop_predicate annotations on For nodes. Constructor simplified (removed skip-thread-partition state).
Parallel lowering helper
src/transform/loop_partition.h, src/transform/loop_partition.cc
Added tvm::tl::LowerParallelLoop(For, Fragment, Var, Analyzer, Optional<PrimExpr>, bool parallel_loop=true, bool should_vectorize=true) that composes partitioning, vectorization (using cloned analyzer), and optional predicate guarding.
Tile lowering pass
src/transform/lower_tile_op.cc
New VisitStmt_(const ForNode*) handler lowers ForKind::kParallel by reading annotations and calling LowerParallelLoop; includes vectorization/partitioning decision logic and preserves Call annotations.
Loop partition consumers
src/transform/loop_partition.cc, src/transform/loop_partition.h
LowerParallelLoop exposed and used by lowering paths (replaces prior PartitionLoop+VectorizeLoop call sites).
Copy & related ops (annotations API)
src/op/copy.h, src/op/copy.cc, src/op/* , tilelang/language/copy_op.py
Replaced per-field options (coalesced_width, disable_tma, eviction_policy) with unified annotations map on operators; constructors accept optional annotations; lowering uses GetDisableTMA()/GetEvictionPolicy() and delegates non-local parallel lowering to LowerParallelLoop.
AtomicAdd op (annotations API)
src/op/atomic_add.h, src/op/atomic_add.cc, tilelang/language/atomic.py
Replaced use_tma, memory_order, coalesced_width fields with annotations map; added accessors (GetUseTMA, GetMemoryOrder); per-loop annotations derived from top-level annotations and propagated into For nodes.
Other operator constructors
src/op/fill.*, src/op/gemm.*, src/op/gemm_sp.*, src/op/reduce.*, src/op/region.*, src/op/finalize_reducer.*, src/op/gemm_py.*, src/op/gemm_sp_py.*
Many op constructors extended to accept optional annotations parameter (default empty map) to carry metadata without changing core logic.
Operator builder API / registration
src/op/operator.cc, src/op/operator.h
Op builder function type changed to accept (args, annotations); registration macros and builder invocations updated to forward annotations.
IR annotation propagation
src/transform/inject_pipeline.cc, src/transform/multi_version_buffer_rewriter.cc, src/transform/*
Rebuilt Call nodes now preserve/forward call->annotations (Call creations updated to include annotations).
Parallel-loop validation
src/transform/parallel_loop_layout_validator.h
Added validator that enforces consistency between annotated loop layout and nested parallel-loop depth; ensures inner parallel loops lack layout annotations.
Parallel/coalesced key usage
src/op/parallel.cc
Replaced internal key usage with attr::kCoalescedWidth annotation lookup.
Language API plumbing
tilelang/language/tir/op.py, tilelang/language/atomic.py, tilelang/language/copy_op.py
Python wrappers updated to build and pass annotations maps (new function params where applicable, e.g., call_intrin now accepts annotations).
Third-party
3rdparty/tvm
Submodule pointer updated.

Sequence Diagram(s)

sequenceDiagram
    participant LI as LayoutInferencer
    participant For as ForNode (annotated)
    participant LT as LowerTileOpPass
    participant LP as LowerParallelLoop
    participant Part as PartitionLoop
    participant Vec as VectorizeLoop
    participant ITE as IfThenElse

    Note over LI,For: Phase 1 — Inference (no structural transform)
    LI->>For: Infer layout & optional predicate
    LI->>For: Set annotations kParallelLoopLayout / kParallelLoopPredicate

    Note over LT,LP: Phase 2 — Lowering (deferred)
    LT->>For: Detect ForKind::kParallel
    LT->>For: Read kParallelLoopLayout (+ predicate)
    LT->>LP: Call LowerParallelLoop(loop, layout, thread_var, analyzer, predicate, parallel_loop=true, should_vectorize=...)

    Note over LP,Part: LowerParallelLoop steps
    LP->>Part: PartitionLoop(loop, layout) alt parallel_loop true
    Part-->>LP: partitioned loop
    LP->>Vec: VectorizeLoop(partitioned_loop, cloned_analyzer) alt should_vectorize true
    Vec-->>LP: vectorized loop
    alt predicate present
        LP->>ITE: Wrap vectorized loop with IfThenElse(predicate)
        ITE-->>LP: guarded loop
    end
    LP-->>LT: lowered loop replaces original For
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • XuehaiPan
  • chengyupku

Poem

🐰 I hopped through loops and left a mark,

I labeled threads and lit the dark.
Partitioned, vectorized — all on cue,
Annotations guided the lowering crew.
🥕 Cheers for tidy tiles and cleaner view!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.35% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main refactoring effort: introducing layout annotations for parallel operation nodes and copy nodes, which is the core theme across most changed files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…tions

* Added support for annotations in AtomicAddNode and CopyNode to enhance flexibility in memory operations.
* Updated constructors to parse annotations from input arguments, allowing for coalesced width, TMA usage, and memory order to be specified via an annotations map.
* Refactored related logic in atomic_add.cc and copy.cc to utilize the new annotations, improving clarity and maintainability.
* Enhanced Python bindings to support annotations in atomic and copy operations, ensuring consistency across the API.

This update aims to streamline the handling of memory operation parameters and improve the overall usability of the API.
* Updated constructors for various operators (AtomicAdd, Copy, Fill, Gemm, etc.) to accept an annotations map, improving flexibility in handling additional parameters.
* Refactored related logic in operator implementations to utilize the new annotations, ensuring consistency across the API.
* Enhanced Python bindings to support annotations in operator calls, streamlining the interface for users.

This update aims to improve the usability and extensibility of operator functionalities in the codebase.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/language/tir/op.py (1)

120-145: Update docstring to document the annotations parameter

The function forwards annotations to _tvm_op.call_intrin, but the docstring omits it from the Parameters section. Add annotations with a brief description of its expected structure (e.g., optional dict or similar type) to the "Parameters" section so users don't need to infer the parameter type from other files. The forwarding via keyword argument maintains backward compatibility with existing callers.

🧹 Nitpick comments (13)
src/op/reduce.h (2)

128-130: Unused annotations parameter.

The annotations parameter is accepted but not stored in ReduceOpNode. Based on the relevant code snippet from reduce.cc (lines 30-42), the constructor doesn't use or store the annotations. If annotations are not yet needed for ReduceOp, consider documenting this with a comment or storing the parameter for future use to maintain consistency with AtomicAddNode which stores annotations.


168-170: Unused annotations parameter.

Similar to ReduceOp, the annotations parameter is accepted but not stored in CumSumOpNode. The constructor implementation in reduce.cc (lines 496-518) doesn't use the parameter. Consider storing annotations in the node for consistency with the broader refactoring pattern, or document why it's deferred.

src/op/fill.h (1)

48-49: Unused annotations parameter.

The annotations parameter is accepted but not utilized in the constructor implementation (see fill.cc lines 61-95). While this maintains API consistency with other operators, consider either storing annotations in FillNode or documenting that annotation support is planned for future implementation.

src/op/gemm_sp.cc (1)

86-113: Unused annotations parameter.

The annotations parameter is accepted but not stored or used within the constructor body. This aligns with the broader API refactoring pattern, but consider storing it in GemmSPNode for consistency with AtomicAddNode or documenting that annotation support is deferred.

src/op/gemm_sp_py.h (1)

87-89: Unused annotations parameter.

Based on the relevant code snippet from gemm_sp_py.cc (lines 51-86), the annotations parameter is accepted but not utilized. Consider storing annotations in GemmSPPyNode for future use or documenting the deferral.

src/op/fill.cc (1)

61-94: Unused annotations parameter.

The annotations parameter is accepted but never stored in the FillNode or used within the constructor. While this maintains forward compatibility with the annotation-based API, consider storing it in the node to avoid needing another API change later, or add a comment explaining why annotations are deferred for Fill operations.

src/op/region.cc (1)

21-53: Unused annotations parameter.

The annotations parameter is accepted but not stored in RegionOpNode or used within the constructor. Given that RegionOp is described as a "bridge to carry BufferRegion via Call args," consider whether annotations would be useful for this operator or document why they're deferred.

src/transform/lower_tile_op.cc (1)

672-827: Parallel For lowering logic is reasonable; consider clarifying thread binding and analysis passes

The new VisitStmt_(const ForNode* op) cleanly defers actual partitioning/vectorization to LowerParallelLoop while making decisions based on reducer info, local vs non‑local buffers, and CUDA‑vectorizable casts. A few points to consider:

  • The decision to set parallel_loop = !local_register_only && !store_into_local means patterns like A_local[i] = A_global[i] will skip thread partitioning even though they touch global memory. If that’s intentional (per‑thread register caches), it’s worth calling out explicitly in comments/tests; otherwise, you might want a stricter condition (e.g. require that all uses are local before disabling partitioning).
  • has_cast_operations is gated on TargetIsCuda(Target::Current()). Since this pass already has a target_ field taken from the PrimFunc, using that instead of Target::Current() would avoid relying on ambient global state and make behavior more predictable under nested pass pipelines.
  • The body is traversed three times with PostOrderVisit (for local/non‑local, reducers, and casts). If this ever shows up hot in profiles, you could fold those checks into a single visitor over for_node->body, but that’s purely a perf/readability refactor.

Overall, the structure and the way you pass parallel_loop and should_vectorize into LowerParallelLoop look consistent with the new annotation‑driven pipeline.

src/transform/parallel_loop_layout_validator.h (1)

18-135: Parallel loop layout validator matches the new annotation model; be aware it only counts consecutive For chains

The validator nicely encodes the intended contract:

  • Every ForKind::kParallel loop must either be the annotated outermost loop of a consecutive nest or be structurally inside such a nest.
  • InputDim must equal the number of directly nested parallel For nodes.
  • Inner loops are enforced to be annotation‑free.

Two minor points:

  • CountNestedParallelLoops only follows ForNode::body.as<ForNode>(), so “nested” here means literally for → for → … without an intervening Block/If/AttrStmt. If you ever want to support patterns like for { block { for(...) } }, you’ll need a more general notion of the nested parallel cluster.
  • The fatal on any parallel loop missing layout is quite strict; that’s appropriate if LayoutInference is guaranteed to run before any consumer, but it might be worth documenting the expected pass ordering and invariants to avoid surprises when composing with other TIR passes.

Functionally the implementation looks sound.

src/transform/layout_inference.cc (2)

74-245: Alias-aware layout propagation in RunInferStep is a solid robustness improvement

The refactored RunInferStep does a good job of tightening invariants:

  • It adds defensive checks on cur_infer_id, infer_list_, thread_var_vec_, and constant extents before calling InferLayout, which should make layout‑debugging failures much clearer.
  • The propagate_alias lambda that walks buffer_data_to_buffers_ and reshapes layouts onto alias buffers (sharing the same data Var) prevents inconsistent layouts when multiple Buffer views point at the same storage. The IsEqual/ICHECK on pre‑existing entries is also valuable to catch silent divergence.
  • Enqueueing dependent ops via EnqueueWithPriority for alias siblings means the BFS ordering still respects when enough layouts exist to make inference meaningful.

Net effect: fewer “mysterious” layout conflicts when reshapes or aliasing buffers are present. The added checks and warnings around use_list_ look well‑thought‑out.


1158-1176: LayoutInferencer now pushes loop layout/predicate into annotations and delegates lowering; validate pass wiring looks correct

The new flow in this file hangs together:

  • LayoutInferencer::Substitute still runs ParallelLoopFuser, collects operators, and computes LayoutInferenceResult, but LayoutInferencer itself now only annotates For nodes with attr::kParallelLoopLayout and (optionally) attr::kParallelLoopPredicate instead of performing PartitionLoop/VectorizeLoop. That’s exactly what the new LowerParallelLoop/LowerTileOp pipeline expects.
  • In VisitStmt_(const ForNode* op) you first recurse, then look up result_.for_map by the original For handle and attach the fragment plus a simplified predicate when present. Skipping storage of trivially‑true predicates is a nice touch.
  • LayoutInference() now calls ParallelLoopTransformer::Substitute, chooses skip_thread_partition based on whether any thread bindings exist, runs LayoutInferencer::Substitute, and finally enforces invariants using ParallelLoopLayoutValidator::Validate(f->body). That gives a clear separation: inference pass = annotation, subsequent passes = consumers.

One small nit: in VisitStmt_(const BlockNode* op) the scope check is against "local.framgent" (typo). If you were relying on that ICHECK to catch missing fragment layouts, it currently never fires; consider correcting it to "local.fragment" or switching to IsFragmentBuffer(buffer) for consistency with the rest of the code.

Also applies to: 1221-1246, 1259-1273

src/op/copy.h (1)

118-155: Annotations map on CopyNode is consistent with the new design; consider either using or omitting annotations for Conv2DIm2ColOp

The CopyNode changes look good:

  • annotations consolidates coalesced width, TMA disable, and eviction policy into a single map, and the reflection exposure lets Python/IR tooling introspect it.
  • GetDisableTMA() / GetEvictionPolicy() are simple and robust, defaulting to “TMA enabled” and evict_normal when the keys are absent.

On the constructors:

  • Copy(Array<PrimExpr>, Map<String,ObjectRef>) aligns with AtomicAdd and is used in copy.cc to seed node->annotations.
  • Conv2DIm2ColOp(Array<PrimExpr> args, Map<String,ObjectRef> annotations = …) currently doesn’t use its annotations argument in the implementation; if you don’t plan to make Conv2DIm2ColOp annotation‑driven, you might want to drop the parameter to avoid confusion, or at least leave a brief comment explaining that annotations are intentionally ignored for now.

Otherwise this header matches the implementation in copy.cc.

Also applies to: 342-349, 412-417

src/op/copy.cc (1)

1562-1580: Conv2DIm2ColOp constructor signature now accepts annotations but does not use them

Conv2DIm2ColOp::Conv2DIm2ColOp(Array<PrimExpr> args, Map<String,ObjectRef> annotations) adds the annotations parameter but still parses all configuration (including eviction_policy_) positionally from args[2..8] and ignores annotations. That’s fine if Conv2DIm2ColOp isn’t yet migrated to the annotation model, but it is a bit inconsistent with Copy and AtomicAdd.

If you don’t plan to use annotations here in the near term, consider either:

  • Dropping the annotations argument from the signature, or
  • At least consuming known keys (e.g. "eviction_policy") from annotations and asserting consistency with args[8], to avoid silent mismatches.

Right now it behaves like the old API with an unused extra parameter.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d547c6 and ab50c73.

📒 Files selected for processing (33)
  • 3rdparty/tvm
  • src/op/atomic_add.cc
  • src/op/atomic_add.h
  • src/op/copy.cc
  • src/op/copy.h
  • src/op/fill.cc
  • src/op/fill.h
  • src/op/finalize_reducer.cc
  • src/op/finalize_reducer.h
  • src/op/gemm.cc
  • src/op/gemm.h
  • src/op/gemm_py.cc
  • src/op/gemm_py.h
  • src/op/gemm_sp.cc
  • src/op/gemm_sp.h
  • src/op/gemm_sp_py.cc
  • src/op/gemm_sp_py.h
  • src/op/operator.cc
  • src/op/operator.h
  • src/op/reduce.cc
  • src/op/reduce.h
  • src/op/region.cc
  • src/op/region.h
  • src/transform/inject_pipeline.cc
  • src/transform/layout_inference.cc
  • src/transform/loop_partition.cc
  • src/transform/loop_partition.h
  • src/transform/lower_tile_op.cc
  • src/transform/multi_version_buffer_rewriter.cc
  • src/transform/parallel_loop_layout_validator.h
  • tilelang/language/atomic.py
  • tilelang/language/copy_op.py
  • tilelang/language/tir/op.py
✅ Files skipped from review due to trivial changes (1)
  • 3rdparty/tvm
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/language/atomic.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/inject_pipeline.cc
🧬 Code graph analysis (23)
src/op/gemm_sp_py.cc (1)
src/op/gemm_sp_py.h (1)
  • GemmSPPy (83-91)
src/op/gemm_sp_py.h (1)
src/op/gemm_sp_py.cc (1)
  • GemmSPPy (52-87)
src/op/gemm_sp.cc (1)
tilelang/ir.py (1)
  • GemmSP (52-52)
src/op/gemm_py.h (2)
src/op/gemm_py.cc (1)
  • GemmPy (53-93)
src/transform/lower_opaque_block.cc (4)
  • annotations (157-169)
  • annotations (158-158)
  • annotations (226-266)
  • annotations (227-230)
src/op/fill.cc (2)
src/op/fill.h (1)
  • Fill (45-51)
tilelang/ir.py (1)
  • Fill (10-10)
src/op/finalize_reducer.cc (1)
tilelang/ir.py (1)
  • FinalizeReducerOp (56-56)
src/op/gemm_py.cc (2)
tilelang/tileop/gemm/__init__.py (1)
  • GemmPy (56-192)
src/transform/lower_opaque_block.cc (4)
  • annotations (157-169)
  • annotations (158-158)
  • annotations (226-266)
  • annotations (227-230)
src/transform/multi_version_buffer_rewriter.cc (1)
src/transform/inject_pipeline.cc (2)
  • call (122-152)
  • call (122-123)
src/op/gemm_sp.h (1)
src/op/gemm_sp.cc (1)
  • GemmSP (86-113)
src/op/operator.cc (2)
src/transform/layout_inference.cc (16)
  • op (45-51)
  • op (45-45)
  • op (483-550)
  • op (483-483)
  • op (616-706)
  • op (616-616)
  • op (708-763)
  • op (708-708)
  • op (765-774)
  • op (765-765)
  • op (776-781)
  • op (776-776)
  • op (800-828)
  • op (800-800)
  • op (830-858)
  • op (830-830)
src/transform/inject_pipeline.cc (2)
  • call (122-152)
  • call (122-123)
src/transform/inject_pipeline.cc (1)
src/transform/multi_version_buffer_rewriter.cc (2)
  • call (443-473)
  • call (443-444)
src/op/gemm.h (2)
src/op/gemm.cc (1)
  • Gemm (53-95)
tilelang/ir.py (1)
  • Gemm (48-48)
src/op/atomic_add.cc (2)
tilelang/ir.py (1)
  • AtomicAdd (14-14)
src/transform/atomicadd_vectorize.cc (4)
  • node (186-211)
  • node (186-186)
  • node (213-291)
  • node (213-213)
src/op/finalize_reducer.h (2)
src/op/finalize_reducer.cc (1)
  • FinalizeReducerOp (34-44)
tilelang/ir.py (1)
  • FinalizeReducerOp (56-56)
src/op/copy.cc (2)
tilelang/ir.py (2)
  • Copy (18-18)
  • Conv2DIm2ColOp (22-22)
src/op/copy.h (1)
  • GetDisableTMA (140-156)
src/op/reduce.cc (3)
src/op/reduce.h (2)
  • ReduceOp (124-132)
  • CumSumOp (164-172)
tilelang/ir.py (2)
  • ReduceOp (64-64)
  • CumSumOp (68-68)
src/transform/lower_opaque_block.cc (4)
  • annotations (157-169)
  • annotations (158-158)
  • annotations (226-266)
  • annotations (227-230)
src/op/reduce.h (2)
src/op/reduce.cc (2)
  • ReduceOp (31-43)
  • CumSumOp (497-519)
tilelang/ir.py (2)
  • ReduceOp (64-64)
  • CumSumOp (68-68)
tilelang/language/copy_op.py (1)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
src/op/gemm.cc (2)
src/op/gemm.h (1)
  • Gemm (177-183)
tilelang/ir.py (1)
  • Gemm (48-48)
src/transform/loop_partition.h (1)
src/transform/loop_partition.cc (2)
  • LowerParallelLoop (273-297)
  • LowerParallelLoop (273-275)
src/op/copy.h (2)
src/op/copy.cc (2)
  • Copy (106-120)
  • Conv2DIm2ColOp (1564-1580)
tilelang/ir.py (2)
  • Copy (18-18)
  • Conv2DIm2ColOp (22-22)
src/op/region.h (1)
src/op/region.cc (1)
  • RegionOp (21-53)
src/op/atomic_add.h (1)
src/op/atomic_add.cc (1)
  • AtomicAdd (44-58)
🔇 Additional comments (28)
src/op/atomic_add.h (2)

49-66: Well-structured annotation accessors.

The helper methods provide clean access to annotation values with sensible defaults. The implementation correctly checks for the presence of keys and handles type casting safely.


23-27: Good documentation of annotation keys.

The inline comments clearly document the supported annotation keys and their expected types. Consider adding value constraints or examples where applicable (e.g., valid memory_order values).

src/op/gemm_sp.h (1)

112-113: Constructor signature updated consistently.

The declaration matches the implementation in gemm_sp.cc. Note that the annotations parameter is currently unused (see comments on gemm_sp.cc).

src/op/region.h (1)

84-86: LGTM! Annotations parameter added consistently.

The constructor signature has been updated to accept an optional annotations map, maintaining backward compatibility through the default parameter value. This aligns with the broader refactoring to introduce annotation-based configuration across operators.

Note: The annotations are not yet stored or utilized in the constructor implementation (src/op/region.cc), which appears intentional as preparatory work.

src/op/finalize_reducer.h (1)

51-53: LGTM! Constructor signature updated consistently.

The annotations parameter has been added with a default value, preserving backward compatibility. This change aligns with the annotation-driven configuration pattern being introduced across multiple operators in this PR.

src/op/gemm_py.cc (1)

53-53: LGTM! Annotations parameter added for future use.

The constructor signature has been extended to accept an annotations map with a default empty value. While the parameter is not yet utilized in the constructor body, this is consistent with the preparatory refactoring pattern across all Gemm-family operators (Gemm, GemmPy, GemmSP, GemmSPPy) in this PR.

src/op/gemm.h (1)

180-181: LGTM! Annotations parameter added consistently.

The constructor signature update is consistent with the broader refactoring to introduce annotation-driven configuration. The default parameter value maintains backward compatibility for existing callers.

src/op/gemm_py.h (1)

86-87: LGTM! Header updated to match implementation.

The constructor declaration has been updated to accept the annotations parameter, consistent with the implementation in src/op/gemm_py.cc and the broader annotation-driven refactoring across the codebase.

src/op/gemm.cc (1)

53-53: LGTM! Constructor signature updated consistently.

The annotations parameter has been added to the Gemm constructor, aligning with the broader refactoring across all Gemm-family operators. The parameter is not yet utilized, which is consistent with the preparatory nature of this change.

src/op/gemm_sp_py.cc (1)

52-52: LGTM! GemmSPPy constructor signature updated consistently.

The annotations parameter has been added, maintaining consistency with the other Gemm-family operators (Gemm, GemmPy, GemmSP). While not yet utilized in the constructor body, this prepares the API for future annotation-driven configuration.

src/transform/inject_pipeline.cc (1)

151-151: LGTM! Annotations correctly preserved during IR transformation.

The Call node reconstruction now includes call->annotations, which is essential to preserve annotation metadata when rewriting buffer accesses during pipeline injection. Without this change, annotations would be silently dropped during the transformation, potentially breaking downstream passes that rely on this metadata.

This change aligns with similar patterns elsewhere in the codebase (e.g., multi_version_buffer_rewriter.cc line 471).

src/transform/multi_version_buffer_rewriter.cc (1)

472-472: LGTM! Annotations preserved during buffer access rewriting.

The addition of call->annotations ensures that annotations are properly propagated through the Call reconstruction, consistent with similar changes in inject_pipeline.cc and the broader annotation-driven design in this PR.

src/transform/loop_partition.h (2)

29-29: LGTM! Include added for stmt types.

The inclusion of <tvm/tir/stmt.h> properly supports the new Stmt return type in the LowerParallelLoop function.


49-72: LGTM! Well-documented consolidated lowering API.

The new LowerParallelLoop function provides a clean, well-documented interface that consolidates partitioning, vectorization, and optional predicate wrapping. The documentation clearly explains each parameter and the behavior of the function, which will help maintainability.

src/op/finalize_reducer.cc (1)

34-35: LGTM! Constructor signature updated to accept annotations.

The addition of the annotations parameter aligns with the broader pattern across the codebase where all tile operators accept an annotations map. While not currently used within this constructor, this provides a consistent API for future extensibility.

tilelang/language/copy_op.py (2)

20-33: LGTM! Annotations parameter added with clear documentation.

The addition of the annotations parameter and its documentation clearly explains the precedence rules: values in the annotations dict take priority over individual arguments. This provides flexibility while maintaining backward compatibility.


95-107: Annotations dict correctly implements precedence and mapping.

The implementation properly prioritizes values from the annotations dict over individual arguments. The eviction_policy string-to-int mapping is consistent with c2d_im2col.

Minor observation: Line 101 only adds disable_tma when it's True (if disable_tma), meaning disable_tma=False won't be stored in annotations. This appears intentional (omit false flags), but ensure this aligns with downstream expectations.

src/op/operator.cc (1)

34-34: LGTM! Annotations propagated to operator builders.

The updated invocation now passes both call->args and call->annotations to the registered builder, consistent with the updated OpBuilderFunc signature and the TIR_REGISTER_TL_TILE_OP macro changes.

src/op/operator.h (2)

98-99: LGTM! OpBuilderFunc signature updated for annotations.

The type alias now requires builders to accept both Array<PrimExpr> and Map<String, ObjectRef>, enabling annotation-driven configuration across all tile operators.


109-112: LGTM! Registration macro updated to forward annotations.

The updated lambda signature and Entry invocation ensure that all registered tile operators receive annotations, consistent with the updated OpBuilderFunc type alias.

src/op/reduce.cc (2)

31-31: LGTM! ReduceOp constructor accepts annotations.

The updated constructor signature aligns with the annotation-driven pattern adopted across the codebase. While the annotations parameter is not currently used, it provides API consistency and future extensibility.


497-497: LGTM! CumSumOp constructor accepts annotations.

Consistent with the broader pattern, the constructor now accepts an annotations map for future extensibility and API uniformity.

src/transform/loop_partition.cc (2)

32-32: LGTM! Include added for vectorization support.

The inclusion of loop_vectorize.h properly supports the VectorizeLoop call in the new LowerParallelLoop function.


273-297: LGTM! LowerParallelLoop consolidates partitioning, vectorization, and predicate wrapping.

The implementation correctly:

  1. Clones the analyzer to prevent binding conflicts between partitioning and vectorization
  2. Conditionally applies partitioning and vectorization based on the boolean flags
  3. Wraps with predicate only when provided and when this is a parallel loop

The conditional predicate wrapping (parallel_loop must be true) aligns with the design where predicates are specifically for parallel loop bound checking from layout inference.

src/transform/lower_tile_op.cc (1)

300-313: Good: Call reconstruction now preserves annotations across address_of / layout rewrites

When rewriting address_of/BufferLoad calls you now reconstruct Call nodes with the original annotations and span, both in HandleAccessPtrAndOffset and in the ptx_ldmatrix path. This avoids silently dropping metadata attached upstream (e.g. layout, coalesced width, or other hints) during these low‑level rewrites and is important given the annotation‑driven design elsewhere in the PR. No issues here.

Also applies to: 381-383, 478-481

src/op/atomic_add.cc (1)

26-57: AtomicAdd’s move to annotations is coherent; ensure Python/front‑end side supplies matching keys

The changes here are internally consistent:

  • The AtomicAdd constructor now accepts Map<String, ObjectRef> annotations and copies it into AtomicAddNode::annotations, so configuration (TMA usage, memory order, coalesced width, etc.) is centralized rather than positional.
  • MakeSIMTLoop appends GetMemoryOrder() as a third argument to the elementwise atomic call and derives per‑For loop_annotations from the node‑level annotations when attr::kCoalescedWidth is present, which is exactly what AtomicAddInferLayout later reads.
  • Lower() switches from a struct field to GetUseTMA() for the TMA path, and AtomicAddInferLayout correctly overrides the planned vector width when a kCoalescedWidth IntImm is attached to the loop.

The only thing to watch is that all front‑ends (Python tilelang.ir.AtomicAdd, any tests, and existing serialized IR) are updated to populate annotations with the expected keys/types (e.g. use_tma as an IntImm/Bool, memory_order as IntImm, attr::kCoalescedWidth as IntImm), since the positional fields are no longer read here.

Also applies to: 240-297, 373-397, 463-497

src/op/copy.cc (2)

102-120: Copy operator constructor and SIMT loop now correctly drive coalesced-width via annotations

The Copy path is now fully annotation‑driven:

  • Copy::Copy takes an annotations map and stores it on CopyNode::annotations, while still deriving src/dst from the first two region arguments. That matches the registration (set_num_inputs(5)) where the last three positional args are now effectively configuration routed through annotations.
  • MakeSIMTLoop seeds each parallel For with a per‑loop annotations map containing attr::kCoalescedWidth when present on the node, so downstream layout/vectorization code can reason about coalesced width at the loop level instead of re‑parsing scalar args.

This is consistent with the AtomicAdd changes and keeps the SIMT lowering logic clean.

Also applies to: 280-325


349-357: TMA selection now respects per-op disable flag and eviction policy via annotations

The lower/dispatch logic looks coherent after the refactor:

  • Both InferLayout and Lower now compute copy_inst using disable_tma_lower || GetDisableTMA(), so a per‑op "disable_tma" annotation can override the pass‑level kDisableTMALower without changing global config.
  • In LowerBulkCopy and LowerBulkCopy1D, the hardcoded eviction_policy integer has been replaced with GetEvictionPolicy(), which reads "eviction_policy" from the annotations map and defaults to 0. This feeds directly into the TMA descriptor or tma_load/tma_store calls.
  • The rest of the TMA descriptor construction and constraints (stride alignment, inner_box_dim checks, swizzle handling) are unchanged.

Assuming the Python/front‑end side now encodes disable_tma/eviction_policy into annotations instead of positional args, behavior should remain backwards‑compatible while being more flexible.

Also applies to: 726-732, 1416-1454, 1520-1532

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/op/reduce.cc (2)

498-520: Unused annotations parameter in constructor.

Same issue as ReduceOp constructor: the annotations parameter is accepted but never used in the constructor body.

Ensure consistent handling of annotations across both ReduceOp and CumSumOp constructors. If annotations should be stored, both constructors need the same fix.


31-43: Add missing annotations field and assignment to ReduceOp and CumSumOp constructors.

The annotations parameter is accepted but never stored in either constructor. This is a critical bug because layout annotations are silently dropped, inconsistent with other operators like AtomicAddOp and CopyOp which properly assign node->annotations = annotations.

Required fixes:

  1. Add Map<String, ObjectRef> annotations; field to both ReduceOpNode and CumSumOpNode class definitions
  2. Add node->annotations = annotations; assignment in both constructors
  3. Update RegisterReflection() methods to expose the annotations field via .def_ro("annotations", ...)
🧹 Nitpick comments (1)
src/op/gemm_py.cc (1)

53-93: Annotations parameter is accepted but not stored in GemmPyNode.

The annotations parameter is declared in the constructor signature (line 53) and accepted with a default value in the header, but is never referenced in the constructor body (lines 54-93) and GemmPyNode has no field to store it.

This is inconsistent with other tile operators like CopyNode and AtomicAddNode, which both store annotations as a Map<String, ObjectRef> annotations field and expose it via reflection registration.

Either add annotation support to GemmPyNode (matching the pattern in Copy and AtomicAdd operators) or remove the parameter from the constructor signature if annotations are not supported.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab50c73 and 026aa56.

📒 Files selected for processing (7)
  • src/op/fill.cc
  • src/op/finalize_reducer.cc
  • src/op/gemm_py.cc
  • src/op/gemm_sp_py.h
  • src/op/parallel.cc
  • src/op/reduce.cc
  • src/transform/inject_pipeline.cc
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/transform/inject_pipeline.cc
  • src/op/parallel.cc
🧰 Additional context used
🧬 Code graph analysis (5)
src/op/fill.cc (2)
src/op/fill.h (1)
  • Fill (45-51)
tilelang/ir.py (1)
  • Fill (10-10)
src/op/reduce.cc (2)
src/op/reduce.h (2)
  • ReduceOp (124-132)
  • CumSumOp (164-172)
tilelang/ir.py (2)
  • ReduceOp (64-64)
  • CumSumOp (68-68)
src/op/finalize_reducer.cc (1)
tilelang/ir.py (1)
  • FinalizeReducerOp (56-56)
src/op/gemm_sp_py.h (1)
src/op/gemm_sp_py.cc (1)
  • GemmSPPy (52-87)
src/op/gemm_py.cc (1)
src/transform/lower_opaque_block.cc (4)
  • annotations (157-169)
  • annotations (158-158)
  • annotations (226-266)
  • annotations (227-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (3)
src/op/fill.cc (1)

61-61: LGTM: API standardization change.

The addition of the annotations parameter aligns with the PR's goal of standardizing operator constructors to use an annotations map API. While currently unused in the constructor body, this provides a consistent interface across operators and enables future extension without breaking compatibility.

src/op/finalize_reducer.cc (1)

34-35: LGTM: Consistent API standardization.

The constructor signature update mirrors the pattern in other operators (e.g., Fill, Copy) and supports the PR's shift to a centralized annotations map interface. The unused parameter is intentional and maintains API consistency across the operator hierarchy.

src/op/gemm_sp_py.h (1)

87-89: The annotations parameter is intentionally unused in GemmSPPy.

The parameter is accepted for API consistency with the operator registration macro (which requires all operators to accept annotations), but GemmSPPy doesn't currently need annotation support. Other operators like CopyOp demonstrate the pattern for storing and using annotations when needed (e.g., applying loop annotations during lowering). This is intentional design and not a concern.

@LeiWang1999
Copy link
Member Author

@regression-perf

@LeiWang1999
Copy link
Member Author

@regression-perf

@LeiWang1999 LeiWang1999 changed the title [Refactor] Introduce layout annotations ParallelOPNode [Refactor] Introduce layout annotations for ParallelOPNode and CopyNode Dec 27, 2025
@LeiWang1999
Copy link
Member Author

File Original Latency Current Latency Speedup
example_dequant_gemm_w4a8 0.003007 0.003488 0.862099
sparse_mla_fwd 0.407314 0.421108 0.967242
example_gqa_sink_bwd_bhsd_sliding_window 0.0158839 0.0160161 0.991742
example_gqa_decode 0.028382 0.028576 0.993211
example_gemm_intrinsics 0.022624 0.022752 0.994374
example_dequant_gemv_fp16xint4 0.00249877 0.00251264 0.99448
example_warp_specialize_gemm_copy_1_gemm_0 0.024511 0.024608 0.996058
example_gqa_bwd_wgmma_pipelined 0.0449792 0.0451423 0.996386
example_mha_fwd_varlen 0.0211101 0.0211802 0.996686
example_tilelang_gemm_fp8 0.010224 0.010256 0.99688
block_sparse_attn_tilelang 0.00632006 0.00633493 0.997652
example_blocksparse_gemm 0.0151903 0.0152188 0.998128
example_tilelang_gemm_splitk_vectorize_atomicadd 0.0248531 0.0248989 0.99816
example_tilelang_gemm_fp8_2xAcc 0.116096 0.116303 0.998216
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.00961588 0.00962984 0.99855
example_group_per_split_token_cast_to_fp8 0.00591659 0.005925 0.99858
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.00903653 0.00904798 0.998735
example_mha_bwd_bhsd 0.0146668 0.0146849 0.998766
example_tilelang_sparse_gqa_decode_varlen_indice 0.0101678 0.0101778 0.999017
example_linear_attn_bwd 0.0951538 0.0952352 0.999145
example_gqa_bwd 0.0310055 0.0310256 0.999352
example_convolution 0.723297 0.723621 0.999552
example_fusedmoe_tilelang 0.0909745 0.091015 0.999555
example_gqa_bwd_tma_reduce_varlen 0.0400811 0.0400988 0.999557
sparse_mla_bwd 0.164833 0.164881 0.999708
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.00893658 0.00893759 0.999887
fp8_lighting_indexer 0.0142249 0.014225 0.999998
example_dynamic 0.018304 0.018304 1
example_mha_sink_fwd_bhsd 0.0078428 0.00784146 1.00017
example_mha_bwd_bshd 0.0145895 0.0145866 1.0002
example_tilelang_sparse_gqa_decode_varlen_mask 0.0141937 0.0141861 1.00054
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.00958187 0.00957636 1.00058
example_per_token_cast_to_fp8 0.00424677 0.00424387 1.00068
example_tilelang_nsa_fwd 0.00432818 0.00432517 1.0007
example_tilelang_block_sparse_attn 0.00507881 0.00507452 1.00085
example_tilelang_gemm_splitk 0.0248611 0.0248371 1.00097
example_tilelang_nsa_decode 0.00402035 0.00401386 1.00162
example_gemm_schedule 0.0236373 0.023597 1.00171
example_linear_attn_fwd 0.022866 0.0228258 1.00176
example_mha_inference 0.0270114 0.0269633 1.00178
example_gemm_autotune 0.013952 0.01392 1.0023
example_elementwise_add 0.0132636 0.0132312 1.00245
example_mla_decode 0.285121 0.284385 1.00259
example_gemv 0.0405282 0.0404218 1.00263
example_warp_specialize_gemm_softpipe_stage2 0.024256 0.024191 1.00269
tilelang_example_sparse_tensorcore 0.0101507 0.0101228 1.00275
example_mha_bwd_bshd_wgmma_pipelined 0.0157925 0.0157406 1.00329
topk_selector 0.0343295 0.0342162 1.00331
example_warp_specialize_gemm_barrierpipe_stage2 0.025024 0.024929 1.00381
example_mha_sink_fwd_bhsd_sliding_window 0.00796167 0.00792981 1.00402
example_convolution_autotune 0.584667 0.582279 1.0041
example_dequant_groupedgemm_bf16_mxfp4_hopper 0.0121527 0.0120969 1.00462
example_mha_sink_bwd_bhsd_sliding_window 0.0204322 0.0202998 1.00652
example_gqa_sink_bwd_bhsd 0.0260445 0.0258579 1.00722
example_gemm 0.015296 0.015168 1.00844
sparse_mla_fwd_pipelined 0.0686315 0.068032 1.00881
example_warp_specialize_gemm_copy_0_gemm_1 0.025088 0.024736 1.01423
example_mha_sink_bwd_bhsd 0.0346578 0.0341067 1.01616
example_vertical_slash_sparse_attn 0.127254 0.125149 1.01682
example_dequant_gemm_fp4_hopper 0.007488 0.00736 1.01739
example_dequant_gemm_bf16_fp4_hopper 0.00928 0.009088 1.02113
example_dequant_gemm_bf16_mxfp4_hopper 0.008064 0.007872 1.02439
example_topk 0.006944 0.006687 1.03843
example_tilelang_gemm_fp8_intrinsic 0.001616 0.001536 1.05208

Local regressions test can pass, and I double checked the w4a8 case and it's just noise, skip

@LeiWang1999 LeiWang1999 merged commit b6ace13 into tile-ai:main Dec 28, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant