-
Notifications
You must be signed in to change notification settings - Fork 359
[Refactor] Introduce layout annotations for ParallelOPNode and CopyNode
#1539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Introduced new annotations for parallel loop layouts and predicates in layout.h to improve layout management. * Refactored loop lowering logic in copy.cc to utilize a new LowerParallelLoop function, consolidating partitioning and vectorization steps. * Updated layout inference to store layout and predicate annotations on For nodes, ensuring proper handling during subsequent transformations. * Added checks in lower_tile_op.cc to enforce layout annotations for parallel loops, enhancing error handling and clarity. This update aims to streamline parallel loop processing and improve the overall efficiency of layout inference in the codebase.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughLayout inference now annotates parallel For nodes with layout and optional predicate; lowering defers partitioning/vectorization to a new LowerParallelLoop helper. Many ops switched to an annotations map API; Call nodes and several transforms now preserve/pass annotations. Changes
Sequence Diagram(s)sequenceDiagram
participant LI as LayoutInferencer
participant For as ForNode (annotated)
participant LT as LowerTileOpPass
participant LP as LowerParallelLoop
participant Part as PartitionLoop
participant Vec as VectorizeLoop
participant ITE as IfThenElse
Note over LI,For: Phase 1 — Inference (no structural transform)
LI->>For: Infer layout & optional predicate
LI->>For: Set annotations kParallelLoopLayout / kParallelLoopPredicate
Note over LT,LP: Phase 2 — Lowering (deferred)
LT->>For: Detect ForKind::kParallel
LT->>For: Read kParallelLoopLayout (+ predicate)
LT->>LP: Call LowerParallelLoop(loop, layout, thread_var, analyzer, predicate, parallel_loop=true, should_vectorize=...)
Note over LP,Part: LowerParallelLoop steps
LP->>Part: PartitionLoop(loop, layout) alt parallel_loop true
Part-->>LP: partitioned loop
LP->>Vec: VectorizeLoop(partitioned_loop, cloned_analyzer) alt should_vectorize true
Vec-->>LP: vectorized loop
alt predicate present
LP->>ITE: Wrap vectorized loop with IfThenElse(predicate)
ITE-->>LP: guarded loop
end
LP-->>LT: lowered loop replaces original For
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…tions * Added support for annotations in AtomicAddNode and CopyNode to enhance flexibility in memory operations. * Updated constructors to parse annotations from input arguments, allowing for coalesced width, TMA usage, and memory order to be specified via an annotations map. * Refactored related logic in atomic_add.cc and copy.cc to utilize the new annotations, improving clarity and maintainability. * Enhanced Python bindings to support annotations in atomic and copy operations, ensuring consistency across the API. This update aims to streamline the handling of memory operation parameters and improve the overall usability of the API.
* Updated constructors for various operators (AtomicAdd, Copy, Fill, Gemm, etc.) to accept an annotations map, improving flexibility in handling additional parameters. * Refactored related logic in operator implementations to utilize the new annotations, ensuring consistency across the API. * Enhanced Python bindings to support annotations in operator calls, streamlining the interface for users. This update aims to improve the usability and extensibility of operator functionalities in the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tilelang/language/tir/op.py (1)
120-145: Update docstring to document theannotationsparameterThe function forwards
annotationsto_tvm_op.call_intrin, but the docstring omits it from the Parameters section. Addannotationswith a brief description of its expected structure (e.g., optional dict or similar type) to the "Parameters" section so users don't need to infer the parameter type from other files. The forwarding via keyword argument maintains backward compatibility with existing callers.
🧹 Nitpick comments (13)
src/op/reduce.h (2)
128-130: Unused annotations parameter.The
annotationsparameter is accepted but not stored inReduceOpNode. Based on the relevant code snippet fromreduce.cc(lines 30-42), the constructor doesn't use or store the annotations. If annotations are not yet needed forReduceOp, consider documenting this with a comment or storing the parameter for future use to maintain consistency withAtomicAddNodewhich stores annotations.
168-170: Unused annotations parameter.Similar to
ReduceOp, theannotationsparameter is accepted but not stored inCumSumOpNode. The constructor implementation inreduce.cc(lines 496-518) doesn't use the parameter. Consider storing annotations in the node for consistency with the broader refactoring pattern, or document why it's deferred.src/op/fill.h (1)
48-49: Unused annotations parameter.The
annotationsparameter is accepted but not utilized in the constructor implementation (seefill.cclines 61-95). While this maintains API consistency with other operators, consider either storing annotations inFillNodeor documenting that annotation support is planned for future implementation.src/op/gemm_sp.cc (1)
86-113: Unused annotations parameter.The
annotationsparameter is accepted but not stored or used within the constructor body. This aligns with the broader API refactoring pattern, but consider storing it inGemmSPNodefor consistency withAtomicAddNodeor documenting that annotation support is deferred.src/op/gemm_sp_py.h (1)
87-89: Unused annotations parameter.Based on the relevant code snippet from
gemm_sp_py.cc(lines 51-86), theannotationsparameter is accepted but not utilized. Consider storing annotations inGemmSPPyNodefor future use or documenting the deferral.src/op/fill.cc (1)
61-94: Unused annotations parameter.The
annotationsparameter is accepted but never stored in theFillNodeor used within the constructor. While this maintains forward compatibility with the annotation-based API, consider storing it in the node to avoid needing another API change later, or add a comment explaining why annotations are deferred for Fill operations.src/op/region.cc (1)
21-53: Unused annotations parameter.The
annotationsparameter is accepted but not stored inRegionOpNodeor used within the constructor. Given that RegionOp is described as a "bridge to carry BufferRegion via Call args," consider whether annotations would be useful for this operator or document why they're deferred.src/transform/lower_tile_op.cc (1)
672-827: Parallel For lowering logic is reasonable; consider clarifying thread binding and analysis passesThe new
VisitStmt_(const ForNode* op)cleanly defers actual partitioning/vectorization toLowerParallelLoopwhile making decisions based on reducer info, local vs non‑local buffers, and CUDA‑vectorizable casts. A few points to consider:
- The decision to set
parallel_loop = !local_register_only && !store_into_localmeans patterns likeA_local[i] = A_global[i]will skip thread partitioning even though they touch global memory. If that’s intentional (per‑thread register caches), it’s worth calling out explicitly in comments/tests; otherwise, you might want a stricter condition (e.g. require that all uses are local before disabling partitioning).has_cast_operationsis gated onTargetIsCuda(Target::Current()). Since this pass already has atarget_field taken from the PrimFunc, using that instead ofTarget::Current()would avoid relying on ambient global state and make behavior more predictable under nested pass pipelines.- The body is traversed three times with
PostOrderVisit(for local/non‑local, reducers, and casts). If this ever shows up hot in profiles, you could fold those checks into a single visitor overfor_node->body, but that’s purely a perf/readability refactor.Overall, the structure and the way you pass
parallel_loopandshould_vectorizeintoLowerParallelLooplook consistent with the new annotation‑driven pipeline.src/transform/parallel_loop_layout_validator.h (1)
18-135: Parallel loop layout validator matches the new annotation model; be aware it only counts consecutive For chainsThe validator nicely encodes the intended contract:
- Every
ForKind::kParallelloop must either be the annotated outermost loop of a consecutive nest or be structurally inside such a nest.InputDimmust equal the number of directly nested parallelFornodes.- Inner loops are enforced to be annotation‑free.
Two minor points:
CountNestedParallelLoopsonly followsForNode::body.as<ForNode>(), so “nested” here means literallyfor → for → …without an interveningBlock/If/AttrStmt. If you ever want to support patterns likefor { block { for(...) } }, you’ll need a more general notion of the nested parallel cluster.- The fatal on any parallel loop missing layout is quite strict; that’s appropriate if
LayoutInferenceis guaranteed to run before any consumer, but it might be worth documenting the expected pass ordering and invariants to avoid surprises when composing with other TIR passes.Functionally the implementation looks sound.
src/transform/layout_inference.cc (2)
74-245: Alias-aware layout propagation in RunInferStep is a solid robustness improvementThe refactored
RunInferStepdoes a good job of tightening invariants:
- It adds defensive checks on
cur_infer_id,infer_list_,thread_var_vec_, and constant extents before callingInferLayout, which should make layout‑debugging failures much clearer.- The
propagate_aliaslambda that walksbuffer_data_to_buffers_and reshapes layouts onto alias buffers (sharing the samedataVar) prevents inconsistent layouts when multipleBufferviews point at the same storage. TheIsEqual/ICHECK on pre‑existing entries is also valuable to catch silent divergence.- Enqueueing dependent ops via
EnqueueWithPriorityfor alias siblings means the BFS ordering still respects when enough layouts exist to make inference meaningful.Net effect: fewer “mysterious” layout conflicts when reshapes or aliasing buffers are present. The added checks and warnings around
use_list_look well‑thought‑out.
1158-1176: LayoutInferencer now pushes loop layout/predicate into annotations and delegates lowering; validate pass wiring looks correctThe new flow in this file hangs together:
LayoutInferencer::Substitutestill runsParallelLoopFuser, collects operators, and computesLayoutInferenceResult, butLayoutInferenceritself now only annotatesFornodes withattr::kParallelLoopLayoutand (optionally)attr::kParallelLoopPredicateinstead of performingPartitionLoop/VectorizeLoop. That’s exactly what the newLowerParallelLoop/LowerTileOppipeline expects.- In
VisitStmt_(const ForNode* op)you first recurse, then look upresult_.for_mapby the originalForhandle and attach the fragment plus a simplified predicate when present. Skipping storage of trivially‑true predicates is a nice touch.LayoutInference()now callsParallelLoopTransformer::Substitute, choosesskip_thread_partitionbased on whether any thread bindings exist, runsLayoutInferencer::Substitute, and finally enforces invariants usingParallelLoopLayoutValidator::Validate(f->body). That gives a clear separation: inference pass = annotation, subsequent passes = consumers.One small nit: in
VisitStmt_(const BlockNode* op)the scope check is against"local.framgent"(typo). If you were relying on that ICHECK to catch missing fragment layouts, it currently never fires; consider correcting it to"local.fragment"or switching toIsFragmentBuffer(buffer)for consistency with the rest of the code.Also applies to: 1221-1246, 1259-1273
src/op/copy.h (1)
118-155: Annotations map on CopyNode is consistent with the new design; consider either using or omitting annotations for Conv2DIm2ColOpThe
CopyNodechanges look good:
annotationsconsolidates coalesced width, TMA disable, and eviction policy into a single map, and the reflection exposure lets Python/IR tooling introspect it.GetDisableTMA()/GetEvictionPolicy()are simple and robust, defaulting to “TMA enabled” andevict_normalwhen the keys are absent.On the constructors:
Copy(Array<PrimExpr>, Map<String,ObjectRef>)aligns withAtomicAddand is used incopy.ccto seednode->annotations.Conv2DIm2ColOp(Array<PrimExpr> args, Map<String,ObjectRef> annotations = …)currently doesn’t use itsannotationsargument in the implementation; if you don’t plan to make Conv2DIm2ColOp annotation‑driven, you might want to drop the parameter to avoid confusion, or at least leave a brief comment explaining that annotations are intentionally ignored for now.Otherwise this header matches the implementation in
copy.cc.Also applies to: 342-349, 412-417
src/op/copy.cc (1)
1562-1580: Conv2DIm2ColOp constructor signature now accepts annotations but does not use them
Conv2DIm2ColOp::Conv2DIm2ColOp(Array<PrimExpr> args, Map<String,ObjectRef> annotations)adds the annotations parameter but still parses all configuration (includingeviction_policy_) positionally fromargs[2..8]and ignoresannotations. That’s fine if Conv2DIm2ColOp isn’t yet migrated to the annotation model, but it is a bit inconsistent withCopyandAtomicAdd.If you don’t plan to use annotations here in the near term, consider either:
- Dropping the
annotationsargument from the signature, or- At least consuming known keys (e.g.
"eviction_policy") fromannotationsand asserting consistency withargs[8], to avoid silent mismatches.Right now it behaves like the old API with an unused extra parameter.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
3rdparty/tvmsrc/op/atomic_add.ccsrc/op/atomic_add.hsrc/op/copy.ccsrc/op/copy.hsrc/op/fill.ccsrc/op/fill.hsrc/op/finalize_reducer.ccsrc/op/finalize_reducer.hsrc/op/gemm.ccsrc/op/gemm.hsrc/op/gemm_py.ccsrc/op/gemm_py.hsrc/op/gemm_sp.ccsrc/op/gemm_sp.hsrc/op/gemm_sp_py.ccsrc/op/gemm_sp_py.hsrc/op/operator.ccsrc/op/operator.hsrc/op/reduce.ccsrc/op/reduce.hsrc/op/region.ccsrc/op/region.hsrc/transform/inject_pipeline.ccsrc/transform/layout_inference.ccsrc/transform/loop_partition.ccsrc/transform/loop_partition.hsrc/transform/lower_tile_op.ccsrc/transform/multi_version_buffer_rewriter.ccsrc/transform/parallel_loop_layout_validator.htilelang/language/atomic.pytilelang/language/copy_op.pytilelang/language/tir/op.py
✅ Files skipped from review due to trivial changes (1)
- 3rdparty/tvm
🚧 Files skipped from review as they are similar to previous changes (1)
- tilelang/language/atomic.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/inject_pipeline.cc
🧬 Code graph analysis (23)
src/op/gemm_sp_py.cc (1)
src/op/gemm_sp_py.h (1)
GemmSPPy(83-91)
src/op/gemm_sp_py.h (1)
src/op/gemm_sp_py.cc (1)
GemmSPPy(52-87)
src/op/gemm_sp.cc (1)
tilelang/ir.py (1)
GemmSP(52-52)
src/op/gemm_py.h (2)
src/op/gemm_py.cc (1)
GemmPy(53-93)src/transform/lower_opaque_block.cc (4)
annotations(157-169)annotations(158-158)annotations(226-266)annotations(227-230)
src/op/fill.cc (2)
src/op/fill.h (1)
Fill(45-51)tilelang/ir.py (1)
Fill(10-10)
src/op/finalize_reducer.cc (1)
tilelang/ir.py (1)
FinalizeReducerOp(56-56)
src/op/gemm_py.cc (2)
tilelang/tileop/gemm/__init__.py (1)
GemmPy(56-192)src/transform/lower_opaque_block.cc (4)
annotations(157-169)annotations(158-158)annotations(226-266)annotations(227-230)
src/transform/multi_version_buffer_rewriter.cc (1)
src/transform/inject_pipeline.cc (2)
call(122-152)call(122-123)
src/op/gemm_sp.h (1)
src/op/gemm_sp.cc (1)
GemmSP(86-113)
src/op/operator.cc (2)
src/transform/layout_inference.cc (16)
op(45-51)op(45-45)op(483-550)op(483-483)op(616-706)op(616-616)op(708-763)op(708-708)op(765-774)op(765-765)op(776-781)op(776-776)op(800-828)op(800-800)op(830-858)op(830-830)src/transform/inject_pipeline.cc (2)
call(122-152)call(122-123)
src/transform/inject_pipeline.cc (1)
src/transform/multi_version_buffer_rewriter.cc (2)
call(443-473)call(443-444)
src/op/gemm.h (2)
src/op/gemm.cc (1)
Gemm(53-95)tilelang/ir.py (1)
Gemm(48-48)
src/op/atomic_add.cc (2)
tilelang/ir.py (1)
AtomicAdd(14-14)src/transform/atomicadd_vectorize.cc (4)
node(186-211)node(186-186)node(213-291)node(213-213)
src/op/finalize_reducer.h (2)
src/op/finalize_reducer.cc (1)
FinalizeReducerOp(34-44)tilelang/ir.py (1)
FinalizeReducerOp(56-56)
src/op/copy.cc (2)
tilelang/ir.py (2)
Copy(18-18)Conv2DIm2ColOp(22-22)src/op/copy.h (1)
GetDisableTMA(140-156)
src/op/reduce.cc (3)
src/op/reduce.h (2)
ReduceOp(124-132)CumSumOp(164-172)tilelang/ir.py (2)
ReduceOp(64-64)CumSumOp(68-68)src/transform/lower_opaque_block.cc (4)
annotations(157-169)annotations(158-158)annotations(226-266)annotations(227-230)
src/op/reduce.h (2)
src/op/reduce.cc (2)
ReduceOp(31-43)CumSumOp(497-519)tilelang/ir.py (2)
ReduceOp(64-64)CumSumOp(68-68)
tilelang/language/copy_op.py (1)
tilelang/language/tir/op.py (1)
call_intrin(120-145)
src/op/gemm.cc (2)
src/op/gemm.h (1)
Gemm(177-183)tilelang/ir.py (1)
Gemm(48-48)
src/transform/loop_partition.h (1)
src/transform/loop_partition.cc (2)
LowerParallelLoop(273-297)LowerParallelLoop(273-275)
src/op/copy.h (2)
src/op/copy.cc (2)
Copy(106-120)Conv2DIm2ColOp(1564-1580)tilelang/ir.py (2)
Copy(18-18)Conv2DIm2ColOp(22-22)
src/op/region.h (1)
src/op/region.cc (1)
RegionOp(21-53)
src/op/atomic_add.h (1)
src/op/atomic_add.cc (1)
AtomicAdd(44-58)
🔇 Additional comments (28)
src/op/atomic_add.h (2)
49-66: Well-structured annotation accessors.The helper methods provide clean access to annotation values with sensible defaults. The implementation correctly checks for the presence of keys and handles type casting safely.
23-27: Good documentation of annotation keys.The inline comments clearly document the supported annotation keys and their expected types. Consider adding value constraints or examples where applicable (e.g., valid memory_order values).
src/op/gemm_sp.h (1)
112-113: Constructor signature updated consistently.The declaration matches the implementation in
gemm_sp.cc. Note that theannotationsparameter is currently unused (see comments ongemm_sp.cc).src/op/region.h (1)
84-86: LGTM! Annotations parameter added consistently.The constructor signature has been updated to accept an optional annotations map, maintaining backward compatibility through the default parameter value. This aligns with the broader refactoring to introduce annotation-based configuration across operators.
Note: The annotations are not yet stored or utilized in the constructor implementation (src/op/region.cc), which appears intentional as preparatory work.
src/op/finalize_reducer.h (1)
51-53: LGTM! Constructor signature updated consistently.The annotations parameter has been added with a default value, preserving backward compatibility. This change aligns with the annotation-driven configuration pattern being introduced across multiple operators in this PR.
src/op/gemm_py.cc (1)
53-53: LGTM! Annotations parameter added for future use.The constructor signature has been extended to accept an annotations map with a default empty value. While the parameter is not yet utilized in the constructor body, this is consistent with the preparatory refactoring pattern across all Gemm-family operators (Gemm, GemmPy, GemmSP, GemmSPPy) in this PR.
src/op/gemm.h (1)
180-181: LGTM! Annotations parameter added consistently.The constructor signature update is consistent with the broader refactoring to introduce annotation-driven configuration. The default parameter value maintains backward compatibility for existing callers.
src/op/gemm_py.h (1)
86-87: LGTM! Header updated to match implementation.The constructor declaration has been updated to accept the annotations parameter, consistent with the implementation in src/op/gemm_py.cc and the broader annotation-driven refactoring across the codebase.
src/op/gemm.cc (1)
53-53: LGTM! Constructor signature updated consistently.The annotations parameter has been added to the Gemm constructor, aligning with the broader refactoring across all Gemm-family operators. The parameter is not yet utilized, which is consistent with the preparatory nature of this change.
src/op/gemm_sp_py.cc (1)
52-52: LGTM! GemmSPPy constructor signature updated consistently.The annotations parameter has been added, maintaining consistency with the other Gemm-family operators (Gemm, GemmPy, GemmSP). While not yet utilized in the constructor body, this prepares the API for future annotation-driven configuration.
src/transform/inject_pipeline.cc (1)
151-151: LGTM! Annotations correctly preserved during IR transformation.The Call node reconstruction now includes
call->annotations, which is essential to preserve annotation metadata when rewriting buffer accesses during pipeline injection. Without this change, annotations would be silently dropped during the transformation, potentially breaking downstream passes that rely on this metadata.This change aligns with similar patterns elsewhere in the codebase (e.g., multi_version_buffer_rewriter.cc line 471).
src/transform/multi_version_buffer_rewriter.cc (1)
472-472: LGTM! Annotations preserved during buffer access rewriting.The addition of
call->annotationsensures that annotations are properly propagated through the Call reconstruction, consistent with similar changes in inject_pipeline.cc and the broader annotation-driven design in this PR.src/transform/loop_partition.h (2)
29-29: LGTM! Include added for stmt types.The inclusion of
<tvm/tir/stmt.h>properly supports the newStmtreturn type in theLowerParallelLoopfunction.
49-72: LGTM! Well-documented consolidated lowering API.The new
LowerParallelLoopfunction provides a clean, well-documented interface that consolidates partitioning, vectorization, and optional predicate wrapping. The documentation clearly explains each parameter and the behavior of the function, which will help maintainability.src/op/finalize_reducer.cc (1)
34-35: LGTM! Constructor signature updated to accept annotations.The addition of the
annotationsparameter aligns with the broader pattern across the codebase where all tile operators accept an annotations map. While not currently used within this constructor, this provides a consistent API for future extensibility.tilelang/language/copy_op.py (2)
20-33: LGTM! Annotations parameter added with clear documentation.The addition of the
annotationsparameter and its documentation clearly explains the precedence rules: values in theannotationsdict take priority over individual arguments. This provides flexibility while maintaining backward compatibility.
95-107: Annotations dict correctly implements precedence and mapping.The implementation properly prioritizes values from the
annotationsdict over individual arguments. Theeviction_policystring-to-int mapping is consistent withc2d_im2col.Minor observation: Line 101 only adds
disable_tmawhen it'sTrue(if disable_tma), meaningdisable_tma=Falsewon't be stored in annotations. This appears intentional (omit false flags), but ensure this aligns with downstream expectations.src/op/operator.cc (1)
34-34: LGTM! Annotations propagated to operator builders.The updated invocation now passes both
call->argsandcall->annotationsto the registered builder, consistent with the updatedOpBuilderFuncsignature and the TIR_REGISTER_TL_TILE_OP macro changes.src/op/operator.h (2)
98-99: LGTM! OpBuilderFunc signature updated for annotations.The type alias now requires builders to accept both
Array<PrimExpr>andMap<String, ObjectRef>, enabling annotation-driven configuration across all tile operators.
109-112: LGTM! Registration macro updated to forward annotations.The updated lambda signature and
Entryinvocation ensure that all registered tile operators receive annotations, consistent with the updatedOpBuilderFunctype alias.src/op/reduce.cc (2)
31-31: LGTM! ReduceOp constructor accepts annotations.The updated constructor signature aligns with the annotation-driven pattern adopted across the codebase. While the annotations parameter is not currently used, it provides API consistency and future extensibility.
497-497: LGTM! CumSumOp constructor accepts annotations.Consistent with the broader pattern, the constructor now accepts an annotations map for future extensibility and API uniformity.
src/transform/loop_partition.cc (2)
32-32: LGTM! Include added for vectorization support.The inclusion of
loop_vectorize.hproperly supports theVectorizeLoopcall in the newLowerParallelLoopfunction.
273-297: LGTM! LowerParallelLoop consolidates partitioning, vectorization, and predicate wrapping.The implementation correctly:
- Clones the analyzer to prevent binding conflicts between partitioning and vectorization
- Conditionally applies partitioning and vectorization based on the boolean flags
- Wraps with predicate only when provided and when this is a parallel loop
The conditional predicate wrapping (
parallel_loopmust be true) aligns with the design where predicates are specifically for parallel loop bound checking from layout inference.src/transform/lower_tile_op.cc (1)
300-313: Good: Call reconstruction now preserves annotations across address_of / layout rewritesWhen rewriting
address_of/BufferLoadcalls you now reconstructCallnodes with the originalannotationsandspan, both inHandleAccessPtrAndOffsetand in theptx_ldmatrixpath. This avoids silently dropping metadata attached upstream (e.g. layout, coalesced width, or other hints) during these low‑level rewrites and is important given the annotation‑driven design elsewhere in the PR. No issues here.Also applies to: 381-383, 478-481
src/op/atomic_add.cc (1)
26-57: AtomicAdd’s move to annotations is coherent; ensure Python/front‑end side supplies matching keysThe changes here are internally consistent:
- The
AtomicAddconstructor now acceptsMap<String, ObjectRef> annotationsand copies it intoAtomicAddNode::annotations, so configuration (TMA usage, memory order, coalesced width, etc.) is centralized rather than positional.MakeSIMTLoopappendsGetMemoryOrder()as a third argument to the elementwise atomic call and derives per‑Forloop_annotationsfrom the node‑levelannotationswhenattr::kCoalescedWidthis present, which is exactly whatAtomicAddInferLayoutlater reads.Lower()switches from a struct field toGetUseTMA()for the TMA path, andAtomicAddInferLayoutcorrectly overrides the planned vector width when akCoalescedWidthIntImm is attached to the loop.The only thing to watch is that all front‑ends (Python
tilelang.ir.AtomicAdd, any tests, and existing serialized IR) are updated to populateannotationswith the expected keys/types (e.g.use_tmaas an IntImm/Bool,memory_orderas IntImm,attr::kCoalescedWidthas IntImm), since the positional fields are no longer read here.Also applies to: 240-297, 373-397, 463-497
src/op/copy.cc (2)
102-120: Copy operator constructor and SIMT loop now correctly drive coalesced-width via annotationsThe Copy path is now fully annotation‑driven:
Copy::Copytakes anannotationsmap and stores it onCopyNode::annotations, while still derivingsrc/dstfrom the first two region arguments. That matches the registration (set_num_inputs(5)) where the last three positional args are now effectively configuration routed through annotations.MakeSIMTLoopseeds each parallelForwith a per‑loopannotationsmap containingattr::kCoalescedWidthwhen present on the node, so downstream layout/vectorization code can reason about coalesced width at the loop level instead of re‑parsing scalar args.This is consistent with the AtomicAdd changes and keeps the SIMT lowering logic clean.
Also applies to: 280-325
349-357: TMA selection now respects per-op disable flag and eviction policy via annotationsThe lower/dispatch logic looks coherent after the refactor:
- Both
InferLayoutandLowernow computecopy_instusingdisable_tma_lower || GetDisableTMA(), so a per‑op"disable_tma"annotation can override the pass‑levelkDisableTMALowerwithout changing global config.- In
LowerBulkCopyandLowerBulkCopy1D, the hardcodedeviction_policyinteger has been replaced withGetEvictionPolicy(), which reads"eviction_policy"from the annotations map and defaults to 0. This feeds directly into the TMA descriptor or tma_load/tma_store calls.- The rest of the TMA descriptor construction and constraints (stride alignment, inner_box_dim checks, swizzle handling) are unchanged.
Assuming the Python/front‑end side now encodes
disable_tma/eviction_policyintoannotationsinstead of positional args, behavior should remain backwards‑compatible while being more flexible.Also applies to: 726-732, 1416-1454, 1520-1532
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/op/reduce.cc (2)
498-520: Unusedannotationsparameter in constructor.Same issue as
ReduceOpconstructor: theannotationsparameter is accepted but never used in the constructor body.Ensure consistent handling of annotations across both
ReduceOpandCumSumOpconstructors. If annotations should be stored, both constructors need the same fix.
31-43: Add missingannotationsfield and assignment to ReduceOp and CumSumOp constructors.The
annotationsparameter is accepted but never stored in either constructor. This is a critical bug because layout annotations are silently dropped, inconsistent with other operators likeAtomicAddOpandCopyOpwhich properly assignnode->annotations = annotations.Required fixes:
- Add
Map<String, ObjectRef> annotations;field to bothReduceOpNodeandCumSumOpNodeclass definitions- Add
node->annotations = annotations;assignment in both constructors- Update
RegisterReflection()methods to expose the annotations field via.def_ro("annotations", ...)
🧹 Nitpick comments (1)
src/op/gemm_py.cc (1)
53-93: Annotations parameter is accepted but not stored in GemmPyNode.The
annotationsparameter is declared in the constructor signature (line 53) and accepted with a default value in the header, but is never referenced in the constructor body (lines 54-93) and GemmPyNode has no field to store it.This is inconsistent with other tile operators like
CopyNodeandAtomicAddNode, which both store annotations as aMap<String, ObjectRef> annotationsfield and expose it via reflection registration.Either add annotation support to GemmPyNode (matching the pattern in Copy and AtomicAdd operators) or remove the parameter from the constructor signature if annotations are not supported.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
src/op/fill.ccsrc/op/finalize_reducer.ccsrc/op/gemm_py.ccsrc/op/gemm_sp_py.hsrc/op/parallel.ccsrc/op/reduce.ccsrc/transform/inject_pipeline.cc
🚧 Files skipped from review as they are similar to previous changes (2)
- src/transform/inject_pipeline.cc
- src/op/parallel.cc
🧰 Additional context used
🧬 Code graph analysis (5)
src/op/fill.cc (2)
src/op/fill.h (1)
Fill(45-51)tilelang/ir.py (1)
Fill(10-10)
src/op/reduce.cc (2)
src/op/reduce.h (2)
ReduceOp(124-132)CumSumOp(164-172)tilelang/ir.py (2)
ReduceOp(64-64)CumSumOp(68-68)
src/op/finalize_reducer.cc (1)
tilelang/ir.py (1)
FinalizeReducerOp(56-56)
src/op/gemm_sp_py.h (1)
src/op/gemm_sp_py.cc (1)
GemmSPPy(52-87)
src/op/gemm_py.cc (1)
src/transform/lower_opaque_block.cc (4)
annotations(157-169)annotations(158-158)annotations(226-266)annotations(227-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (3)
src/op/fill.cc (1)
61-61: LGTM: API standardization change.The addition of the
annotationsparameter aligns with the PR's goal of standardizing operator constructors to use an annotations map API. While currently unused in the constructor body, this provides a consistent interface across operators and enables future extension without breaking compatibility.src/op/finalize_reducer.cc (1)
34-35: LGTM: Consistent API standardization.The constructor signature update mirrors the pattern in other operators (e.g., Fill, Copy) and supports the PR's shift to a centralized annotations map interface. The unused parameter is intentional and maintains API consistency across the operator hierarchy.
src/op/gemm_sp_py.h (1)
87-89: The annotations parameter is intentionally unused in GemmSPPy.The parameter is accepted for API consistency with the operator registration macro (which requires all operators to accept annotations), but GemmSPPy doesn't currently need annotation support. Other operators like CopyOp demonstrate the pattern for storing and using annotations when needed (e.g., applying loop annotations during lowering). This is intentional design and not a concern.
|
@regression-perf |
|
@regression-perf |
ParallelOPNodeParallelOPNode and CopyNode
Local regressions test can pass, and I double checked the w4a8 case and it's just noise, skip |
This update aims to streamline parallel loop processing and improve the overall efficiency of layout inference in the codebase.
Summary by CodeRabbit
Refactor
New Features
✏️ Tip: You can customize this high-level summary in your review settings.