[Feat] PDL Support #1494
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

This PR introduces CUDA PDL (Programmatic Dependent Launch) synchronization support to TileLang. It adds the public API functions `pdl_trigger()` and `pdl_sync()`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Kernel as Kernel<br/>(TileLang)
    participant Compiler as TVM Compiler<br/>Pipeline
    participant MarkPass as MarkCudaSyncCalls<br/>Pass
    participant Codegen as CUDA<br/>Codegen
    participant NVRTC as NVRTC<br/>Compiler
    participant Runtime as CUDA<br/>Runtime
    User->>Kernel: Define kernel with<br/>pdl_trigger()/pdl_sync()
    Kernel->>Compiler: Lower to TVM IR<br/>(extern calls)
    Compiler->>MarkPass: OptimizeForTarget<br/>phase
    MarkPass->>MarkPass: Detect extern calls<br/>by name
    MarkPass->>Compiler: Annotate with<br/>kHasTriggerLaunch/<br/>kHasGridSync
    Compiler->>Codegen: Code generation<br/>phase
    alt Has kHasGridSync
        Codegen->>Codegen: Suppress __restrict__<br/>qualifiers
        Codegen->>Codegen: Validate no __ldg__<br/>usage
    end
    Codegen->>NVRTC: CUDA kernel source<br/>(with extern calls)
    NVRTC->>NVRTC: Inject PDL header<br/>(if needed)
    NVRTC->>NVRTC: Compile to PTX
    User->>Runtime: Launch kernel
    Runtime->>Runtime: Check pdl_sync_map<br/>for function
    alt Function in PDL map
        Runtime->>Runtime: Use programmatic<br/>launch attributes
    end
    Runtime->>Runtime: Execute kernel<br/>with PDL support
    Runtime-->>User: Results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review details

Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)
151-153: Consider truncating `lib_code` in the error message. Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting `self.lib_code` from the error, or logging it separately at debug level.

🔎 Proposed improvement
```diff
 if ret.returncode != 0:
     command_str = " ".join(command)
-    raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+    raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
```

testing/python/jit/test_tilelang_jit_nvrtc.py (1)
505-552: Consider adding a compute capability check for PDL support. Based on `tilelang/contrib/nvcc.py` (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like `test_nvrtc_im2col_tma_desc`, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.

🔎 Proposed fix
```diff
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
     N = 64
```

tilelang/jit/adapter/wrapper.py (1)
213-213: Consider using `set` instead of `dict` for `pdl_sync_map`. The `pdl_sync_map` dictionary always stores `0` as the value (line 450). If the value is unused, a `set` would be more appropriate and clearer in intent.

🔎 Proposed change
```diff
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()
```

And update usages:
- if "has_cuda_pdl_sync" in attrs: - self.pdl_sync_map[function_name] = 0 + if "has_cuda_pdl_sync" in attrs: + self.pdl_sync_map.add(function_name)tilelang/transform/__init__.py (1)
403-406: LGTM with suggestion: Consider expanding the docstring. The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the `have_pdl` parameter and return type.

🔎 Suggested docstring enhancement
```diff
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Device Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
```

tilelang/contrib/nvcc.py (1)
590-596: Minor: Prefix the unused variable with an underscore. The `minor` variable from `parse_compute_version` is unpacked but never used. This is a valid static analysis hint.

Proposed fix
```diff
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
```

testing/python/jit/test_tilelang_jit_ctypes.py (1)
1-1: Remove unused import. The `tvm` import is not used in this test file.

Proposed fix
```diff
-from tilelang import tvm as tvm
 import tilelang.language as T
```

src/transform/lower_pdl.cc (2)
83-83: Typo in class name: `ElininateCudaSyncCalls` → `EliminateCudaSyncCalls`. The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.

Proposed fix
```diff
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };
```

Also update line 141:
```diff
   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);
```
57-78: Duplicate `private:` access specifier. There are two `private:` access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.

Proposed fix
```diff
-private:
-  void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    // ... implementation ...
-  }
-
-private:
-  bool has_trigger_launch_ = false;
-  bool has_grid_sync_ = false;
-
-  MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };
```

Note: The Cppcheck "syntax error" on line 147 is a false positive; `TVM_FFI_STATIC_INIT_BLOCK` is a valid TVM macro for FFI registration.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
docs/programming_guides/instructions.md
src/transform/lower_pdl.cc
testing/python/jit/test_tilelang_jit_ctypes.py
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/language/test_tilelang_language_pdl.py
tilelang/contrib/nvcc.py
tilelang/engine/phase.py
tilelang/jit/adapter/libgen.py
tilelang/jit/adapter/nvrtc/wrapper.py
tilelang/jit/adapter/wrapper.py
tilelang/language/__init__.py
tilelang/language/pdl.py
tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/jit/test_tilelang_jit_ctypes.py
tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
MarkCudaSyncCalls (80-80)
tilelang/contrib/nvcc.py (1)
have_pdl (590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
have_pdl (590-595)
tilelang/transform/__init__.py (1)
MarkCudaSyncCalls (403-405)
tilelang/language/ast/ir.py (1)
target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
test_nvrtc_pdl (505-552), multi_kernels_with_pdl (511-532), main (30-50), main (138-158), main (384-413), main (513-530), kernel (469-484)
testing/python/language/test_tilelang_language_pdl.py (2)
main (7-16), main (23-32)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
lib_code (287-289)
tilelang/jit/adapter/cython/adapter.py (1)
lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🪛 Ruff (0.14.8)
tilelang/language/__init__.py
120-120: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
121-121: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/language/pdl.py
4-7: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/contrib/nvcc.py
594-594: Unpacked variable minor is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tilelang/jit/adapter/libgen.py
153-153: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)
141-155: LGTM: PDL kernel launch code structure looks correct. The `KERNEL_LAUNCH_FUNC_CODE` correctly configures `cudaLaunchConfig_t` with `cudaLaunchAttributeProgrammaticStreamSerialization` and uses `cudaLaunchKernelEx` for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.
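For orientation, a host-side launch following this pattern looks roughly like the sketch below. This is illustrative only; the kernel, its arguments, and the launch dimensions are placeholders, not the generated wrapper code.

```cpp
#include <cuda_runtime.h>

// Hedged sketch of a PDL-enabled launch (CUDA 11.8+); `my_kernel` and its
// arguments are placeholders for whatever the generated wrapper launches.
__global__ void my_kernel(float* out, int n) {}

cudaError_t launch_with_pdl(dim3 grid, dim3 block, cudaStream_t stream,
                            float* out, int n) {
  cudaLaunchConfig_t config = {};
  config.gridDim = grid;
  config.blockDim = block;
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  // Allow this launch to overlap with the tail of the preceding kernel;
  // the kernels themselves coordinate via the PDL trigger/sync intrinsics.
  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = 1;
  config.attrs = attrs;
  config.numAttrs = 1;

  // Template form: kernel arguments are forwarded directly.
  return cudaLaunchKernelEx(&config, my_kernel, out, n);
}
```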
313-330: LGTM: PDL branch in dispatch function is correctly structured. The new PDL path properly:
- Parses function call arguments
- Validates argument count matches parameters
- Uses `KERNEL_LAUNCH_FUNC_CODE` for kernel launch with PDL attributes
- Includes error checking after launch
448-450: LGTM: PDL attribute detection correctly populates the sync map. The `has_cuda_pdl_sync` attribute is checked and the function is registered in `pdl_sync_map` for later use during code generation.

tilelang/language/__init__.py (1)
119-123: LGTM! The PDL function re-exports follow the established pattern in this file. The `# noqa: F401` comments are appropriate for suppressing "imported but unused" warnings in `__init__.py` files intended to expose public API.

docs/programming_guides/instructions.md (1)

141-144: LGTM! Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.
tilelang/engine/phase.py (1)
255-257: LGTM! The `MarkCudaSyncCalls` pass is correctly integrated after `SplitHostDevice`. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.

testing/python/language/test_tilelang_language_pdl.py (1)

37-61: Test implementation looks good. The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the `MarkCudaSyncCalls` pass behavior.

src/transform/lower_pdl.cc (1)
29-43: LGTM on the MarkCudaSyncCalls implementation. The mutator correctly scans for CUDA sync extern calls and annotates the `PrimFunc` with `has_cuda_pdl_trigger` and `has_cuda_pdl_sync` attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)
9-9: Remove duplicate include. The header `tvm/tir/builtin.h` is included twice (line 9 and line 14).

🔎 Proposed fix
#include "../op/builtin.h" #include "../target/utils.h" #include "tvm/ir/type.h" #include "tvm/tir/builtin.h" #include "tvm/tir/expr.h" #include "tvm/tir/stmt.h" #include <tvm/ffi/reflection/registry.h> #include <tvm/tir/analysis.h> -#include <tvm/tir/builtin.h> #include <tvm/tir/stmt_functor.h> #include <tvm/tir/transform.h>Also applies to: 14-14
58-74: Extract duplicated logic into a helper function. The `CheckCall` methods in both `MarkCudaSyncCalls` (lines 58-74) and `EliminateCudaSyncCalls` (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.

🔎 Proposed refactor
Add a helper function in an anonymous namespace:
```cpp
namespace {

// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}

}  // anonymous namespace
```

Then update both classes to use these helpers.

Also applies to: 111-130
138-145: Consider a more descriptive pass name. The pass name `"tl.MarkCudaSyncCalls"` only mentions "Mark" behavior, but the pass can also eliminate calls when `have_pdl` is false. Consider a name like `"tl.ProcessCudaSyncCalls"` or document the dual behavior clearly.

tilelang/jit/adapter/nvrtc/wrapper.py (1)
479-486: Docstring enhancement suggestion (optional): The docstring is functional but could be improved to match the detail level of `generate_l2_persistent_map`. Consider documenting what PDL synchronization is, the expected structure of `pdl_sync_map`, and when/why this returns non-empty code:

```diff
 def generate_pdl_sync_code(self, function_name: str) -> str:
-    """
-    Generate Python code to insert PDL synchronization for a given kernel.
-    """
+    """Generate Python code to configure PDL synchronization for a kernel.
+
+    PDL (Programmatic Device Launch) enables programmatic stream serialization
+    for fine-grained kernel synchronization on compatible CUDA architectures.
+
+    Args:
+        function_name: Kernel name to check for PDL sync config
+
+    Returns:
+        Python code that sets programmatic stream serialization attributes,
+        or empty string if PDL sync is not configured for this kernel.
+    """
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/transform/lower_pdl.cc
tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)
29-81: LGTM! The `MarkCudaSyncCalls` class correctly implements the TVM `StmtExprMutator` pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.
147-151: LGTM! Static analysis false positive can be ignored. The FFI registration correctly exposes the pass through TVM's reflection mechanism. The Cppcheck syntax error is a false positive; Cppcheck doesn't recognize the `TVM_FFI_STATIC_INIT_BLOCK` macro, which is a valid TVM construct.

tilelang/jit/adapter/nvrtc/wrapper.py (3)
172-186: LGTM! Past config overwriting issue has been resolved. The PDL synchronization code correctly assumes the `config` object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder `{11}`) ensures this code executes after `config = CUlaunchConfig()` is created at line 198.
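For readers unfamiliar with the driver-API side, the injected code does something along these lines. This is a hedged sketch based on the cuda-python bindings; the exact field and enum names in the generated template may differ.

```python
# Hedged sketch: add the programmatic-stream-serialization attribute to an
# already-created launch config before cuLaunchKernelEx. Field and enum names
# are assumptions from the cuda-python driver bindings, not the exact template.
from cuda import cuda


def enable_pdl(config) -> None:
    attr = cuda.CUlaunchAttribute()
    attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION
    attr.value.programmaticStreamSerializationAllowed = 1
    config.attrs = [attr]   # modify the existing CUlaunchConfig in place
    config.numAttrs = 1
```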
207-207: LGTM! PDL sync placeholder correctly positioned. The placeholder `{11}` is correctly placed after `config` creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.
423-424: LGTM! PDL sync code correctly integrated. The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the `{11}` placeholder in `KERNEL_LAUNCH_FUNC_PY`. The pattern follows the established approach used for L2 persistent map generation.

Also applies to: 438-438
The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
58-74: Consider extracting duplicated CheckCall logic. The logic for detecting CUDA PDL function names is duplicated between `MarkCudaSyncCalls::CheckCall` and `EliminateCudaSyncCalls::CheckCall`. Consider extracting this into a shared helper function to improve maintainability.

🔎 Proposed refactor to extract common helper
Add a helper function in the anonymous namespace:
```diff
+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {
```

Then simplify both `CheckCall` methods:
```diff
 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }
```

And in `EliminateCudaSyncCalls`:

```diff
 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }
```

Also applies to: 111-130
138-145: Pass name doesn't reflect dual behavior. The pass is named `"tl.MarkCudaSyncCalls"` but it performs two different operations based on the `have_pdl` parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like `"tl.HandleCudaSyncCalls"` or `"tl.ProcessCudaSyncCalls"` to better reflect this dual behavior.

🔎 Proposed rename
```diff
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)
147-151: Static analysis syntax error is a false positive. The Cppcheck syntax error reported at line 147 is a false positive. The `TVM_FFI_STATIC_INIT_BLOCK` macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.
Thanks for the clarification. From my understanding, the Cython backend itself is fine. In particular, with the tvm_ffi path we eventually call `cudaLaunchKernelEx`, but it's not obvious to me where (or whether) we can intercept and adjust the `cudaLaunchConfig_t` there.
You can manipulate this in […]
Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.
https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604
Actionable comments posted: 1
🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)
41-43: Consider optimizing output tensor initialization. Tensors `b` and `c` are initialized with `torch.randn` but are immediately overwritten by the kernel. Consider using `torch.empty` instead for better performance.

🔎 Proposed optimization
```diff
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
threads (214-218)
tilelang/language/loop.py (1)
Parallel (13-33)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
tilelang/utils/tensor.py (1)
torch_assert_close(231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)
1-6: LGTM - Imports are correct. All necessary imports are present for the PDL test functionality.

8-11: LGTM - Test setup is appropriate. The test function is properly defined with a reasonable test size for validating basic PDL functionality.

14-35: LGTM - PDL usage and kernel logic are correct. The PDL primitives are used correctly:
- `T.pdl_trigger()` is called after the first kernel completes its computation, signaling completion
- `T.pdl_sync()` is called before the second kernel begins processing, ensuring proper synchronization

The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).
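For context, a stripped-down version of this two-kernel pattern looks roughly like the sketch below; the tensor sizes, thread counts, and compile call are assumptions rather than the test file itself.

```python
# Hedged sketch of the two-kernel PDL pattern described above.
import tilelang
import tilelang.language as T


def build_pdl_kernel(N: int = 64, threads: int = 64):

    @T.prim_func
    def main(A: T.Tensor((N,), "float32"),
             B: T.Tensor((N,), "float32"),
             C: T.Tensor((N,), "float32")):
        # First kernel: produce B, then signal that dependent grids may start.
        with T.Kernel(1, threads=threads) as bx1:
            for i in T.Parallel(N):
                B[i] = A[i] + 1.0
            T.pdl_trigger()
        # Second kernel: wait on the prior grid before consuming B.
        with T.Kernel(1, threads=threads) as bx2:
            T.pdl_sync()
            for i in T.Parallel(N):
                C[i] = B[i] * 2.0

    return tilelang.compile(main, execution_backend="cython")
```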
48-55: LGTM - Test execution and verification are correct. The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.

58-59: LGTM - Main guard is correctly implemented. The main guard follows the standard pattern for tilelang test files.
I locally implemented similar functionality with that draft PR (: I believe this implementation is sufficient, and we can proceed by adding […]
Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tilelang/language/__init__.py (1)
119-122: LGTM! Consider removing unnecessary `noqa` directives. The PDL function imports are correctly formatted and consistent with the existing pattern in this file. However, Ruff indicates that the `F401` check is not enabled in your configuration, making the `# noqa: F401` directives unnecessary on lines 120-121.

Optional cleanup
```diff
 from .pdl import (
-    pdl_trigger,  # noqa: F401
-    pdl_sync,  # noqa: F401
+    pdl_trigger,
+    pdl_sync,
 )
```

Note: If you prefer to keep the `noqa` directives for consistency with the rest of the file or in case `F401` is enabled in the future, that's also reasonable.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/programming_guides/instructions.md
testing/python/jit/test_tilelang_jit_nvrtc.py
tilelang/language/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/programming_guides/instructions.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (1)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
🪛 Ruff (0.14.10)
tilelang/language/__init__.py
120-120: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
121-121: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)
3336-3337: Consider consolidating or clarifying the duplicate comment. The comment "remove restrict due to NVCC's bug" appears twice (lines 3336 and 3423). Additionally, while the implementation correctly suppresses `__restrict__` annotations when PDL grid sync is present, the comment could be more specific about what NVCC bug is being worked around. This would aid future maintainers in understanding when this workaround can be removed.

Suggested comment improvement
```diff
-  // remove __restrict__ due to NVCC's bug
   bool has_cuda_pdl_sync = func->HasNonzeroAttr(tl::attr::kHasGridSync);
+  // NVCC has issues with __restrict__ on kernel parameters when using PDL (Programmatic Dependent Launch)
+  // synchronization. Suppress the annotation when kHasGridSync is set.
```

Also applies to: 3387-3387
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/target/codegen_cuda.cc
src/transform/warp_specialized_rewriter.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/warp_specialized_rewriter.cc (1)
tilelang/language/tir/op.py (1)
call_extern(173-195)
🔇 Additional comments (2)
src/transform/warp_specialized_rewriter.cc (1)
167-178: LGTM: PDL synchronization calls correctly classified as dual-role. The implementation properly identifies PDL-related extern calls (`cudaGridDependencySynchronize` and `cudaTriggerProgrammaticLaunchCompletion`) and classifies them as both producer and consumer roles, which is appropriate for synchronization primitives. The null checks and string casting are correctly handled.

src/target/codegen_cuda.cc (1)
3387-3387: Consistent handling of the `__restrict__` guard across kernel parameter paths is verified. The `has_cuda_pdl_sync` guard on `__restrict__` appears in exactly two code paths in `codegen_cuda.cc`: `PrintFunctionSignature` (line 3387) and `AddFunction` (line 3476). Both use identical guarding logic (`!has_cuda_pdl_sync && no_alias && !non_restrict.count(v.get())`), ensuring consistent handling throughout CUDA kernel parameter emission. No other code paths emit `__restrict__` annotations without this guard.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
46-50: Verify: `CheckLDGCalls` analyzes the original body, not the mutated body. `CheckLDGCalls` is invoked on `f->body` (original) rather than `new_f->body` (after mutation). Since `MarkCudaSyncCalls` doesn't structurally modify the IR (only annotates the function), this should be equivalent. However, for clarity and defensive coding, consider using the mutated body:

```diff
 if (mutator.has_grid_sync_) {
   new_f = WithAttr(std::move(new_f), attr::kHasGridSync, 1);
   CheckLDGCalls analyzer;
-  analyzer(f->body);
+  analyzer(new_f->body);
 }
```
60-76: Consider extracting duplicate `CheckCall` logic. Both `MarkCudaSyncCalls` and `EliminateCudaSyncCalls` contain nearly identical logic to identify PDL sync calls by name. Consider extracting to a shared helper function.

🔎 Suggested refactor
```cpp
namespace {
// Returns: 0 = not a PDL call, 1 = trigger launch, 2 = grid sync
int ClassifyPdlCall(const tir::CallNode *call) {
  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty())
    return 0;
  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
    if (str_node->value == "cudaTriggerProgrammaticLaunchCompletion")
      return 1;
    if (str_node->value == "cudaGridDependencySynchronize")
      return 2;
  }
  return 0;
}
} // namespace
```

Also applies to: 103-123
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/target/codegen_cuda.cc
src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/lower_pdl.cc
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 140-140: syntax error
(syntaxError)
🔇 Additional comments (8)
src/target/codegen_cuda.cc (3)
17-17: LGTM! The include for the attribute definitions is correctly added to support the PDL-related attribute check.

3336-3338: LGTM – PDL gating for `__restrict__` suppression. The logic correctly suppresses `__restrict__` annotations when `kHasGridSync` is set, addressing the documented NVCC incompatibility with PDL. The comment clearly explains the rationale.

Also applies to: 3388-3388

3424-3426: LGTM – Consistent with `PrintFunctionSignature`. The same PDL gating logic is correctly replicated in `AddFunction`. This duplication follows the existing pattern for `no_alias` and `non_restrict` handling in both methods.

Also applies to: 3478-3478
src/transform/lower_pdl.cc (5)
1-17: LGTM! File header and includes are correctly structured for the PDL pass implementation.

26-34: LGTM – `__ldg` incompatibility check. The visitor correctly enforces the constraint that `__ldg` cannot be used with PDL synchronization due to the NVCC bug documented in the PR commit message.
95-101: LGTM – PDL call elimination. Replacing PDL calls with `make_zero(op->dtype)` is correct for eliminating these calls when PDL is unsupported. The CUDA error type (`cudaError_t`) is integer-based, so zero (`cudaSuccess`) is a valid sentinel.
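A minimal sketch of what such an elimination visitor can look like is shown below; this is an assumed shape for illustration, not the exact code in lower_pdl.cc.

```cpp
// Hedged sketch: replace recognized PDL extern calls with a zero of the call's
// dtype (i.e. cudaSuccess) on targets without PDL support. Assumed shape only.
PrimExpr VisitExpr_(const CallNode* op) final {
  if (op->op.same_as(builtin::call_extern()) && !op->args.empty()) {
    if (const auto* name = op->args[0].as<StringImmNode>()) {
      if (name->value == "cudaGridDependencySynchronize" ||
          name->value == "cudaTriggerProgrammaticLaunchCompletion") {
        return make_zero(op->dtype);
      }
    }
  }
  return StmtExprMutator::VisitExpr_(op);
}
```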
131-138: LGTM – Pass factory correctly dispatches based on `have_pdl`. The pass correctly chooses between marking (when PDL is supported) and elimination (when PDL is not supported) based on the `have_pdl` parameter, which is determined by compute capability ≥ 9.0 per the `have_pdl` function in `nvcc.py`.
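In other words, the wiring presumably reduces to something like the following sketch; the helper name and call site are assumptions, while `MarkCudaSyncCalls` and `have_pdl` are the entry points added in this PR.

```python
# Sketch of how the pass is likely applied during lowering (pipeline position
# is an assumption): mark PDL calls on SM 9.0+, eliminate them elsewhere.
from tilelang import transform as tl_transform
from tilelang.contrib import nvcc


def pdl_pass_for(target):
    return tl_transform.MarkCudaSyncCalls(nvcc.have_pdl(target))
```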
140-144: Static analysis false positive – `TVM_FFI_STATIC_INIT_BLOCK` is valid. The Cppcheck syntax error is a false positive. `TVM_FFI_STATIC_INIT_BLOCK()` is a TVM macro that expands to valid C++ code for FFI registration. The registration correctly exposes the pass under `tl.transform.MarkCudaSyncCalls`.
Actionable comments posted: 1
🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
506-556: Test implementation looks correct. The test properly:
- Skips on unsupported hardware
- Defines a multi-kernel PDL workflow with `T.pdl_trigger()` and `T.pdl_sync()`
- Verifies both intermediate (B = A + 1.0) and final (C = B * 2.0) results

Minor suggestion: Since `b` and `c` are output tensors that will be overwritten by the kernel, consider using `torch.empty()` instead of `torch.randn()` on lines 543-544 to avoid unnecessary initialization.

🔎 Optional: Use torch.empty() for output tensors
```diff
     # Create test tensors
     a = torch.randn(N, dtype=torch.float32).cuda()
-    b = torch.randn(N, dtype=torch.float32).cuda()
-    c = torch.randn(N, dtype=torch.float32).cuda()
+    b = torch.empty(N, dtype=torch.float32).cuda()
+    c = torch.empty(N, dtype=torch.float32).cuda()
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/target/codegen_cuda.cc
src/transform/lower_pdl.cc
testing/python/jit/test_tilelang_jit_cython.py
testing/python/jit/test_tilelang_jit_nvrtc.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.cc
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/lower_pdl.cc
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/jit/test_tilelang_jit_cython.py
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (2)
testing/python/jit/test_tilelang_jit_nvrtc.py (3)
testing/python/jit/test_tilelang_jit_cython.py (3)
check_pdl (9-14), multi_kernels_with_pdl (26-47), main (28-45)
tilelang/jit/adapter/nvrtc/adapter.py (1)
prim_func (267-269)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_cython.py (2)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
tilelang/utils/tensor.py (1)
torch_assert_close(231-319)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 92-92: syntax error
(syntaxError)
🔇 Additional comments (7)
src/transform/lower_pdl.cc (3)
24-34: LGTM! The `CheckLDGCalls` visitor correctly detects incompatible `__ldg` usage with PDL synchronization and provides a clear fatal error message.

59-73: LGTM! The extern call detection logic correctly identifies the CUDA PDL function names by inspecting `StringImmNode` arguments and sets the appropriate flags.

92-96: The `TVM_FFI_STATIC_INIT_BLOCK()` syntax is correct and follows the standard TVM FFI registration pattern used throughout the codebase. The static analysis tool warning is a false positive. This exact pattern appears in 50+ files across the repository (e.g., `src/ir.cc`, `src/transform/annotate_device_regions.cc`, `src/transform/cluster_planning.cc`, etc.) with no compilation issues.

testing/python/jit/test_tilelang_jit_nvrtc.py (1)
498-503: LGTM! The `check_pdl()` function correctly checks for compute capability >= 9, consistent with the implementation in `test_tilelang_jit_cython.py`.
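Roughly what such a guard looks like (illustrative; the helper in the test files may differ in detail):

```python
import torch


def check_pdl() -> bool:
    # PDL requires compute capability 9.0 (Hopper) or newer.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 9
```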
testing/python/jit/test_tilelang_jit_cython.py (3)

9-14: LGTM: Correct compute capability check for PDL. The helper correctly validates that CUDA is available and the device supports compute capability 9.0 or higher, which is required for PDL.

25-47: LGTM: PDL synchronization pattern is correctly implemented. The kernel definition properly demonstrates PDL usage:
- First kernel triggers PDL after computing intermediate results (B = A + 1.0)
- Second kernel synchronizes before consuming those results (C = B * 2.0)

The `execution_backend="cython"` parameter correctly targets the Cython backend as validated in the PR discussion.
49-65: LGTM: Test execution and verification are sound. The test correctly:
- Compiles the PDL-enabled kernel
- Allocates test tensors and computes reference results
- Executes the kernel and verifies outputs with appropriate tolerances

The verification would catch synchronization failures (e.g., if PDL synchronization were broken, the second kernel might read stale data from `B`, causing assertion failures).
silentCoder-dev
left a comment
At this stage, we only need to implement the extension for the TVM FFI backend.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
10-10: Remove duplicate include. `tvm/tir/builtin.h` is included on both line 10 and line 15. Remove one of them.

🔎 Proposed fix

```diff
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>
```

Also applies to: 15-15
59-73: Consider extracting magic strings to named constants. The function name strings `"cudaTriggerProgrammaticLaunchCompletion"` and `"cudaGridDependencySynchronize"` are hardcoded. While acceptable, extracting them to named constants would improve maintainability.

🔎 Proposed refactor
At the top of the file or in the class:
```cpp
namespace {
constexpr const char* kCudaTriggerPDL = "cudaTriggerProgrammaticLaunchCompletion";
constexpr const char* kCudaGridSync = "cudaGridDependencySynchronize";
}
```

Then update the checks:
```diff
           std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+          if (func_name == kCudaTriggerPDL) {
             has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
+          } else if (func_name == kCudaGridSync) {
             has_grid_sync_ = true;
           }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/transform/lower_pdl.cc
tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tilelang/engine/phase.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/lower_pdl.cc
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 92-92: syntax error
(syntaxError)
🔇 Additional comments (3)
src/transform/lower_pdl.cc (3)
24-34: LGTM! The `CheckLDGCalls` visitor correctly detects `__ldg` calls and enforces the constraint that they cannot be used with PDL synchronization.

38-57: LGTM! The transformation logic is correct:
- Properly checks PDL support and fails if PDL functions are used when unsupported
- Correctly annotates the function with attributes based on detected PDL calls
- Runs `CheckLDGCalls` on the original body to detect incompatible `__ldg` usage
84-96: LGTM! Cppcheck syntax error is a false positive. The pass creation and FFI registration follow standard TVM patterns and are implemented correctly. The Cppcheck syntax error reported at line 92 is a false positive: `TVM_FFI_STATIC_INIT_BLOCK()` is a TVM macro that expands to valid C++ code, but Cppcheck doesn't understand the macro expansion.
I have thought about this concern, and my reasoning is as follows.
For example, in the following kernel:

```cuda
__global__ void square(int* array, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n)
        array[tid] = array[tid] * array[tid];
    cudaGridDependencySynchronize();
    array[tid] = array[tid] * array[tid];
}
```
```
# nvcc -gencode arch=compute_90,code=sm_90
# compiled via godbolt.org
...
        STG.E desc[UR4][R2.64], R5
.L_x_1:
        BSYNC B0
.L_x_0:
        ACQBULK
        LDG.E R0, desc[UR4][R2.64]
        IMAD R5, R0, R0, RZ
        STG.E desc[UR4][R2.64], R5
        EXIT
...
```

This positional dependency is the key reason I am cautious about aggressively running […]. If we model PDL via […]. I also considered an alternative using […]:

```python
with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx2,):
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        if idx < N:
            C[idx] = B[idx] * 2.0

    T.pdl_sync()
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        T.pdl_sync()
        if idx < N:
            C[idx] = B[idx] * 2.0
```

In this case, […]. One possible direction is to avoid […]. These are my current thoughts. If there are mistakes in this reasoning, or if you see a cleaner or more idiomatic approach, I would greatly appreciate your guidance.
Testing on SM_80 reveals that NVCC fails due to the […]. A similar pattern can be observed with the […].
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tilelang/contrib/nvrtc.py
tilelang/jit/adapter/nvrtc/libgen.py
tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
tilelang/jit/adapter/nvrtc/libgen.py
🔇 Additional comments (6)
tilelang/contrib/nvrtc.py (1)
73-76: LGTM! Simple and effective conditional header inclusion for PDL support. The string-based detection for `cudaGridDependencySynchronize` and `cudaTriggerProgrammaticLaunchCompletion` is pragmatic. The header is correctly inserted after the standard NVRTC header, ensuring the device runtime API declarations are available when PDL intrinsics are used.
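To illustrate the idea, a sketch of such string-based injection is shown below; the helper name, header name, and insertion point are assumptions rather than the actual nvrtc.py code.

```python
# Hedged sketch of the string-based header injection idea.
PDL_FUNCS = (
    "cudaGridDependencySynchronize",
    "cudaTriggerProgrammaticLaunchCompletion",
)
PDL_HEADER = '#include <cuda_device_runtime_api.h>\n'  # assumed header name


def maybe_inject_pdl_header(source: str, std_header: str) -> str:
    """Insert the device-runtime header right after the standard NVRTC header,
    but only when a PDL intrinsic actually appears in the kernel source."""
    if any(fn in source for fn in PDL_FUNCS):
        return source.replace(std_header, std_header + PDL_HEADER, 1)
    return source
```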
tilelang/jit/adapter/nvrtc/wrapper.py (5)

45-46: Good refactor: imports moved to module-level template. Moving the `CUlaunchAttribute` and `CUlaunchAttributeID` imports to `PREDEF_HOST_FUNC_PY` avoids redundant imports per kernel and keeps the generated code cleaner.
174-182: LGTM! Config overwrite issue resolved. The template now correctly modifies the existing `config` object created in `KERNEL_LAUNCH_FUNC_PY` rather than creating a new one. The programmatic stream serialization attribute is properly configured.
203-203: Correct placement of PDL sync code injection. The placeholder `{11}` is positioned after `config` initialization but before `cuLaunchKernelEx`, ensuring PDL attributes are applied to the launch configuration at the right time.
419-434: Clean integration with existing two-pass code generation. The PDL sync code generation follows the same pattern as L2 persistent map handling, maintaining consistency with the existing per-kernel configuration approach.

475-482: No changes needed. `pdl_sync_map` is properly initialized in the parent class `TLCUDASourceWrapper` at line 213 as `self.pdl_sync_map: dict[str, int] | None = {}`, and populated at line 450 when kernels have the `has_cuda_pdl_sync` attribute. The method correctly accesses this inherited attribute.
Actionable comments posted: 1
🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)
7-7: LGTM: Module-level pytest import is appropriate. The module-level `pytest` import supports the new PDL test and is best practice. As a follow-up, consider removing the redundant local `import pytest` at line 377 in `test_tvm_ffi_im2col_tma_desc()`.
446-451: LGTM: Compute capability check is correct for PDL. The function correctly gates PDL tests on `compute_capability[0] >= 9`, aligning with the SM_90+ requirement discussed in the PR comments. Note that an identical `check_pdl()` helper exists in `test_tilelang_jit_nvrtc.py`. Consolidating these into a shared test utility (e.g., `tilelang.testing`) could reduce duplication across test modules.
454-505: LGTM: PDL test correctly exercises trigger/sync semantics. The test accurately validates PDL synchronization by having the first kernel trigger completion after writing `B`, and the second kernel sync before reading `B`. The PDL API usage aligns with the definitions in `tilelang/language/pdl.py`, and the result validation is thorough.

Optional refinements:
- Test coverage: The current test uses `N=64` with `block_size=256`, launching only a single block per kernel. Consider adding a test case with a larger `N` (e.g., 1024 or 2048) to verify PDL behavior across multiple thread blocks.
- Tensor allocation: Lines 491–492 initialize output tensors `b` and `c` with `torch.randn()` but the kernel immediately overwrites them. Using `torch.empty()` would be slightly more efficient.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
3rdparty/tvm
src/target/rt_mod_cuda.cc
testing/python/jit/test_tilelang_jit_tvm_ffi.py
✅ Files skipped from review due to trivial changes (1)
- 3rdparty/tvm
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)
1-1: LGTM: Includes are necessary for PDL support. The new includes provide the attribute definitions (`kHasGridSync`) and runtime metadata (`kUseProgramaticDependentLaunch`) required by the PDL detection logic at lines 43-46.

Also applies to: 4-4
src/target/rt_mod_cuda.cc
Outdated
```cpp
if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
  info.launch_param_tags.push_back(
      runtime::launch_param::kUseProgramaticDependentLaunch);
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Search for both spelling variants and the constant definition
echo "=== Searching for 'Programatic' (single-m) ==="
rg 'Programatic' -C 2
echo -e "\n=== Searching for 'Programmatic' (double-m) ==="
rg 'Programmatic' -C 2
echo -e "\n=== Checking the specific constant definition ==="
rg 'kUseProgramatic' -C 3
```

Repository: tile-ai/tilelang
Length of output: 6726
The constant name contains a spelling error that should be corrected.
The constant `kUseProgramaticDependentLaunch` uses single-m "Programatic", but all CUDA API functions (`cudaTriggerProgrammaticLaunchCompletion`), attributes (`cudaLaunchAttributeProgrammaticStreamSerialization`), and comments throughout the codebase consistently use double-m "Programmatic". Rename the constant to `kUseProgrammaticDependentLaunch` for consistency.
🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 43 to 46, the constant name
kUseProgramaticDependentLaunch is misspelled (single “m”); rename all
uses/definitions to kUseProgrammaticDependentLaunch to match CUDA naming and
existing codebase conventions, update any declaration, definition, and
references (including headers and enums) to the corrected identifier, and run a
project-wide search/replace to ensure consistency and rebuild to catch any
remaining references.
silentCoder-dev
left a comment
LGTM!
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```
\t\tconfig.gridDim = {0};
\t\tconfig.blockDim = {1};
\t\tconfig.dynamicSmemBytes = {2};
\t\tcudaLaunchKernelEx(&config, {4}, {3});
```
Pack kernel args for cudaLaunchKernelEx
`cudaLaunchKernelEx` expects its third parameter to be a `void**` array of pointers to the kernel arguments, but in the PDL path `{3}` is populated from `call_args = ", ".join(args_list)` (see the PDL branch in this file). That means the generated call passes raw arguments instead of an argument pointer array, so the runtime will interpret the first kernel argument as the `void**` array (and the next as extra), leading to invalid parameter packing or crashes for any `has_cuda_pdl_sync` kernel launched via the C++ wrapper.
cudaLaunchKernelExC uses a void** for its arguments, while cudaLaunchKernelEx employs a template to accept ArgsTypes&&... args.
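To make the distinction concrete, here is a small illustration of the two call forms; the kernel and arguments are placeholders, not code from this PR.

```cpp
#include <cuda_runtime.h>

// The template overload packs arguments itself, while the C variant expects a
// caller-built void* array (placeholder kernel/args for illustration).
__global__ void my_kernel(float* x, int n) {}

cudaError_t launch_both_ways(const cudaLaunchConfig_t* config, float* x, int n) {
  // Template form: arguments are forwarded and packed by the runtime.
  cudaError_t err = cudaLaunchKernelEx(config, my_kernel, x, n);
  if (err != cudaSuccess) return err;

  // C form: pointers to each argument go into a void* array.
  void* args[] = { &x, &n };
  return cudaLaunchKernelExC(config, reinterpret_cast<const void*>(my_kernel), args);
}
```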
Add PDL support as per Issue #1463
Summary by CodeRabbit
Release Notes
New Features
`pdl_trigger()` and `pdl_sync()` to enable programmatic launch completion and grid dependency synchronization for CUDA kernels.

Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.