
Conversation

@w169q169 commented Dec 22, 2025

Add PDL support as per Issue #1463

Summary by CodeRabbit

Release Notes

  • New Features

    • Added synchronization primitives pdl_trigger() and pdl_sync() to enable programmatic launch completion and grid dependency synchronization for CUDA kernels.
    • Added support for CUDA compute capability 9.0+ with programmatic dependent launch synchronization.
  • Documentation

    • Extended instruction reference with new synchronization helper functions in the Annotation helpers section.
  • Tests

    • Added comprehensive test coverage for PDL synchronization across JIT, language, and FFI interfaces.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai coderabbitai bot commented Dec 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

This PR introduces CUDA PDL (Programmatic Dependent Launch) synchronization support to TileLang. It adds public API functions pdl_trigger() and pdl_sync(), a compiler pass to detect and annotate CUDA PDL extern calls, integration into the optimization pipeline, NVRTC runtime configuration, host-side launch support, and comprehensive tests covering multiple execution paths.
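
As a rough orientation (not code from this PR), the CUDA sketch below shows the device-side pattern that pdl_trigger() and pdl_sync() lower to: the producer grid signals programmatic launch completion, and the dependent grid waits on the grid dependency before reading the producer's output. Kernel names and the element-wise math are placeholders mirroring the producer/consumer pattern exercised by the PR's tests; building it requires compute capability 9.0 (e.g. -arch=sm_90).

#include <cuda_runtime.h>

// Producer: computes B = A + 1.0, then hints that dependent grids may start
// launching (their prologue can overlap with the tail of this kernel).
__global__ void producer(const float* A, float* B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[i] = A[i] + 1.0f;
    cudaTriggerProgrammaticLaunchCompletion();  // what pdl_trigger() emits as an extern call
}

// Consumer: waits until the preceding grid has completed before reading B,
// then computes C = B * 2.0.
__global__ void consumer(const float* B, float* C, int n) {
    cudaGridDependencySynchronize();  // what pdl_sync() emits as an extern call
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = B[i] * 2.0f;
}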

Changes

  • Documentation & Public API Exposure (docs/programming_guides/instructions.md, tilelang/language/pdl.py, tilelang/language/__init__.py, tilelang/transform/__init__.py): New module pdl.py exports pdl_trigger() and pdl_sync() as TVM extern call wrappers; re-exported through the language package; Python wrapper added to the transform module for the MarkCudaSyncCalls pass.
  • Core Compiler Pass & Attributes (src/transform/lower_pdl.cc, src/transform/common/attr.h): Introduces MarkCudaSyncCallsPass with visitor classes that detect extern calls to CUDA PDL functions by name, annotate functions with kHasTriggerLaunch and kHasGridSync attributes, and validate compatibility (raises an error if __ldg is present when kHasGridSync is set).
  • CUDA Code Generation (src/target/codegen_cuda.cc, src/target/rt_mod_cuda.cc): Conditional suppression of __restrict__ qualifiers when the kHasGridSync attribute is present; extraction of the grid-sync attribute in ExtractFuncInfo to set launch parameter tags for programmatic dependent launch.
  • IR-level Rewriting (src/transform/warp_specialized_rewriter.cc): Extended role classification to mark certain extern calls (cudaGridDependencySynchronize, cudaTriggerProgrammaticLaunchCompletion) with a dual producer/consumer role.
  • Pipeline Integration (tilelang/engine/phase.py): Injected the MarkCudaSyncCalls pass into OptimizeForTarget after SplitHostDevice, conditioned on PDL support availability.
  • CUDA/NVRTC Support (tilelang/contrib/nvcc.py, tilelang/contrib/nvrtc.py, tilelang/jit/adapter/nvrtc/libgen.py): Helper function have_pdl(target) checks for CUDA compute capability ≥ 9; conditional injection of the cuda_device_runtime_api.h header when PDL symbols are detected; NVRTC C++ standard library path selection based on the CUDA compiler version.
  • NVRTC Launch Wrapper (tilelang/jit/adapter/nvrtc/wrapper.py): New generate_pdl_sync_code() method and PDL_SYNC_PY constant for per-kernel synchronization via CUDA launch attributes; integrated into kernel launch code generation via placeholder substitution.
  • Host Kernel Launch Wrapper (tilelang/jit/adapter/wrapper.py): New pdl_sync_map attribute to track PDL-enabled functions; new kernel launch path for programmatic stream serialization when a function is in the map; iteration changed to extract function names via g_var.name_hint.
  • Language & Integration Tests (testing/python/language/test_tilelang_language_pdl.py, testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/jit/test_tilelang_jit_cython.py, testing/python/jit/test_tilelang_jit_tvm_ffi.py): Four test suites validating PDL codegen (generated source contains the expected extern call names), NVRTC execution with PDL (compute result validation), and host-execution paths (Cython and TVM FFI backends), gated on compute capability ≥ 9.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Kernel as Kernel<br/>(TileLang)
    participant Compiler as TVM Compiler<br/>Pipeline
    participant MarkPass as MarkCudaSyncCalls<br/>Pass
    participant Codegen as CUDA<br/>Codegen
    participant NVRTC as NVRTC<br/>Compiler
    participant Runtime as CUDA<br/>Runtime

    User->>Kernel: Define kernel with<br/>pdl_trigger()/pdl_sync()
    Kernel->>Compiler: Lower to TVM IR<br/>(extern calls)
    Compiler->>MarkPass: OptimizeForTarget<br/>phase
    MarkPass->>MarkPass: Detect extern calls<br/>by name
    MarkPass->>Compiler: Annotate with<br/>kHasTriggerLaunch/<br/>kHasGridSync
    Compiler->>Codegen: Code generation<br/>phase
    alt Has kHasGridSync
        Codegen->>Codegen: Suppress __restrict__<br/>qualifiers
        Codegen->>Codegen: Validate no __ldg__<br/>usage
    end
    Codegen->>NVRTC: CUDA kernel source<br/>(with extern calls)
    NVRTC->>NVRTC: Inject PDL header<br/>(if needed)
    NVRTC->>NVRTC: Compile to PTX
    User->>Runtime: Launch kernel
    Runtime->>Runtime: Check pdl_sync_map<br/>for function
    alt Function in PDL map
        Runtime->>Runtime: Use programmatic<br/>launch attributes
    end
    Runtime->>Runtime: Execute kernel<br/>with PDL support
    Runtime-->>User: Results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

  • [Feature Request] PDL Support #1463: Main PDL support issue directly addressed by all changes—implements public pdl_trigger and pdl_sync helpers, detection and annotation passes, runtime plumbing, and comprehensive tests as requested.

Possibly related PRs

  • [CUDA] Add read-only parameter annotation for CUDA codegen #1416: Both PRs modify PrintFunctionSignature and AddFunction paths in src/target/codegen_cuda.cc to conditionally alter kernel parameter qualifiers; this PR gates on has_cuda_pdl_sync while that PR manages const/readonly parameters, creating a merge-order dependency.

Suggested reviewers

  • LeiWang1999

🐰 A rabbit hops through TVM's grand design,
PDL triggers now make kernels sync,
CUDA waits for launches to complete,
SM_90 makes the rhythm sweet!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 28.81%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title '[Feat] PDL Support' directly and clearly describes the main change: adding Programmatic Dependent Launch (PDL) support across the codebase.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24117e6 and c1cdf7c.

📒 Files selected for processing (1)
  • testing/python/jit/test_tilelang_jit_nvrtc.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testing/python/jit/test_tilelang_jit_nvrtc.py


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)

151-153: Consider truncating lib_code in error message.

Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting self.lib_code from the error, or logging it separately at debug level.

🔎 Proposed improvement
         if ret.returncode != 0:
             command_str = " ".join(command)
-            raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+            raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
testing/python/jit/test_tilelang_jit_nvrtc.py (1)

505-552: Consider adding a compute capability check for PDL support.

Based on tilelang/contrib/nvcc.py (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like test_nvrtc_im2col_tma_desc, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.

🔎 Proposed fix
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
 
     N = 64
tilelang/jit/adapter/wrapper.py (1)

213-213: Consider using set instead of dict for pdl_sync_map.

The pdl_sync_map dictionary always stores 0 as the value (line 450). If the value is unused, a set would be more appropriate and clearer in intent.

🔎 Proposed change
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()

And update usages:

-            if "has_cuda_pdl_sync" in attrs:
-                self.pdl_sync_map[function_name] = 0
+            if "has_cuda_pdl_sync" in attrs:
+                self.pdl_sync_map.add(function_name)
tilelang/transform/__init__.py (1)

403-406: LGTM with suggestion: Consider expanding the docstring.

The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the have_pdl parameter and return type.

🔎 Suggested docstring enhancement
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Dependent Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
tilelang/contrib/nvcc.py (1)

590-596: Minor: Prefix unused variable with underscore.

The minor variable from parse_compute_version is unpacked but never used. This is a valid static analysis hint.

Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
testing/python/jit/test_tilelang_jit_ctypes.py (1)

1-1: Remove unused import.

The tvm import is not used in this test file.

Proposed fix
-from tilelang import tvm as tvm
 import tilelang.language as T
src/transform/lower_pdl.cc (2)

83-83: Typo in class name: ElininateCudaSyncCalls → EliminateCudaSyncCalls.

The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.

Proposed fix
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
 
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };

Also update line 141:

   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);

57-78: Duplicate private: access specifier.

There are two private: access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.

Proposed fix
 private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }
 
-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;
 
   MarkCudaSyncCalls() = default;
 };

Note: The Cppcheck "syntax error" on line 147 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for FFI registration.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a431797 and 2a67387.

📒 Files selected for processing (13)
  • docs/programming_guides/instructions.md
  • src/transform/lower_pdl.cc
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/language/test_tilelang_language_pdl.py
  • tilelang/contrib/nvcc.py
  • tilelang/engine/phase.py
  • tilelang/jit/adapter/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
  • tilelang/jit/adapter/wrapper.py
  • tilelang/language/__init__.py
  • tilelang/language/pdl.py
  • tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
  • MarkCudaSyncCalls (80-80)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/transform/__init__.py (1)
  • MarkCudaSyncCalls (403-405)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
  • test_nvrtc_pdl (505-552)
  • multi_kernels_with_pdl (511-532)
  • main (30-50)
  • main (138-158)
  • main (384-413)
  • main (513-530)
  • kernel (469-484)
testing/python/language/test_tilelang_language_pdl.py (2)
  • main (7-16)
  • main (23-32)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
  • parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
  • parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
  • lib_code (287-289)
tilelang/jit/adapter/cython/adapter.py (1)
  • lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🪛 Ruff (0.14.8)
tilelang/language/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/language/pdl.py

4-7: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tilelang/contrib/nvcc.py

594-594: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tilelang/jit/adapter/libgen.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)

141-155: LGTM: PDL kernel launch code structure looks correct.

The KERNEL_LAUNCH_FUNC_CODE correctly configures cudaLaunchConfig_t with cudaLaunchAttributeProgrammaticStreamSerialization and uses cudaLaunchKernelEx for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.


313-330: LGTM: PDL branch in dispatch function is correctly structured.

The new PDL path properly:

  1. Parses function call arguments
  2. Validates argument count matches parameters
  3. Uses KERNEL_LAUNCH_FUNC_CODE for kernel launch with PDL attributes
  4. Includes error checking after launch

448-450: LGTM: PDL attribute detection correctly populates the sync map.

The has_cuda_pdl_sync attribute is checked and the function is registered in pdl_sync_map for later use during code generation.

tilelang/language/__init__.py (1)

119-123: LGTM!

The PDL function re-exports follow the established pattern in this file. The # noqa: F401 comments are appropriate for suppressing "imported but unused" warnings in __init__.py files intended to expose public API.

docs/programming_guides/instructions.md (1)

141-144: LGTM!

Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.

tilelang/engine/phase.py (1)

255-257: LGTM!

The MarkCudaSyncCalls pass is correctly integrated after SplitHostDevice. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.

testing/python/language/test_tilelang_language_pdl.py (1)

37-61: Test implementation looks good.

The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the MarkCudaSyncCalls pass behavior.

src/transform/lower_pdl.cc (1)

29-43: LGTM on the MarkCudaSyncCalls implementation.

The mutator correctly scans for CUDA sync extern calls and annotates the PrimFunc with has_cuda_pdl_trigger and has_cuda_pdl_sync attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)

9-9: Remove duplicate include.

The header tvm/tir/builtin.h is included twice (line 9 and line 14).

🔎 Proposed fix
 #include "../op/builtin.h"
 #include "../target/utils.h"
 #include "tvm/ir/type.h"
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 14-14


58-74: Extract duplicated logic into a helper function.

The CheckCall methods in both MarkCudaSyncCalls (lines 58-74) and EliminateCudaSyncCalls (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.

🔎 Proposed refactor

Add a helper function in an anonymous namespace:

namespace {
// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}
} // anonymous namespace

Then update both classes to use these helpers.

Also applies to: 111-130


138-145: Consider more descriptive pass name.

The pass name "tl.MarkCudaSyncCalls" only mentions "Mark" behavior, but the pass can also eliminate calls when have_pdl is false. Consider a name like "tl.ProcessCudaSyncCalls" or document the dual behavior clearly.

tilelang/jit/adapter/nvrtc/wrapper.py (1)

479-486: Docstring enhancement suggestion (optional):

The docstring is functional but could be improved to match the detail level of generate_l2_persistent_map. Consider documenting what PDL synchronization is, the expected structure of pdl_sync_map, and when/why this returns non-empty code:

    def generate_pdl_sync_code(self, function_name: str) -> str:
-        """
-        Generate Python code to insert PDL synchronization for a given kernel.
-        """
+        """Generate Python code to configure PDL synchronization for a kernel.
+
+        PDL (Programmatic Dependent Launch) enables programmatic stream serialization
+        for fine-grained kernel synchronization on compatible CUDA architectures.
+
+        Args:
+            function_name: Kernel name to check for PDL sync config
+
+        Returns:
+            Python code that sets programmatic stream serialization attributes,
+            or empty string if PDL sync is not configured for this kernel.
+        """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f50aa99 and cb052f4.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)

29-81: LGTM!

The MarkCudaSyncCalls class correctly implements the TVM StmtExprMutator pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.


147-151: LGTM! Static analysis false positive can be ignored.

The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive—cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro, which is a valid TVM construct.

tilelang/jit/adapter/nvrtc/wrapper.py (3)

172-186: LGTM! Past config overwriting issue has been resolved.

The PDL synchronization code correctly assumes the config object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder {11}) ensures this code executes after config = CUlaunchConfig() is created at line 198.


207-207: LGTM! PDL sync placeholder correctly positioned.

The placeholder {11} is correctly placed after config creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.


423-424: LGTM! PDL sync code correctly integrated.

The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the {11} placeholder in KERNEL_LAUNCH_FUNC_PY. The pattern follows the established approach used for L2 persistent map generation.

Also applies to: 438-438

@silentCoder-dev (Collaborator)

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

58-74: Consider extracting duplicated CheckCall logic.

The logic for detecting CUDA PDL function names is duplicated between MarkCudaSyncCalls::CheckCall and EliminateCudaSyncCalls::CheckCall. Consider extracting this into a shared helper function to improve maintainability.

🔎 Proposed refactor to extract common helper

Add a helper function in the anonymous namespace:

+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+  
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {

Then simplify both CheckCall methods:

 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }

And in EliminateCudaSyncCalls:

 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }

Also applies to: 111-130


138-145: Pass name doesn't reflect dual behavior.

The pass is named "tl.MarkCudaSyncCalls" but it performs two different operations based on the have_pdl parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like "tl.HandleCudaSyncCalls" or "tl.ProcessCudaSyncCalls" to better reflect this dual behavior.

🔎 Proposed rename
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb052f4 and 466c94a.

📒 Files selected for processing (1)
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)

147-151: Static analysis syntax error is a false positive.

The Cppcheck syntax error reported at line 147 is a false positive. The TVM_FFI_STATIC_INIT_BLOCK macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.

@w169q169 (Author)

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine.
The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.
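
For reference, the launch-side change under discussion is small in isolation. Below is a minimal host-side sketch, assuming a hypothetical consumer kernel and illustrative launch shapes, of how a dependent kernel is launched with the programmatic-stream-serialization attribute via cudaLaunchKernelEx; it is not this PR's generated wrapper nor TVM's runtime code, just the pattern such a code path would need to emit.

#include <cuda_runtime.h>

// Hypothetical dependent kernel standing in for the generated TileLang kernel.
__global__ void consumer(const float* B, float* C, int n) {
    cudaGridDependencySynchronize();  // wait for the producer grid (PDL)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = B[i] * 2.0f;
}

// Launch `consumer` so that it may overlap with the kernel previously enqueued
// on `stream`; correctness of the data dependency then relies on the PDL
// sync calls inside the kernels.
cudaError_t launch_consumer_with_pdl(const float* B, float* C, int n,
                                     dim3 grid, dim3 block, cudaStream_t stream) {
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t config{};
    config.gridDim = grid;
    config.blockDim = block;
    config.dynamicSmemBytes = 0;
    config.stream = stream;
    config.attrs = &attr;
    config.numAttrs = 1;

    // cudaLaunchKernelEx takes the attribute-bearing config, unlike cudaLaunchKernel.
    return cudaLaunchKernelEx(&config, consumer, B, C, n);
}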

@silentCoder-dev (Collaborator)

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

@w169q169 (Author)

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604
Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)

41-43: Consider optimizing output tensor initialization.

Tensors b and c are initialized with torch.randn but are immediately overwritten by the kernel. Consider using torch.empty instead for better performance.

🔎 Proposed optimization
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 466c94a and 05b8dc0.

📒 Files selected for processing (1)
  • testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/loop.py (1)
  • Parallel (13-33)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  • torch_assert_close (231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)

1-6: LGTM - Imports are correct.

All necessary imports are present for the PDL test functionality.


8-11: LGTM - Test setup is appropriate.

The test function is properly defined with a reasonable test size for validating basic PDL functionality.


14-35: LGTM - PDL usage and kernel logic are correct.

The PDL primitives are used correctly:

  • T.pdl_trigger() is called after the first kernel completes its computation, signaling completion
  • T.pdl_sync() is called before the second kernel begins processing, ensuring proper synchronization

The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).


48-55: LGTM - Test execution and verification are correct.

The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.


58-59: LGTM - Main guard is correctly implemented.

The main guard follows the standard pattern for tilelang test files.

@silentCoder-dev (Collaborator)

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604 Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

I locally implemented similar functionality based on that draft PR :) I believe this implementation is sufficient, and we can proceed by adding has_programmatic_dependent_launch soon after that PR is merged.

@LeiWang1999 (Member)

Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tilelang/language/__init__.py (1)

119-122: LGTM! Consider removing unnecessary noqa directives.

The PDL function imports are correctly formatted and consistent with the existing pattern in this file. However, Ruff indicates that the F401 check is not enabled in your configuration, making the # noqa: F401 directives unnecessary on lines 120-121.

Optional cleanup
 from .pdl import (
-    pdl_trigger,  # noqa: F401
-    pdl_sync,  # noqa: F401
+    pdl_trigger,
+    pdl_sync,
 )

Note: If you prefer to keep the noqa directives for consistency with the rest of the file or in case F401 is enabled in the future, that's also reasonable.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46fe83f and 1c75231.

📒 Files selected for processing (3)
  • docs/programming_guides/instructions.md
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • tilelang/language/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/programming_guides/instructions.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (1)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
🪛 Ruff (0.14.10)
tilelang/language/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)

3336-3337: Consider consolidating or clarifying the duplicate comment.

The comment "remove restrict due to NVCC's bug" appears twice (lines 3336 and 3423). Additionally, while the implementation correctly suppresses __restrict__ annotations when PDL grid sync is present, the comment could be more specific about what NVCC bug is being worked around. This would aid future maintainers in understanding when this workaround can be removed.

Suggested comment improvement
-  // remove __restrict__ due to NVCC's bug
   bool has_cuda_pdl_sync = func->HasNonzeroAttr(tl::attr::kHasGridSync);
+  // NVCC has issues with __restrict__ on kernel parameters when using PDL (Programmatic Dependent Launch)
+  // synchronization. Suppress the annotation when kHasGridSync is set.

Also applies to: 3387-3387

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c75231 and 7ed6748.

📒 Files selected for processing (2)
  • src/target/codegen_cuda.cc
  • src/transform/warp_specialized_rewriter.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/warp_specialized_rewriter.cc (1)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
🔇 Additional comments (2)
src/transform/warp_specialized_rewriter.cc (1)

167-178: LGTM: PDL synchronization calls correctly classified as dual-role.

The implementation properly identifies PDL-related extern calls (cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion) and classifies them as both producer and consumer roles, which is appropriate for synchronization primitives. The null checks and string casting are correctly handled.

src/target/codegen_cuda.cc (1)

3387-3387: Consistent handling of __restrict__ guard across kernel parameter paths is verified.

The has_cuda_pdl_sync guard on __restrict__ appears in exactly two code paths in codegen_cuda.cc: PrintFunctionSignature (line 3387) and AddFunction (line 3476). Both use identical guarding logic (!has_cuda_pdl_sync && no_alias && !non_restrict.count(v.get())), ensuring consistent handling throughout CUDA kernel parameter emission. No other code paths emit __restrict__ annotations without this guard.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

46-50: Verify: CheckLDGCalls analyzes original body, not mutated body.

CheckLDGCalls is invoked on f->body (original) rather than new_f->body (after mutation). Since MarkCudaSyncCalls doesn't structurally modify the IR (only annotates the function), this should be equivalent. However, for clarity and defensive coding, consider using the mutated body:

     if (mutator.has_grid_sync_) {
       new_f = WithAttr(std::move(new_f), attr::kHasGridSync, 1);
       CheckLDGCalls analyzer;
-      analyzer(f->body);
+      analyzer(new_f->body);
     }

60-76: Consider extracting duplicate CheckCall logic.

Both MarkCudaSyncCalls and EliminateCudaSyncCalls contain nearly identical logic to identify PDL sync calls by name. Consider extracting to a shared helper function.

🔎 Suggested refactor
namespace {
// Returns: 0 = not a PDL call, 1 = trigger launch, 2 = grid sync
int ClassifyPdlCall(const tir::CallNode *call) {
  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty())
    return 0;
  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
    if (str_node->value == "cudaTriggerProgrammaticLaunchCompletion") return 1;
    if (str_node->value == "cudaGridDependencySynchronize") return 2;
  }
  return 0;
}
}  // namespace

Also applies to: 103-123

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ed6748 and 8978b11.

📒 Files selected for processing (2)
  • src/target/codegen_cuda.cc
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 140-140: syntax error

(syntaxError)

🔇 Additional comments (8)
src/target/codegen_cuda.cc (3)

17-17: LGTM!

The include for the attribute definitions is correctly added to support the PDL-related attribute check.


3336-3338: LGTM – PDL gating for __restrict__ suppression.

The logic correctly suppresses __restrict__ annotations when kHasGridSync is set, addressing the documented NVCC incompatibility with PDL. The comment clearly explains the rationale.

Also applies to: 3388-3388


3424-3426: LGTM – Consistent with PrintFunctionSignature.

The same PDL gating logic is correctly replicated in AddFunction. This duplication follows the existing pattern for no_alias and non_restrict handling in both methods.

Also applies to: 3478-3478

src/transform/lower_pdl.cc (5)

1-17: LGTM!

File header and includes are correctly structured for the PDL pass implementation.


26-34: LGTM – __ldg incompatibility check.

The visitor correctly enforces the constraint that __ldg cannot be used with PDL synchronization due to the NVCC bug documented in the PR commit message.


95-101: LGTM – PDL call elimination.

Replacing PDL calls with make_zero(op->dtype) is correct for eliminating these calls when PDL is unsupported. The CUDA error type (cudaError_t) is integer-based, so zero (cudaSuccess) is a valid sentinel.


131-138: LGTM – Pass factory correctly dispatches based on have_pdl.

The pass correctly chooses between marking (when PDL is supported) and elimination (when PDL is not supported) based on the have_pdl parameter, which is determined by compute capability ≥ 9.0 per the have_pdl function in nvcc.py.


140-144: Static analysis false positive – TVM_FFI_STATIC_INIT_BLOCK is valid.

The Cppcheck syntax error is a false positive. TVM_FFI_STATIC_INIT_BLOCK() is a TVM macro that expands to valid C++ code for FFI registration. The registration correctly exposes the pass under tl.transform.MarkCudaSyncCalls.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_nvrtc.py (1)

506-556: Test implementation looks correct.

The test properly:

  • Skips on unsupported hardware
  • Defines a multi-kernel PDL workflow with T.pdl_trigger() and T.pdl_sync()
  • Verifies both intermediate (B = A + 1.0) and final (C = B * 2.0) results

Minor suggestion: Since b and c are output tensors that will be overwritten by the kernel, consider using torch.empty() instead of torch.randn() on lines 543-544 to avoid unnecessary initialization.

🔎 Optional: Use torch.empty() for output tensors
     # Create test tensors
     a = torch.randn(N, dtype=torch.float32).cuda()
-    b = torch.randn(N, dtype=torch.float32).cuda()
-    c = torch.randn(N, dtype=torch.float32).cuda()
+    b = torch.empty(N, dtype=torch.float32).cuda()
+    c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8978b11 and af5f26d.

📒 Files selected for processing (4)
  • src/target/codegen_cuda.cc
  • src/transform/lower_pdl.cc
  • testing/python/jit/test_tilelang_jit_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_cuda.cc
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_cython.py
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (2)
testing/python/jit/test_tilelang_jit_nvrtc.py (3)
testing/python/jit/test_tilelang_jit_cython.py (3)
  • check_pdl (9-14)
  • multi_kernels_with_pdl (26-47)
  • main (28-45)
tilelang/jit/adapter/nvrtc/adapter.py (1)
  • prim_func (267-269)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_cython.py (2)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  • torch_assert_close (231-319)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 92-92: syntax error

(syntaxError)

🔇 Additional comments (7)
src/transform/lower_pdl.cc (3)

24-34: LGTM!

The CheckLDGCalls visitor correctly detects incompatible __ldg usage with PDL synchronization and provides a clear fatal error message.


59-73: LGTM!

The extern call detection logic correctly identifies the CUDA PDL function names by inspecting StringImmNode arguments and sets the appropriate flags.


92-96: The TVM_FFI_STATIC_INIT_BLOCK() syntax is correct and follows the standard TVM FFI registration pattern used throughout the codebase. The static analysis tool warning is a false positive. This exact pattern appears in 50+ files across the repository (e.g., src/ir.cc, src/transform/annotate_device_regions.cc, src/transform/cluster_planning.cc, etc.) with no compilation issues.

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

498-503: LGTM!

The check_pdl() function correctly checks for compute capability >= 9, consistent with the implementation in test_tilelang_jit_cython.py.

testing/python/jit/test_tilelang_jit_cython.py (3)

9-14: LGTM: Correct compute capability check for PDL.

The helper correctly validates that CUDA is available and the device supports compute capability 9.0 or higher, which is required for PDL.


25-47: LGTM: PDL synchronization pattern is correctly implemented.

The kernel definition properly demonstrates PDL usage:

  • First kernel triggers PDL after computing intermediate results (B = A + 1.0)
  • Second kernel synchronizes before consuming those results (C = B * 2.0)

The execution_backend="cython" parameter correctly targets the Cython backend as validated in the PR discussion.


49-65: LGTM: Test execution and verification are sound.

The test correctly:

  • Compiles the PDL-enabled kernel
  • Allocates test tensors and computes reference results
  • Executes the kernel and verifies outputs with appropriate tolerances

The verification would catch synchronization failures (e.g., if PDL synchronization were broken, the second kernel might read stale data from B, causing assertion failures).

@silentCoder-dev (Collaborator) left a comment

At this stage, we only need to implement the extension for the TVM FFI backend.

@silentCoder-dev silentCoder-dev dismissed their stale review December 25, 2025 02:47

This issue is resolved

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

10-10: Remove duplicate include.

tvm/tir/builtin.h is included on both line 10 and line 15. Remove one of them.

🔎 Proposed fix
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 15-15


59-73: Consider extracting magic strings to named constants.

The function name strings "cudaTriggerProgrammaticLaunchCompletion" and "cudaGridDependencySynchronize" are hardcoded. While acceptable, extracting them to named constants would improve maintainability.

🔎 Proposed refactor

At the top of the file or in the class:

namespace {
constexpr const char* kCudaTriggerPDL = "cudaTriggerProgrammaticLaunchCompletion";
constexpr const char* kCudaGridSync = "cudaGridDependencySynchronize";
}

Then update the checks:

           std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+          if (func_name == kCudaTriggerPDL) {
             has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
+          } else if (func_name == kCudaGridSync) {
             has_grid_sync_ = true;
           }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af5f26d and c123fd7.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/engine/phase.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 92-92: syntax error

(syntaxError)

🔇 Additional comments (3)
src/transform/lower_pdl.cc (3)

24-34: LGTM!

The CheckLDGCalls visitor correctly detects __ldg calls and enforces the constraint that they cannot be used with PDL synchronization.


38-57: LGTM!

The transformation logic is correct:

  • Properly checks PDL support and fails if PDL functions are used when unsupported
  • Correctly annotates the function with attributes based on detected PDL calls
  • Runs CheckLDGCalls on the original body to detect incompatible __ldg usage
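As a sketch of the annotation step from the second bullet above, expressed with TVM's WithAttr helper (attribute keys follow the names used in this PR; the surrounding pass code is assumed, not quoted):

tvm::tir::PrimFunc AnnotatePdl(tvm::tir::PrimFunc f, bool has_trigger, bool has_sync) {
  if (has_trigger) {
    f = WithAttr(std::move(f), tl::attr::kHasTriggerLaunch, tvm::Integer(1));
  }
  if (has_sync) {
    f = WithAttr(std::move(f), tl::attr::kHasGridSync, tvm::Integer(1));
  }
  return f;
}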

84-96: LGTM! Cppcheck syntax error is a false positive.

The pass creation and FFI registration follow standard TVM patterns and are implemented correctly. The Cppcheck syntax error reported at line 92 is a false positive—TVM_FFI_STATIC_INIT_BLOCK() is a TVM macro that expands to valid C++ code, but Cppcheck doesn't understand the macro expansion.
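For reference, the registration pattern in question typically looks like the following (the global symbol name and function identifier here are assumptions):

TVM_FFI_STATIC_INIT_BLOCK({
  namespace refl = tvm::ffi::reflection;
  refl::GlobalDef().def("tl.transform.MarkCudaSyncCalls", MarkCudaSyncCalls);
});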

@w169q169
Author

When have_pdl is not supported, we should skip the lower_pdl pass. Running it to try and eliminate PDL calls would be incorrect or unnecessary.

I have thought about this concern, and my reasoning is as follows.

cudaGridDependencySynchronize() and
cudaTriggerProgrammaticLaunchCompletion()
are not just semantic markers; they lower to concrete device instructions whose placement in the instruction stream matters.

For example, in the following kernel:

__global__ void square(int* array, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        array[tid] = array[tid] * array[tid];
        cudaGridDependencySynchronize();
        array[tid] = array[tid] * array[tid];
    }
}

cudaGridDependencySynchronize() is lowered to specific SASS instructions (BSYNC B0 and ACQBULK), which are emitted at precise locations in the control flow:

# nvcc -gencode arch=compute_90,code=sm_90
# compiled via godbolt.org

...
 STG.E desc[UR4][R2.64], R5 
.L_x_1:
 BSYNC B0 
.L_x_0:
 ACQBULK 
 LDG.E R0, desc[UR4][R2.64] 
 IMAD R5, R0, R0, RZ 
 STG.E desc[UR4][R2.64], R5 
 EXIT 
...

This positional dependency is the key reason I am cautious about aggressively running lower_pdl when have_pdl is not supported.

If we model PDL via tir.call_extern, then I agree this introduces extra passes whose sole purpose is to eliminate these calls on non-CUDA targets. That is redundant, but at least it preserves exact placement semantics until codegen.

I also considered an alternative using block.attr, similar to T.annotate_l2_hit_ratio. This avoids enabling a CUDA-specific lowering pass on non-CUDA targets. However, the drawback is that we lose precise location information. For example:

with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx2,):
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        if idx < N:
            C[idx] = B[idx] * 2.0
    T.pdl_sync()
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        T.pdl_sync()
        if idx < N:
            C[idx] = B[idx] * 2.0

In this case, pdl_sync becomes attached to the T.Kernel block, but its exact position relative to surrounding instructions is no longer explicit, which is problematic for something that lowers to synchronization instructions.

One possible direction is to avoid call_extern and instead use attr, then explicitly materialize the corresponding device-side calls during codegen for each backend. However, the trade-off is that this approach still requires backend-specific handling and is conceptually similar in cost to eliminating function calls via passes.

These are my current thoughts. If there are mistakes in this reasoning, or if you see a cleaner or more idiomatic approach, I would greatly appreciate your guidance.

@silentCoder-dev
Copy link
Collaborator

silentCoder-dev commented Dec 25, 2025


Testing on SM_80 reveals that NVCC fails due to the griddepcontrol instruction requiring SM_90+. To ensure consistent behavior, the lower_pdl pass has been updated to throw the same error. In my view, addressing this hardware limitation (by targeting a compatible architecture) is a user responsibility; the pass should not automatically eliminate PDL calls as a workaround.

A similar pattern can be observed with the __ldg intrinsic. For example, in testing/python/language/test_tilelang_language_intrinsics_codegen.py, targeting a non-CUDA backend like "metal" results in the error: "tvm.error.InternalError: Unresolved call Op(tl.__ldg)".

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 877962b and 22455f2.

📒 Files selected for processing (3)
  • tilelang/contrib/nvrtc.py
  • tilelang/jit/adapter/nvrtc/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • tilelang/jit/adapter/nvrtc/libgen.py
🔇 Additional comments (6)
tilelang/contrib/nvrtc.py (1)

73-76: LGTM! Simple and effective conditional header inclusion for PDL support.

The string-based detection for cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion is pragmatic. The header is correctly inserted after the standard NVRTC header, ensuring the device runtime API declarations are available when PDL intrinsics are used.
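A sketch of this string-based gating (the helper name and exact insertion point are assumptions; only the two PDL symbol names and the header are taken from the review above):

_PDL_SYMBOLS = (
    "cudaGridDependencySynchronize",
    "cudaTriggerProgrammaticLaunchCompletion",
)

def maybe_inject_pdl_header(code: str) -> str:
    # NVRTC does not provide the device runtime declarations by default;
    # pull them in only when PDL intrinsics actually appear in the source.
    if any(sym in code for sym in _PDL_SYMBOLS):
        return "#include <cuda_device_runtime_api.h>\n" + code
    return code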

tilelang/jit/adapter/nvrtc/wrapper.py (5)

45-46: Good refactor: imports moved to module-level template.

Moving CUlaunchAttribute and CUlaunchAttributeID imports to PREDEF_HOST_FUNC_PY avoids redundant imports per kernel and keeps the generated code cleaner.


174-182: LGTM! Config overwrite issue resolved.

The template now correctly modifies the existing config object created in KERNEL_LAUNCH_FUNC_PY rather than creating a new one. The programmatic stream serialization attribute is properly configured.
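Conceptually, the generated host code amends the existing launch config roughly as follows (shown with the cuda-python driver bindings; the field names follow the CUDA driver API, but the actual generated template differs in detail):

from cuda import cuda  # cuda-python driver bindings

def enable_pdl_on_config(config):
    # Allow this launch to overlap with its dependent successor (PDL).
    attr = cuda.CUlaunchAttribute()
    attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION
    attr.value.programmaticStreamSerializationAllowed = 1
    config.attrs = [attr]
    config.numAttrs = 1
    return config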


203-203: Correct placement of PDL sync code injection.

The placeholder {11} is positioned after config initialization but before cuLaunchKernelEx, ensuring PDL attributes are applied to the launch configuration at the right time.


419-434: Clean integration with existing two-pass code generation.

The PDL sync code generation follows the same pattern as L2 persistent map handling, maintaining consistency with the existing per-kernel configuration approach.


475-482: No changes needed. pdl_sync_map is properly initialized in the parent class TLCUDASourceWrapper at line 213 as self.pdl_sync_map: dict[str, int] | None = {}, and populated at line 450 when kernels have the has_cuda_pdl_sync attribute. The method correctly accesses this inherited attribute.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)

7-7: LGTM: Module-level pytest import is appropriate.

The module-level pytest import supports the new PDL test and is best practice. As a follow-up, consider removing the redundant local import pytest at line 377 in test_tvm_ffi_im2col_tma_desc().


446-451: LGTM: Compute capability check is correct for PDL.

The function correctly gates PDL tests on compute_capability[0] >= 9, aligning with the SM_90+ requirement discussed in the PR comments. Note that an identical check_pdl() helper exists in test_tilelang_jit_nvrtc.py. Consolidating these into a shared test utility (e.g., tilelang.testing) could reduce duplication across test modules.


454-505: LGTM: PDL test correctly exercises trigger/sync semantics.

The test accurately validates PDL synchronization by having the first kernel trigger completion after writing B, and the second kernel sync before reading B. The PDL API usage aligns with the definitions in tilelang/language/pdl.py, and the result validation is thorough.

Optional refinements:

  • Test coverage: The current test uses N=64 with block_size=256, launching only a single block per kernel. Consider adding a test case with a larger N (e.g., 1024 or 2048) to verify PDL behavior across multiple thread blocks.
  • Tensor allocation: Lines 491–492 initialize output tensors b and c with torch.randn() but the kernel immediately overwrites them. Using torch.empty() would be slightly more efficient.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22455f2 and 2f83ea7.

📒 Files selected for processing (3)
  • 3rdparty/tvm
  • src/target/rt_mod_cuda.cc
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
✅ Files skipped from review due to trivial changes (1)
  • 3rdparty/tvm
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)

1-1: LGTM: Includes are necessary for PDL support.

The new includes provide the attribute definitions (kHasGridSync) and runtime metadata (kUseProgramaticDependentLaunch) required by the PDL detection logic at lines 43-46.

Also applies to: 4-4

Comment on lines 43 to 46
if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
info.launch_param_tags.push_back(
runtime::launch_param::kUseProgramaticDependentLaunch);
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for both spelling variants and the constant definition

echo "=== Searching for 'Programatic' (single-m) ==="
rg 'Programatic' -C 2

echo -e "\n=== Searching for 'Programmatic' (double-m) ==="
rg 'Programmatic' -C 2

echo -e "\n=== Checking the specific constant definition ==="
rg 'kUseProgramatic' -C 3

Repository: tile-ai/tilelang

Length of output: 6726


The constant name contains a spelling error that should be corrected.

The constant kUseProgramaticDependentLaunch uses single-m "Programatic", but all CUDA API functions (cudaTriggerProgrammaticLaunchCompletion), attributes (cudaLaunchAttributeProgrammaticStreamSerialization), and comments throughout the codebase consistently use double-m "Programmatic". Rename the constant to kUseProgrammaticDependentLaunch for consistency.

🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 43 to 46, the constant name
kUseProgramaticDependentLaunch is misspelled (single “m”); rename all
uses/definitions to kUseProgrammaticDependentLaunch to match CUDA naming and
existing codebase conventions, update any declaration, definition, and
references (including headers and enums) to the corrected identifier, and run a
project-wide search/replace to ensure consistency and rebuild to catch any
remaining references.

Collaborator

@silentCoder-dev silentCoder-dev left a comment

LGTM!

@LeiWang1999 LeiWang1999 self-requested a review December 26, 2025 09:39
@LeiWang1999
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


\t\tconfig.gridDim = {0};
\t\tconfig.blockDim = {1};
\t\tconfig.dynamicSmemBytes = {2};
\t\tcudaLaunchKernelEx(&config, {4}, {3});


P1: Pack kernel args for cudaLaunchKernelEx

cudaLaunchKernelEx expects its third parameter to be a void** array of pointers to the kernel arguments, but in the PDL path {3} is populated from call_args = ", ".join(args_list) (see the PDL branch in this file). That means the generated call passes raw arguments instead of an argument pointer array, so the runtime will interpret the first kernel argument as the void** array (and the next as extra), leading to invalid parameter packing or crashes for any has_cuda_pdl_sync kernel launched via the C++ wrapper.


Collaborator

@silentCoder-dev silentCoder-dev Dec 26, 2025
cudaLaunchKernelExC uses a void** for its arguments, while cudaLaunchKernelEx employs a template to accept ArgsTypes&&... args.
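To make the distinction concrete, a small host-side illustration (the kernel and its arguments are placeholders; the call signatures follow the CUDA runtime documentation):

#include <cuda_runtime.h>

__global__ void my_kernel(float* out, int n) { /* placeholder body */ }

void launch_both_ways(cudaLaunchConfig_t& config, float* out, int n) {
  // C-style entry point: arguments are packed into a void** array.
  void* args[] = {&out, &n};
  cudaLaunchKernelExC(&config, reinterpret_cast<const void*>(my_kernel), args);

  // C++ template overload: arguments are forwarded directly, no packing needed.
  cudaLaunchKernelEx(&config, my_kernel, out, n);
}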

@kurisu6912 kurisu6912 linked an issue Dec 29, 2025 that may be closed by this pull request
1 task


Development

Successfully merging this pull request may close these issues.

[Feature Request] PDL Support

3 participants