
auto select between warp specialized and multi-wave approaches #4603


Open
wants to merge 80 commits into base: llu/ws_tma_lnopt

Conversation

liqiangxl
Collaborator

@liqiangxl liqiangxl commented Jun 9, 2025

Summary

This PR introduces automatic selection between the warp specialized and multi-wave approaches for normalization kernels in the InnerOuterPersistentKernelScheduler. The scheduler now chooses between the two based on hardware capability, input shape concreteness, and workload size.

Key Changes

1. Enhanced Heuristic Selection Logic

The getInnerOuterPersistentHeuristics() function now implements a two-tier selection strategy:

User-Requested Selection

  • When EnableOption::WarpSpecializedNormalization is explicitly enabled, the scheduler unconditionally uses the warp specialized approach
  • This provides users with direct control over the scheduling strategy

Automatic Heuristic-Based Selection

The scheduler automatically generates warp specialized heuristics when all of the following conditions are met (a sketch of this predicate follows the list):

  1. GPU Architecture ≥ 10: The multi-wave approach already works well on Hopper GPUs, so warp specialization is only auto-enabled for Blackwell and later architectures
  2. Concretized Input Tensors: The current implementation requires static shapes for warp specialization (dynamic inputs are not yet supported)
  3. Sufficient Iteration Domain Size:
    • RMS Norm Backward: Requires more than 4 × SM_count total rows (i.e., > 4 rows per SM)
    • Layer Norm Backward: Requires more than 16 × SM_count total rows (i.e., > 16 rows per SM)
    • This ensures deep circular buffering and amortizes the cost of loading the weight tensors (which are shared across batches)
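
A minimal sketch of this predicate, assuming illustrative parameter names (the real `preferWarpSpecialized()` introduced by this PR operates on the fusion and runtime info; see Implementation Details below):

```cpp
#include <cstdint>

// Sketch only: mirrors the three auto-selection conditions above.
// All parameter names are illustrative, not the actual signature.
bool preferWarpSpecialized(
    int sm_major,             // compute capability major version
    bool has_symbolic_input,  // any input tensor with a non-static shape?
    int64_t n_rows,           // size of the iteration (outer) domain
    int64_t sm_count,         // number of SMs on the device
    bool is_rms_norm_bwd) {   // RMS Norm bwd vs. Layer Norm bwd
  // (1) Multi-wave is already good on Hopper; auto-enable only on
  //     Blackwell (sm_10x) and later.
  if (sm_major < 10) {
    return false;
  }
  // (2) Warp specialization currently requires static input shapes.
  if (has_symbolic_input) {
    return false;
  }
  // (3) Require enough rows for deep circular buffering so that the
  //     weight-tensor loads (shared across batches) amortize.
  const int64_t multiple = is_rms_norm_bwd ? 4 : 16;
  return n_rows > multiple * sm_count;
}
```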

2. Fallback Mechanism

  • No fallback if EnableOption::WarpSpecializedNormalization is explicitly enabled.
  • For auto-generated heuristics, the scheduler falls back to the multi-wave approach if is_good_ws_heuristic() returns false. A heuristic is considered 'bad' in the following cases (see the sketch after this list):
    • Single Stage Detection: If n_stages == 1, the heuristic cannot achieve circular buffering and is rejected
    • Register Spill Prevention: If bdimy == 1 && is_non_circular_buffer_gmem_to_regs, ping-pong is not used and the heuristic reduces shared memory usage by loading data directly from global memory to registers. This increases register pressure and may lead to register spills, so the method is considered beneficial only when at least 64 non-buffer registers are available. The threshold is based on empirical results from the RMSNorm backward pass in FP16 on B200, with a practical cut-off around a hidden size of 24K.
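
A minimal sketch of the rejection rules, again with assumed names (the real check is exposed via `rparams->is_good_ws_heuristic`):

```cpp
#include <cstdint>

// Sketch only: encodes the two 'bad heuristic' rules above.
bool isGoodWsHeuristic(
    int64_t n_stages,
    int64_t bdimy,
    bool is_non_circular_buffer_gmem_to_regs,
    int64_t non_buffer_registers) {
  // Rule 1: a single stage cannot achieve circular buffering.
  if (n_stages == 1) {
    return false;
  }
  // Rule 2: without ping-pong (bdimy == 1), loading gmem -> regs needs
  // register headroom or it spills; 64 is the empirical threshold from
  // FP16 RMSNorm bwd on B200 (cut-off near hidden size 24K).
  if (bdimy == 1 && is_non_circular_buffer_gmem_to_regs &&
      non_buffer_registers < 64) {
    return false;
  }
  return true;
}
```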

3. Implementation Details

  • New Function: preferWarpSpecialized() implements the heuristic logic for automatic selection
  • Enhanced Parameters: SchedulerHyperParameters now includes is_warp_specialized flag
  • Graceful Degradation: Failed warp specialized attempts seamlessly transition to multi-wave scheduling (the overall flow is sketched below)
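
A hedged sketch of how these pieces could fit together; the inner return condition mirrors a diff excerpt quoted later in this thread, while `warpSpecializedHeuristic()` and `multiWaveHeuristic()` are placeholder names, not the actual functions:

```cpp
// Sketch only: inferred control flow of getInnerOuterPersistentHeuristics().
std::unique_ptr<ReductionParams> getInnerOuterPersistentHeuristics(
    Fusion* fusion, SchedulerRuntimeInfo& runtime_info) {
  SchedulerHyperParameters hp;
  // Tier 1: the user explicitly requested warp specialization.
  hp.is_warp_specialized =
      isOptionEnabled(EnableOption::WarpSpecializedNormalization);
  // Tier 2: automatic heuristic-based selection.
  if (hp.is_warp_specialized || preferWarpSpecialized(fusion, runtime_info)) {
    auto rparams = warpSpecializedHeuristic(fusion, runtime_info);
    // If warp specialized is enabled, or the heuristic is good, keep it.
    if (hp.is_warp_specialized || rparams->is_good_ws_heuristic) {
      return rparams;
    }
  }
  // Graceful degradation: fall back to multi-wave scheduling.
  return multiWaveHeuristic(fusion, runtime_info);
}
```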

Performance Impact

image

liqiangxl added a commit that referenced this pull request Jun 18, 2025
### Scheduler changes
(1) `TIDy` is used to parallelize independent computation warp groups.
(2) Revise codegen to ensure different warp groups use different reduction/broadcast workspaces and sync barriers.
(3) Other minor changes, e.g. avoid unrolling the output tensor to save registers, make the smem buffer size account for the iter-grouped number, and disable unroll "prefetch" for the non-matmul computation branch to avoid instruction cache misses, similar to
#3818

### Heuristic changes
**General idea:**
Optimize register and shared memory usage to achieve multiple
independent compute warp groups, unrolled iteration domains, and deep
circular buffering.

**This is still a rough version that just ensures correctness. It will be fine-tuned for other fusions and for automatic selection between the warp specialized and multi-wave approaches, e.g. #4603.**

**Key parameters:**
Four ints:
`bdimx`: parallelizes the inner dim, e.g. 128, 256; influences register usage
`bdimy`: used for warp specialization and independent compute warp groups, e.g. 1, 2, 3
`iter_unroll`: unroll factor of the iteration dim, e.g. 1, 2, 4
`n_stages`: circular buffer stages, e.g. 2, 4, 8
Two bools:
`bool is_circular_buffer_regs_cached`: cache the TMA-loaded buffer in registers
`bool is_non_circular_buffer_gmem_to_regs`: load non-circular-buffered tvs directly from gmem to registers

**Logic to update the key parameters in `update_heuristics()`** (sketched after the list)
Start with `bdimx = 128, bdimy = 1, iter_unroll = 1, n_stages = 1`, and loop until nothing changes:
(1) Try to increase `n_stages` to the target; check shared memory usage (`n_stages` does not influence register usage).
(2) Try to increase `bdimy` to the target; check shared memory and register usage.
(3) Try to increase `iter_unroll` to the target; check shared memory and register usage.
(4) If `bdimy == 1`, increase `bdimx`; check shared memory usage.
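
A hedged sketch of this greedy update loop. The `WsParams` struct and the resource-check stubs are illustrative assumptions; the real code evaluates actual shared memory and register budgets for the fusion:

```cpp
#include <cstdint>

// Sketch only: parameter bundle and update loop described above.
struct WsParams {
  int64_t bdimx = 128;
  int64_t bdimy = 1;
  int64_t iter_unroll = 1;
  int64_t n_stages = 1;
};

// Stubbed resource checks (assumptions, not the real cost model).
bool fitsSharedMemory(const WsParams& p) { return true; }
bool fitsRegisters(const WsParams& p) { return true; }

void updateHeuristics(
    WsParams& p,
    int64_t target_stages,
    int64_t target_bdimy,
    int64_t target_iter_unroll,
    int64_t max_bdimx) {
  bool changed = true;
  while (changed) {
    changed = false;
    // (1) Deepen circular buffering: only shared memory is affected.
    if (p.n_stages < target_stages) {
      WsParams next = p;
      next.n_stages *= 2;
      if (fitsSharedMemory(next)) { p = next; changed = true; }
    }
    // (2) More independent compute warp groups: smem and registers.
    if (p.bdimy < target_bdimy) {
      WsParams next = p;
      next.bdimy++;
      if (fitsSharedMemory(next) && fitsRegisters(next)) { p = next; changed = true; }
    }
    // (3) Unroll the iteration dim: smem and registers.
    if (p.iter_unroll < target_iter_unroll) {
      WsParams next = p;
      next.iter_unroll *= 2;
      if (fitsSharedMemory(next) && fitsRegisters(next)) { p = next; changed = true; }
    }
    // (4) No ping-pong (bdimy == 1): widen the block instead.
    if (p.bdimy == 1 && p.bdimx < max_bdimx) {
      WsParams next = p;
      next.bdimx *= 2;
      if (fitsSharedMemory(next)) { p = next; changed = true; }
    }
  }
}
```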

**Workflow of the heuristics:** (the cascade is sketched after the list)
Call `update_heuristics()` with varied configurations.
(1) Initial attempt:
`is_circular_buffer_regs_cached = true`
`is_non_circular_buffer_gmem_to_regs = true`
`target_stages = 2, target_bdimy = 2, target_iter_unroll = 2`

(2) First fallback when `bdimy == 1`: reduce register usage by setting `is_circular_buffer_regs_cached = false`, `target_iter_unroll = 1`.

(3) Second fallback when `bdimy == 1`: further reduce register usage by setting `is_non_circular_buffer_gmem_to_regs = false`.

(4) Last fallback when `n_stages == 1`: reduce shared memory to achieve circular buffering by setting `is_non_circular_buffer_gmem_to_regs = true`.

Finally, further increase `target_stages` if there is unused shared memory.
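
A hedged sketch of this cascade, reusing the `WsParams`/`updateHeuristics()` sketch above; the per-fallback targets and the `max_bdimx` bound are assumptions paraphrased from this description:

```cpp
// Sketch only: the fallback cascade around updateHeuristics().
struct WsConfig {
  WsParams params;
  bool is_circular_buffer_regs_cached = true;
  bool is_non_circular_buffer_gmem_to_regs = true;
};

WsConfig selectWsConfig() {
  WsConfig c;
  auto run = [&](int64_t target_iter_unroll) {
    c.params = WsParams{};  // restart from the defaults
    updateHeuristics(c.params, /*target_stages=*/2, /*target_bdimy=*/2,
                     target_iter_unroll, /*max_bdimx=*/512);
  };
  // (1) Initial attempt.
  run(/*target_iter_unroll=*/2);
  // (2) First fallback: no ping-pong achieved, cut register pressure.
  if (c.params.bdimy == 1) {
    c.is_circular_buffer_regs_cached = false;
    run(1);
  }
  // (3) Second fallback: keep non-circular buffers in shared memory.
  if (c.params.bdimy == 1) {
    c.is_non_circular_buffer_gmem_to_regs = false;
    run(1);
  }
  // (4) Last fallback: free shared memory to enable circular buffering.
  if (c.params.n_stages == 1) {
    c.is_non_circular_buffer_gmem_to_regs = true;
    run(1);
  }
  // Spend any leftover shared memory on deeper circular buffering.
  WsParams deeper = c.params;
  deeper.n_stages *= 2;
  if (fitsSharedMemory(deeper)) {
    c.params = deeper;
  }
  return c;
}
```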

### Performance & Influence of different paras:

**After this PR**

![image](https://github.com/user-attachments/assets/cb07cf2c-f8b0-4d9d-9e08-e676049e463c)

After #4599 

![image](https://github.com/user-attachments/assets/4d5807bd-367a-4e2e-b397-f3afc129009a)

**Parameter analysis:**
See pages 1 to 7 in this
[slide](https://docs.google.com/presentation/d/1Z_4c8dhzy_4Px5WfQ1-zP_uq8PohbYVQojjX2SmSkhY/edit?usp=sharing).

---------

Co-authored-by: jjsjann123 <[email protected]>
Base automatically changed from llu/ws_tma_pingpong_static_warp to main June 18, 2025 18:16
@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl changed the base branch from main to llu/ws_tma_lnopt June 24, 2025 15:05
@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl marked this pull request as ready for review June 24, 2025 18:16
@liqiangxl liqiangxl requested a review from jjsjann123 June 24, 2025 18:16
Collaborator

@jjsjann123 jjsjann123 left a comment


perf looks amazing. 👏

I'm wondering if I mis-read the heuristic logic, or if the benchmark is measured using the right scheduling scheme.

@@ -848,6 +848,8 @@ TensorView* getUpCastInputOf(const TensorView* buffer_tv);
//! See device_lower/analysis/tensor_producer_aliases.h
TensorView* scheduleInputToSkipIntermediates(TensorView* tv);

// Returns true if any of the domains of the tensor is symbolic
bool isConcreteTensor(const TensorView* tv);
Collaborator


nitpick, maybe we can just call this SymbolicTensor? since the code comment is saying that already.

Collaborator Author


I was initially using isSymbolicTensor(), but then noticed that "symbolic" has a different meaning in IterType::Symbolic, so I switched to isConcreteTensor. After your suggestion, it seems reasonable to go back to isSymbolicTensor(), since we're checking a tensor, not an iter domain.

We also use the term "symbolic" in IterType::Symbolic, which refers to a temporary state during fusion definition and compilation. This state is later resolved to either IterType::Iteration or IterType::Broadcast during concretization.
In this case, we're explicitly referring to SymbolicTensor, so it should be fine.

Collaborator


I see what you mean. yeah, concretization took out so many good names 😆

Thanks for bearing with my nitpicking.

// static CTA size
auto inp_tvs = ir_utils::filterByType<TensorView>(fusion->inputs());
if (std::any_of(inp_tvs.begin(), inp_tvs.end(), [](TensorView* tv) {
return scheduler_utils::isConcreteTensor(tv);
Collaborator


The logic looks wrong to me.

Suggested change
return scheduler_utils::isConcreteTensor(tv);
return !scheduler_utils::isConcreteTensor(tv);

Collaborator Author


Changed to isSymbolicTensor, so no need to change the logic here.

Collaborator Author


The logic was wrong because I changed the function name from isSymbolicTensor to isConcreteTensor without actually changing the logic. Now we are changing back to isSymbolicTensor.
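
Putting the resolution together, a sketch of the renamed helper and its call site, recombined from the diff excerpts in this thread (the actual code may differ):

```cpp
// Returns true if any of the domains of the tensor is symbolic.
bool isSymbolicTensor(const TensorView* tv);

// Call site sketch: skip the static-CTA warp specialized path when any
// input still has a symbolic (non-concrete) shape.
auto inp_tvs = ir_utils::filterByType<TensorView>(fusion->inputs());
if (std::any_of(inp_tvs.begin(), inp_tvs.end(), [](TensorView* tv) {
      return scheduler_utils::isSymbolicTensor(tv);
    })) {
  // ... fall back to the multi-wave approach ...
}
```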

runtime_info.getIndexType());

// If warp specialized is enabled, or the heuristic is successful, return
if (hp.is_warp_specialized || rparams->is_good_ws_heuristic) {
Collaborator


Out of curiosity, what happens when we have hp.is_warp_specialized set to true, but we failed to get a good heuristic?

Collaborator Author


It leads to poor performance but still gives correct results. Useful for comparing the performance of different approaches.

Collaborator

@jjsjann123 jjsjann123 left a comment


LGTM
