
Translate MmaOp patterns properly on Hopper #4072

Open · wants to merge 99 commits into base: main
Conversation

@jacobhinkle (Collaborator) commented Mar 13, 2025

#3986 fixes our most common use cases of MatmulOp and LinearOp translation on Hopper. It does so by scheduling the memory types and allocation domains of global intermediates during translation. However, when there is no translation and we are given an MmaOp directly, that approach fails. This PR instead propagates memory types and allocation domains while caching operands, so that we can properly set as_, bs_, and so forth. As a result, input fusions no longer need to differ between Hopper and Ampere: we can translate both cases in the same way, and the only difference will be in scheduling.

Note that this will also make it easier to maintain internal tooling that uses things like canonicalizeInputToBMNK.

Changes in this PR:

  • Remove avoid_intermediates argument to MatmulPattern::translateToMmaOp and update all call sites.
  • Remove some helper utilities in mma_utils.cpp
  • Introduce scheduler_utils::scheduleInputToSkipIntermediates which will schedule the allocation domains and memory types of consumers of inputs recursively to avoid "metadata ops" at the beginning of a fusion.
  • Rearrange HopperMultipleMatmulScheduler to remove defineOperandCaches and move cacheInputsAndOutputs after pattern translation but before findRoles. Also cacheInputsAndOutputs now uses scheduler_utils::scheduleInputToSkipIntermediates and defines the operand roles as the last gmem tensor returned by that utility.
  • Unguard AllocationDomainTest.BasicMatmul/* to allow it to run on Hopper
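The idea behind scheduler_utils::scheduleInputToSkipIntermediates can be illustrated with a toy sketch. This is a hypothetical, simplified model — the types and names below are not the actual nvFuser API: starting from a fusion input, walk through consecutive "metadata op" outputs (broadcast/squeeze/permute results), keep each one in global memory so it can be skipped via producer aliasing, and return the last global tensor, which the scheduler then records as the operand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy IR, illustrating the idea behind
// scheduler_utils::scheduleInputToSkipIntermediates. Not the real nvFuser API.
enum class MemoryType { Global, Shared, Local };

struct Tensor {
  std::string name;
  MemoryType mtype = MemoryType::Local;
  bool is_metadata_op_output = false;  // e.g. broadcast/squeeze/permute result
  std::vector<Tensor*> consumers;
};

// Walk a consumer chain made purely of metadata ops, placing each such output
// in global memory (so it can be skipped with a producer alias instead of
// generating real data movement), and return the last global tensor reached.
// That tensor is what the scheduler later caches and uses as the operand role.
Tensor* scheduleInputToSkipIntermediates(Tensor* input) {
  input->mtype = MemoryType::Global;
  Tensor* last_gmem = input;
  // For simplicity this follows a single chain; the real utility works
  // recursively over all consumers of the input.
  Tensor* tv = input;
  while (tv->consumers.size() == 1 && tv->consumers[0]->is_metadata_op_output) {
    tv = tv->consumers[0];
    tv->mtype = MemoryType::Global;  // keep in gmem; no data movement needed
    last_gmem = tv;
  }
  return last_gmem;
}
```

For example, with a chain input -> broadcast -> permute -> MmaOp operand, the utility would mark the broadcast and permute outputs as global and return the permute output as the tensor to cache.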

TODO: Squeeze, permute, chain tests
Failing at computing TMA descriptor now
@jacobhinkle

!test

@jacobhinkle

jacobhinkle commented Mar 18, 2025

Currently some tests fail to compile because I am calling scheduler_utils::scheduleInputToSkipIntermediates even on Ampere. This should work, but it exposes some bugs in the current tensor producer alias system. A better system would not simply compute the index based on the skipped tensor, but would instead let us do src indexing on tensors that are not direct producers. I think TensorIndexer could support this, but I could not figure out how to do it without modifying TensorIndexer. So for now, I plan to skip this call in MultipleMatmulScheduler::cacheInputsAndOutputs upon request (i.e. on Ampere).

EDIT: see the bool skip_intermediates argument to cacheInputsAndOutputs.

@jacobhinkle

!test --diff

@jacobhinkle

!test --diff

@@ -467,19 +467,19 @@ void AmpereMultipleMatmulScheduler::validate() const {
}

void AmpereMultipleMatmulScheduler::run() {
// Clears memory spaces on intermediate tensors, calls
Changes to Ampere scheduler should not change generated code, but do let us use a common cacheInputsAndOutputs method.

Comment on lines +1098 to +1099
for (Val* dv : fusion_->outputs()) {
auto* d = dv->as<TensorView>();
dc is ignored anyway. We long ago stopped using cached_outputs_ in the Hopper scheduler, so I removed it for Ampere too as it was causing a problem due to the refactor not filling that vector.

@@ -1789,177 +1789,6 @@ std::string MatmulPattern::toString() const {

namespace {

// Check whether tv has all the output_groups in its logical domain, and
These utilities are no longer needed since we can now safely translate all matmul patterns the same way on both Hopper and Ampere. The differences are purely in downstream scheduling.

@@ -2492,7 +2492,7 @@ class MatmulSchedulerPluginTest : public NVFuserTest {

// Test that our fake plugin works to override the default heuristic
TEST_F(MatmulSchedulerPluginTest, BasicMatmul) {
-  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 9, 0);
+  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 10, 0);
More tests can likely be unguarded once we update their params or set a fixture to create default sets of parameters.

@jacobhinkle jacobhinkle changed the title [WIP] Translate MmaOp patterns properly on Hopper Translate MmaOp patterns properly on Hopper Mar 19, 2025
Fixes the horizontal fusion tests
This can be made to work but currently the config factory generates an
invalid config.
@jacobhinkle

!test --diff

@jacobhinkle

!test --diff

@jacobhinkle jacobhinkle marked this pull request as ready for review March 25, 2025 15:42
Comment on lines 129 to 145
 void HopperMultipleMatmulScheduler::run() {
-  // Clears memory spaces on intermediate tensors, calls
-  // cache{After,Before,Fork} on inputs and outputs
-  cacheInputsAndOutputs();
-
   // Finds matmul patterns and translates them to MmaOps, then finds tensor
   // and dimension roles for all tensors in the fusion
   findPatterns();
   translatePatterns();
-  findRoles();

+  // Clears memory spaces on intermediate tensors, calls
+  // cache{After,Before,Fork} on inputs and outputs.
-  // Defines acw_smem/bcw_smem and acr/bcr by possibly calling cacheAfter.
-  // This also collects mma_results_
-  defineOperandCaches();
+  cacheInputsAndOutputs(/*skip_intermediates=*/true);
+
+  // We wait until we are done caching tensors to find roles, since this
+  // requires building an IdModel, which would not be updated during the cache
+  // calls.
+  findRoles();

   inspectPrologues();
Rearranged to not cache until after translation of patterns. This is helpful because it lets us cache the global tensors that have been "skipped" with producer tensor aliases, instead of the original fusion inputs.
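The ordering constraint can be sketched with a toy model (hypothetical names, not the actual nvFuser code): translation may append propagated gmem "skipped" tensors between a fusion input and the MmaOp operand, so caching afterwards picks up the last of those rather than the raw input.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of why caching must follow pattern translation.
// A chain records the gmem tensors from a fusion input toward an MmaOp
// operand; translation appends propagated gmem intermediates to it.
struct Chain {
  std::vector<std::string> tensors;  // input -> ... -> last gmem tensor
};

// Simulates pattern translation adding a skipped gmem intermediate
// (e.g. a permuted view of the input, aliased to its producer).
void translatePattern(Chain& chain, const std::string& skipped_gmem) {
  chain.tensors.push_back(skipped_gmem);
}

// Caching picks the LAST gmem tensor in the chain: if run before
// translation, this would be the raw fusion input; run after, it is the
// propagated intermediate, which is what we want as the operand role.
std::string tensorToCache(const Chain& chain) {
  return chain.tensors.back();
}
```

Running tensorToCache before translatePattern would return the raw input, which is exactly the situation the reordering in run() avoids.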

@jacobhinkle jacobhinkle requested a review from rdspring1 March 25, 2025 15:55