[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner #5207


Open
wants to merge 3 commits into main

Conversation

DomBrown
Collaborator

Description

Autotuner integration for the PyTorch workflow with the TRT-LLM Gen FP8 block scale MoE kernel.

Test Coverage

tests/unittest/_torch/thop/test_moe.py

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
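
For example, a typical invocation that combines flags from the list above (flag names and the stage string are taken directly from this help text):

/bot run --disable-fail-fast --stage-list "A10-1"

This launches the build/test pipeline, restricts testing to the A10-1 stage, and keeps the pipeline running past individual build/test failures.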

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without due care and validation can break the top of tree.

@DomBrown DomBrown requested review from nekorobov and hyukn June 13, 2025 16:31
@DomBrown DomBrown self-assigned this Jun 13, 2025
@DomBrown DomBrown requested a review from a team as a code owner June 13, 2025 16:31
@DomBrown DomBrown requested a review from hlu1 June 13, 2025 16:31
@DomBrown
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8836 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8836 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6423 completed with status: 'FAILURE'

Signed-off-by: Dom Brown <[email protected]>
@DomBrown
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8848 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8848 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6435 completed with status: 'SUCCESS'


kernel_runner = FP8BlockScaleMoERunner(tile_tokens_dim)

inputs = [
Collaborator

My guess is that not all of the scalar values are meant to go into the cache key, but we still need them for the kernel_runner; only the scalar value tile_tokens_dim actually matters.
This approach happens to work because we treat the scalars as size(0) tensors in the cache key, but the original design of the inputs list was intended for tensor data only.
My suggestion is the following (a rough sketch is shown after the list):

  • Still store all the attributes on the runner for use in forward and get_valid_tactics.
  • Override __hash__ so that only tile_tokens_dim goes into the cache key.
  • Pass only tensors through inputs.
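
A rough sketch of that suggestion (this is not the merged implementation; the import path for TunableRunner and the exact set of scalar attributes are assumptions for illustration):

from tensorrt_llm._torch.autotuner import TunableRunner  # assumed import path


class FP8BlockScaleMoERunner(TunableRunner):

    def __init__(self, tile_tokens_dim, num_experts, top_k, routing_method_type):
        # Keep every scalar on the instance so forward() and get_valid_tactics()
        # can still read them (only a subset of the scalars is shown here).
        self.tile_tokens_dim = tile_tokens_dim
        self.num_experts = num_experts
        self.top_k = top_k
        self.routing_method_type = routing_method_type

    def __hash__(self):
        # Only tile_tokens_dim participates in the tuning cache key.
        return hash(self.tile_tokens_dim)

    def __eq__(self, other):
        return (isinstance(other, FP8BlockScaleMoERunner)
                and self.tile_tokens_dim == other.tile_tokens_dim)

    # forward() and get_valid_tactics() are omitted here; they would read the
    # stored scalars while the autotuner inputs list carries tensors only.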

Collaborator Author

To clarify, do you mean to pass the scalar values to the constructor of FP8BlockScaleMoERunner and store them in there? That's simple enough to do

Collaborator Author

I may also need to make them part of the instance_key for the runner dict if I do that.
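
For illustration only, an instance_key cache over the scalar arguments could look like the snippet below; the helper name and module-level dict are hypothetical and not part of this PR.

_runner_dict = {}


def _get_runner(*scalar_args):
    # Each distinct combination of scalar arguments gets its own
    # FP8BlockScaleMoERunner instance; the tuple of scalars is the instance_key.
    instance_key = tuple(scalar_args)
    if instance_key not in _runner_dict:
        _runner_dict[instance_key] = FP8BlockScaleMoERunner(*scalar_args)
    return _runner_dict[instance_key]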

return std::make_tuple(workspace_size_fc1, workspace_size_fc2);
}

void Runner::run(MoERunnerArgs const& args, MoEWorkspace const& workspace, int device, cudaStream_t stream)
std::vector<int64_t> Runner::getValidConfigIndices(
int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numLocalExperts, int32_t numTokens) const
Collaborator

Does the list of valid configs depend on scalar values such as topK? If so, I think we should put those values in the cache key to guard against incorrect cache reuse.

Collaborator Author

I will check. If it does, I will do as you suggest.
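
If the valid config list does turn out to depend on those scalars, one way to extend the earlier sketch is to fold them into the cache key as well. Attribute names below are assumptions; this only illustrates the reviewer's suggestion, not the final code.

from tensorrt_llm._torch.autotuner import TunableRunner  # assumed import path


class FP8BlockScaleMoERunner(TunableRunner):
    # Constructor as in the earlier sketch; only the cache-key logic changes.

    def _cache_key(self):
        # Every scalar that getValidConfigIndices depends on goes into the key,
        # so a tuned config is never reused for an incompatible problem.
        return (self.tile_tokens_dim, self.top_k, self.intermediate_size,
                self.num_experts)

    def __hash__(self):
        return hash(self._cache_key())

    def __eq__(self, other):
        return (isinstance(other, FP8BlockScaleMoERunner)
                and self._cache_key() == other._cache_key())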

Comment on lines +635 to +648
if use_autotune:
    with autotune():
        output = torch.ops.trtllm.fp8_block_scale_moe_runner(
            expert_logits, routing_bias, hidden_states, hidden_states_scale,
            gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
            num_experts, top_k, n_groups, top_k_groups, intermediate_size,
            0, num_experts, routed_scaling, tile_tokens_dim,
            routing_method_type)
else:
    output = torch.ops.trtllm.fp8_block_scale_moe_runner(
        expert_logits, routing_bias, hidden_states, hidden_states_scale,
        gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
        num_experts, top_k, n_groups, top_k_groups, intermediate_size, 0,
        num_experts, routed_scaling, tile_tokens_dim, routing_method_type)
Collaborator

Suggested change (replace the if/else above with a single autotune-context call):
with autotune(use_autotune):
    output = torch.ops.trtllm.fp8_block_scale_moe_runner(
        expert_logits, routing_bias, hidden_states, hidden_states_scale,
        gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
        num_experts, top_k, n_groups, top_k_groups, intermediate_size,
        0, num_experts, routed_scaling, tile_tokens_dim,
        routing_method_type)

@DomBrown DomBrown requested a review from Copilot June 14, 2025 17:31
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates the FP8 block‐scale MoE kernel into the PyTorch autotuning workflow and updates tests and C++ kernels accordingly.

  • Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
  • Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
  • Extends the unit test to run both autotuned and non-autotuned code paths.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Summary per file:

  • tests/unittest/_torch/thop/test_moe.py: Parametrized test_moe_fp8 to toggle autotuning and wrapped the call in with autotune()
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py: New custom-op module with FP8BlockScaleMoERunner and its registration
  • tensorrt_llm/_torch/custom_ops/__init__.py: Exported fp8_block_scale_moe_runner in __all__
  • tensorrt_llm/_torch/autotuner.py: Generalized non-tensor handling in choose_one input shapes
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp: Added moeConfigIndex to workspace and run calls
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp: Wrapped the kernel in a Torch custom class and threaded configIndex through
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h: Added methods to enumerate and validate config indices
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu: Propagated configIndex into the CUDA runner and workspace sizing
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h: Removed the optional configIndex, now mandatory
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp: Updated all overloads to take an int32_t configIndex
Comments suppressed due to low confidence (4)

tests/unittest/_torch/thop/test_moe.py:578

  • [nitpick] It may help readability and debugging to include ids for all parametrized arguments (e.g., also name the hidden_size and intermediate_size cases), so test reports clearly show which parameters were used for each invocation.
@pytest.mark.parametrize("use_autotune", [True, False],

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py:35

  • [nitpick] Consider adding a class-level docstring for FP8BlockScaleMoERunner to describe its role and parameters, which will improve discoverability and maintenance.
class FP8BlockScaleMoERunner(TunableRunner):

tensorrt_llm/_torch/autotuner.py:344

  • [nitpick] The comment could mention that this branch also covers scalar (non-Tensor) arguments, not just optional inputs, for clarity.
# Treat non-tensors as size zero

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h:125

  • [nitpick] Since the new getWorkspaceSizeInBytes now requires a configIndex, consider providing an overload or a default value to preserve backward compatibility for callers that do not supply this argument.
size_t getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numExperts,
