[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner #5207


Open
wants to merge 3 commits into main

Conversation

DomBrown
Collaborator

Description

Autotuner integration for the PyTorch workflow with the TRT-LLM Gen FP8 block scale MoE kernel.

Test Coverage

tests/unittest/_torch/thop/test_moe.py

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
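
For example, a typical invocation that combines flags from the list above (flag names and the stage string are taken directly from this help text):

/bot run --disable-fail-fast --stage-list "A10-1"

This launches the build/test pipeline, restricts testing to the A10-1 stage, and keeps the pipeline running past individual build/test failures.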

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without due care and validation can break the top of tree.

@DomBrown DomBrown requested review from nekorobov and hyukn June 13, 2025 16:31
@DomBrown DomBrown self-assigned this Jun 13, 2025
@DomBrown DomBrown requested a review from a team as a code owner June 13, 2025 16:31
@DomBrown DomBrown requested a review from hlu1 June 13, 2025 16:31
@DomBrown
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8836 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8836 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6423 completed with status: 'FAILURE'

Signed-off-by: Dom Brown <[email protected]>
@DomBrown
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8848 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8848 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6435 completed with status: 'SUCCESS'


kernel_runner = FP8BlockScaleMoERunner(tile_tokens_dim)

inputs = [
Collaborator

My guess is that not all of the scalar values are meant to go into the cache key, but we still need them for the kernel_runner; only the scalar value tile_tokens_dim actually matters.
This approach happens to work because we treat the scalars as size(0) tensors in the cache key, but the original design of the inputs list was intended for tensor data only.
My suggestion is the following (a rough sketch is shown after the list):

  • Still store all the attributes on the runner for use in forward and get_valid_tactics.
  • Override __hash__ so that only tile_tokens_dim goes into the cache key.
  • Pass only tensors through inputs.
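
A rough sketch of that suggestion (this is not the merged implementation; the import path for TunableRunner and the exact set of scalar attributes are assumptions for illustration):

from tensorrt_llm._torch.autotuner import TunableRunner  # assumed import path


class FP8BlockScaleMoERunner(TunableRunner):

    def __init__(self, tile_tokens_dim, num_experts, top_k, routing_method_type):
        # Keep every scalar on the instance so forward() and get_valid_tactics()
        # can still read them (only a subset of the scalars is shown here).
        self.tile_tokens_dim = tile_tokens_dim
        self.num_experts = num_experts
        self.top_k = top_k
        self.routing_method_type = routing_method_type

    def __hash__(self):
        # Only tile_tokens_dim participates in the tuning cache key.
        return hash(self.tile_tokens_dim)

    def __eq__(self, other):
        return (isinstance(other, FP8BlockScaleMoERunner)
                and self.tile_tokens_dim == other.tile_tokens_dim)

    # forward() and get_valid_tactics() are omitted here; they would read the
    # stored scalars while the autotuner inputs list carries tensors only.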

Collaborator Author

To clarify, do you mean to pass the scalar values to the constructor of FP8BlockScaleMoERunner and store them in there? That's simple enough to do

Collaborator Author

I may also need to make them part of the instance_key for the runner dict if I do that.
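
For illustration only, an instance_key cache over the scalar arguments could look like the snippet below; the helper name and module-level dict are hypothetical and not part of this PR.

_runner_dict = {}


def _get_runner(*scalar_args):
    # Each distinct combination of scalar arguments gets its own
    # FP8BlockScaleMoERunner instance; the tuple of scalars is the instance_key.
    instance_key = tuple(scalar_args)
    if instance_key not in _runner_dict:
        _runner_dict[instance_key] = FP8BlockScaleMoERunner(*scalar_args)
    return _runner_dict[instance_key]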

return std::make_tuple(workspace_size_fc1, workspace_size_fc2);
}

void Runner::run(MoERunnerArgs const& args, MoEWorkspace const& workspace, int device, cudaStream_t stream)
std::vector<int64_t> Runner::getValidConfigIndices(
int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numLocalExperts, int32_t numTokens) const
Collaborator

Does the list of valid configs depend on scalar values such as topK? If so, I think we should put those values in the cache key to guard against incorrect cache reuse.

Collaborator Author

I will check. If it does, I will do as you suggest.
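
If the valid config list does turn out to depend on those scalars, one way to extend the earlier sketch is to fold them into the cache key as well. Attribute names below are assumptions; this only illustrates the reviewer's suggestion, not the final code.

from tensorrt_llm._torch.autotuner import TunableRunner  # assumed import path


class FP8BlockScaleMoERunner(TunableRunner):
    # Constructor as in the earlier sketch; only the cache-key logic changes.

    def _cache_key(self):
        # Every scalar that getValidConfigIndices depends on goes into the key,
        # so a tuned config is never reused for an incompatible problem.
        return (self.tile_tokens_dim, self.top_k, self.intermediate_size,
                self.num_experts)

    def __hash__(self):
        return hash(self._cache_key())

    def __eq__(self, other):
        return (isinstance(other, FP8BlockScaleMoERunner)
                and self._cache_key() == other._cache_key())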

Comment on lines +635 to +648
if use_autotune:
    with autotune():
        output = torch.ops.trtllm.fp8_block_scale_moe_runner(
            expert_logits, routing_bias, hidden_states, hidden_states_scale,
            gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
            num_experts, top_k, n_groups, top_k_groups, intermediate_size,
            0, num_experts, routed_scaling, tile_tokens_dim,
            routing_method_type)
else:
    output = torch.ops.trtllm.fp8_block_scale_moe_runner(
        expert_logits, routing_bias, hidden_states, hidden_states_scale,
        gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
        num_experts, top_k, n_groups, top_k_groups, intermediate_size, 0,
        num_experts, routed_scaling, tile_tokens_dim, routing_method_type)
Collaborator

Suggested change (replace the if/else above with a single autotune-context call):
with autotune(use_autotune):
    output = torch.ops.trtllm.fp8_block_scale_moe_runner(
        expert_logits, routing_bias, hidden_states, hidden_states_scale,
        gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
        num_experts, top_k, n_groups, top_k_groups, intermediate_size,
        0, num_experts, routed_scaling, tile_tokens_dim,
        routing_method_type)

@DomBrown DomBrown requested a review from Copilot June 14, 2025 17:31
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates the FP8 block‐scale MoE kernel into the PyTorch autotuning workflow and updates tests and C++ kernels accordingly.

  • Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
  • Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
  • Extends the unit test to run both autotuned and non-autotuned code paths.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Summary per file:

  • tests/unittest/_torch/thop/test_moe.py: Parametrized test_moe_fp8 to toggle autotuning and wrapped the call in with autotune()
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py: New custom-op module with FP8BlockScaleMoERunner and its registration
  • tensorrt_llm/_torch/custom_ops/__init__.py: Exported fp8_block_scale_moe_runner in __all__
  • tensorrt_llm/_torch/autotuner.py: Generalized non-tensor handling in choose_one input shapes
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp: Added moeConfigIndex to workspace and run calls
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp: Wrapped the kernel in a Torch custom class and threaded configIndex through
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h: Added methods to enumerate and validate config indices
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu: Propagated configIndex into the CUDA runner and workspace sizing
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h: Removed the optional configIndex, now mandatory
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp: Updated all overloads to take an int32_t configIndex
Comments suppressed due to low confidence (4)

tests/unittest/_torch/thop/test_moe.py:578

  • [nitpick] It may help readability and debugging to include ids for all parametrized arguments (e.g., also name the hidden_size and intermediate_size cases), so test reports clearly show which parameters were used for each invocation.
@pytest.mark.parametrize("use_autotune", [True, False],

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py:35

  • [nitpick] Consider adding a class-level docstring for FP8BlockScaleMoERunner to describe its role and parameters, which will improve discoverability and maintenance.
class FP8BlockScaleMoERunner(TunableRunner):

tensorrt_llm/_torch/autotuner.py:344

  • [nitpick] The comment could mention that this branch also covers scalar (non-Tensor) arguments, not just optional inputs, for clarity.
# Treat non-tensors as size zero

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h:125

  • [nitpick] Since the new getWorkspaceSizeInBytes now requires a configIndex, consider providing an overload or a default value to preserve backward compatibility for callers that do not supply this argument.
size_t getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numExperts,
