[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner #5207
base: main
Conversation
… kernel autotuner Signed-off-by: Dom Brown <[email protected]>
Signed-off-by: Dom Brown <[email protected]>
/bot run
PR_Github #8836 [ run ] triggered by Bot
PR_Github #8836 [ run ] completed with state
Signed-off-by: Dom Brown <[email protected]>
/bot run
PR_Github #8848 [ run ] triggered by Bot
PR_Github #8848 [ run ] completed with state
kernel_runner = FP8BlockScaleMoERunner(tile_tokens_dim)

inputs = [
My guess is that not all of the scalar values need to go into the cache key, but we still need them for the `kernel_runner`; only the scalar value `tile_tokens_dim` matters there. This approach happens to work because we treat the scalars as size-0 tensors in the cache key, but the inputs list was originally designed to carry tensor data only. My suggestion is the following (see the sketch below):

- Still keep all the attributes on the runner for use in `forward` and `get_valid_tactics`.
- Override `__hash__` so that only `tile_tokens_dim` goes into the cache key.
- Pass only tensors in the inputs.
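A minimal sketch of this proposal (not the final implementation; the import path and the extra constructor arguments are assumptions for illustration):

```python
from tensorrt_llm._torch.autotuner import TunableRunner  # import path assumed


class FP8BlockScaleMoERunner(TunableRunner):
    """Sketch: scalars live on the runner; the cache key sees only tile_tokens_dim."""

    def __init__(self, tile_tokens_dim: int, top_k: int, num_experts: int):
        # Kept on the instance so forward() and get_valid_tactics() can use them.
        self.tile_tokens_dim = tile_tokens_dim
        self.top_k = top_k
        self.num_experts = num_experts

    def __hash__(self):
        # Only tile_tokens_dim participates in the autotuner cache key.
        return hash(self.tile_tokens_dim)

    def __eq__(self, other):
        return (isinstance(other, FP8BlockScaleMoERunner)
                and self.tile_tokens_dim == other.tile_tokens_dim)
```

The inputs list passed to the tuner would then contain tensors only.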
To clarify, do you mean to pass the scalar values to the constructor of `FP8BlockScaleMoERunner` and store them there? That's simple enough to do.
I may also need to make them part of the `instance_key` for the runner dict if I do that.
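For example, the runner dict keyed on such an `instance_key` might look like this (names and key layout are placeholders, not the actual code):

```python
# Hypothetical cache of runner instances keyed on the scalar arguments.
_runner_cache: dict = {}


def get_runner(tile_tokens_dim: int, top_k: int, num_experts: int):
    instance_key = (tile_tokens_dim, top_k, num_experts)
    if instance_key not in _runner_cache:
        _runner_cache[instance_key] = FP8BlockScaleMoERunner(
            tile_tokens_dim, top_k, num_experts)
    return _runner_cache[instance_key]
```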
return std::make_tuple(workspace_size_fc1, workspace_size_fc2);
}

void Runner::run(MoERunnerArgs const& args, MoEWorkspace const& workspace, int device, cudaStream_t stream)

std::vector<int64_t> Runner::getValidConfigIndices(
    int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numLocalExperts, int32_t numTokens) const
Does the list of valid configs depend on scalar values such as `topK`? If so, I think we should put those values in the cache key to prevent incorrect cache reuse.
I will check. If it does, I will do as you suggest.
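If it does, one way the Python side could fold those scalars in (a sketch assuming the C++ `getValidConfigIndices` is exposed to Python as `get_valid_config_indices`; the input positions are also assumptions):

```python
# Hypothetical methods on FP8BlockScaleMoERunner (sketch only).
def get_valid_tactics(self, inputs):
    hidden_states = inputs[2]  # assumed position in the inputs list
    num_tokens, hidden_size = hidden_states.shape
    # Mirrors the C++ Runner::getValidConfigIndices signature shown above.
    return self.kernel_runner.get_valid_config_indices(
        self.top_k, hidden_size, self.intermediate_size,
        self.num_experts, num_tokens)


def __hash__(self):
    # Every scalar that changes the valid-config list also feeds the cache key,
    # so a tuned tactic is never reused for an incompatible problem.
    return hash((self.tile_tokens_dim, self.top_k,
                 self.intermediate_size, self.num_experts))
```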
if use_autotune:
    with autotune():
        output = torch.ops.trtllm.fp8_block_scale_moe_runner(
            expert_logits, routing_bias, hidden_states, hidden_states_scale,
            gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
            num_experts, top_k, n_groups, top_k_groups, intermediate_size,
            0, num_experts, routed_scaling, tile_tokens_dim,
            routing_method_type)
else:
    output = torch.ops.trtllm.fp8_block_scale_moe_runner(
        expert_logits, routing_bias, hidden_states, hidden_states_scale,
        gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
        num_experts, top_k, n_groups, top_k_groups, intermediate_size, 0,
        num_experts, routed_scaling, tile_tokens_dim, routing_method_type)
Suggested change: collapse the duplicated call by passing the flag to the context manager:

    with autotune(use_autotune):
        output = torch.ops.trtllm.fp8_block_scale_moe_runner(
            expert_logits, routing_bias, hidden_states, hidden_states_scale,
            gemm1_weights, gemm1_scales, gemm2_weights, gemm2_scales,
            num_experts, top_k, n_groups, top_k_groups, intermediate_size,
            0, num_experts, routed_scaling, tile_tokens_dim,
            routing_method_type)
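This works because the context manager itself takes the enable flag. A minimal sketch of that pattern (illustrative only, not the actual `tensorrt_llm` `autotune` implementation; the module-level flag is an assumption):

```python
from contextlib import contextmanager

# Global toggle standing in for the real autotuner state (assumption).
_AUTOTUNE_ENABLED = False


@contextmanager
def autotune(enable: bool = True):
    """Enable (or explicitly disable) tuning for the enclosed calls."""
    global _AUTOTUNE_ENABLED
    previous, _AUTOTUNE_ENABLED = _AUTOTUNE_ENABLED, enable
    try:
        yield
    finally:
        _AUTOTUNE_ENABLED = previous
```

Either way the test keeps a single call site and only toggles whether tuning is active.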
Pull Request Overview
This PR integrates the FP8 block-scale MoE kernel into the PyTorch autotuning workflow and updates tests and C++ kernels accordingly.

- Adds a new Python custom op (`fp8_block_scale_moe_runner`) and a `FP8BlockScaleMoERunner` class for autotuning.
- Updates the C++ MoE and batched GEMM kernels to accept a `configIndex` for workspace sizing and execution (see the sketch below).
- Extends the unit test to run both autotuned and non-autotuned code paths.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
File | Description
---|---
tests/unittest/_torch/thop/test_moe.py | Parametrized test_moe_fp8 to toggle autotune and wrapped call in with autotune()
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py | New custom-op module with FP8BlockScaleMoERunner and registration
tensorrt_llm/_torch/custom_ops/__init__.py | Exported fp8_block_scale_moe_runner in __all__
tensorrt_llm/_torch/autotuner.py | Generalized non-tensor handling in choose_one input shapes
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp | Added moeConfigIndex to workspace and run calls
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp | Wrapped kernel in a Torch custom class and threaded configIndex
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h | Added methods to enumerate and validate config indices
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu | Propagated configIndex into CUDA runner and workspace sizing
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h | Removed optional configIndex, now mandatory
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp | Updated all overloads to take an int32_t configIndex
Comments suppressed due to low confidence (4)

tests/unittest/_torch/thop/test_moe.py:578
- [nitpick] It may help readability and debugging to include `ids` for all parametrized arguments (e.g., also name the hidden_size and intermediate_size cases), so test reports clearly show which parameters were used for each invocation (see the sketch after this list).
  `@pytest.mark.parametrize("use_autotune", [True, False],`

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py:35
- [nitpick] Consider adding a class-level docstring for `FP8BlockScaleMoERunner` to describe its role and parameters, which will improve discoverability and maintenance.
  `class FP8BlockScaleMoERunner(TunableRunner):`

tensorrt_llm/_torch/autotuner.py:344
- [nitpick] The comment could mention that this branch also covers scalar (non-Tensor) arguments, not just optional inputs, for clarity.
  `# Treat non-tensors as size zero`

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h:125
- [nitpick] Since the new `getWorkspaceSizeInBytes` now requires a `configIndex`, consider providing an overload or a default value to preserve backward compatibility for callers that do not supply this argument.
  `size_t getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize, int32_t numExperts,`
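As an illustration of the first nitpick, named parametrize cases might look like this (placeholder values, not the ones used in `test_moe.py`):

```python
import pytest


@pytest.mark.parametrize("use_autotune", [True, False],
                         ids=["autotune", "no_autotune"])
@pytest.mark.parametrize("hidden_size", [128, 256], ids=lambda v: f"hidden{v}")
@pytest.mark.parametrize("intermediate_size", [256, 512],
                         ids=lambda v: f"inter{v}")
def test_moe_fp8(use_autotune, hidden_size, intermediate_size):
    ...  # test body elided; the ids above show up in each test report line
```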
Description
Autotuner integration for the PyTorch workflow: TRT-LLM Gen FP8 block-scale MoE.
Test Coverage
tests/unittest/_torch/thop/test_moe.py
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill
kill
Kill all running builds associated with pull request.
skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.