[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill #5128
base: main
Conversation
Commits
- …b/causal-conv1d
- …ze in RMSNorm
- …kernels (similar to the tests in tests/unittest/_torch/thop/test_mamba_conv1d_op.py)
- …nels for better numerical stability and support for initial states + varlen batching (AKA continuous batching)
- … prefill and decode kernels (similar to the tests in tests/unittest/_torch/thop/test_selective_scan_op.py)
- …round)
- …tiple tensors and torch.cat
- … Results in +25% throughput: (1) call convolution and SSM explicitly so no need for a special call to get conv states, (2) same dtype for conv and ssm states, (3) remove unused code - causal_conv1d_varlen_states, mamba_split_conv1d_scan_combined
- …ause instead of duplicating code
- …g forward pass
- …ng. conv weights are already in correct shape
- …RTLLM in-house mamba_conv1d kernel
- …. Use standard TRTLLM types and macros when needed
- …-LLM into fix-nemotron-h-warmup
- …de Nemotron-H forward pass. This is done instead of preparing cu_seqlens and seq_idx in MambaCacheManager, for better code separation, and also because in MambaCacheManager.prepare_resources() attn_metadata is not updated yet and we need it to create cu_seqlens efficiently. This also makes the flow similar to the regular attn_metadata flow: create it if needed and prepare it before the forward pass. The difference is that for regular attention this is done in PyTorchModelEngine, while for mamba we do it inside the model forward, since hybrid models are still a special case and we want to isolate the relevant code.
- (they appear first in attn_metadata.seq_lens_cuda)
- …Manager, as self.cu_seqlens and self.seq_idx don't exist anymore
Pull Request Overview
This PR optimizes the Mamba2Mixer prefill performance by reducing dynamic memory allocations and host-to-device copies. Key changes include:
- Pre-allocating state indices on the correct device in the resource manager.
- Introducing a new Mamba2Metadata class to pre-compute and store metadata for varlen batched prefill.
- Refactoring the Mamba2Mixer forward pass to separate prefill and decode logic and eliminate a redundant loop.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | Assigns the correct device for state indices to minimize host-to-device transfers. |
| tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py | Refactors forward pass to split prefill and decode logic using new metadata and indices. |
| tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py | Introduces a metadata class to compute and hold cu_seqlens and sequence indices. |
| tensorrt_llm/_torch/models/modeling_nemotron_h.py | Integrates mamba metadata into layer forward passes for optimized processing. |
Comments suppressed due to low confidence (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py:598
- Consider adding an inline comment explaining that the device is explicitly set using self.ssm_states.device to ensure correct GPU allocation, and verify that self.ssm_states is initialized prior to this call.
self.state_indices = torch.as_tensor(state_indices,
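A minimal sketch of what the suggested comment would document, assuming the surrounding MambaCacheManager code; the helper name and dtype here are illustrative, not the PR's exact code:

```python
import torch

def build_state_indices(state_indices: list, ssm_states: torch.Tensor) -> torch.Tensor:
    # Illustrative stand-in for the reviewed line: the index tensor is created
    # directly on the device that already holds the SSM state cache, so the
    # forward pass does not pay for an extra host-to-device copy later.
    # ssm_states is assumed to be allocated on the target GPU beforehand.
    return torch.as_tensor(state_indices, dtype=torch.int32, device=ssm_states.device)
```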
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py:178
- [nitpick] Consider adding a brief comment to clarify that state_indices is split into prefill and decode subsets based on the computed batch_split_size, which will aid future maintainers in understanding the code logic.
state_indices_p, state_indices_d = torch.split(state_indices, batch_split_size)
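A brief sketch of the logic such a comment would describe, under the assumption that prefill requests are ordered before decode requests in the batch (as the commit about attn_metadata.seq_lens_cuda notes); num_prefills and num_decodes are illustrative names, not the PR's exact variables:

```python
import torch

def split_state_indices(state_indices: torch.Tensor, num_prefills: int, num_decodes: int):
    # state_indices holds one cache-slot index per request, with prefills first.
    # Splitting at the prefill/decode boundary yields the indices used by the
    # varlen prefill kernels and the single-token decode kernels respectively.
    batch_split_size = [num_prefills, num_decodes]
    state_indices_p, state_indices_d = torch.split(state_indices, batch_split_size)
    return state_indices_p, state_indices_d
```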
… prepare() is called (Co-authored-by: Copilot)
/bot run
/bot run
PR_Github #8670 [ run ] triggered by Bot
PR_Github #8670 [ run ] completed with state
Description
Currently on main, the mamba2 block forward pass has some dynamic memory allocations and host-to-device copies, greatly hurting its performance. This PR improves performance by minimizing these memory operations:
- Pre-allocate the `state_indices` tensors on device instead of moving them to device during the forward pass.
- New `Mamba2Metadata` class holding the `cu_seqlens` and `seq_idx` tensors needed for varlen batched prefill of the SSM op. Compute them from `attn_metadata` at the start of the model forward pass instead of doing it inside the `Mamba2Mixer` block of each mamba layer. This also means we create these tensors once and not multiple times in each layer (a sketch of this flow is shown below).
- Remove the redundant loop in the `Mamba2Mixer` forward pass. Replace it with 2 if statements, checking whether we have prefills / decodes in the current batch.

These changes removed many of the GPU bubbles present in the mamba forward block, as seen in these profiles. They lead to a 40% latency reduction.
Benchmarks
The modifications in this PR improved Nemotron-H max throughput by ~15% and min prefill latency by ~40% without sacrificing quality:
MMLU results
Performance benchmarks
Benchmark setting:
max throughput results:
min latency results: