🚀 The feature, motivation and pitch
For low-concurrency cases, the kernels that run after the all-reduce and before the first fused_moe kernel take about as long as the first fused_moe kernel itself. At low concurrency these kernels launch with very few CTAs, so each one uses only a small fraction of the GPU, and because there are so many of them we also lose a lot of time to inter-kernel gaps.
For concurrency 4 on H200 with EP8:
triton_red_Fused__to_copy_add_mean_mul_pow_rqsrt_0: 3.42us
nvjet_tst_64x8_64x16_4x1_v_bz_splitK_TNT: 4.64us
splitKReduce_kernel: 1.66us
per_token_group_quant_8bit_kernel: 2.18us
sm90_fp8_gemm_1d2d_impl: 8.77us
elementwise_kernel: 2.18us
elementwise_kernel: 2.02us
per_token_group_quant_8bit_kernel: 1.73us
sm90_fp8_gemm_1d2d_impl: 2.78us
triton_poi_fused__to_copy_add_sigmoid_0: 1.31us
topk_with_k2_kernel: 1.34us
group_idx_and_topk_idx_kernel: 6.50us
triton_poi_fused__to_copy_0: 0.93us
per_token_group_quant_8bit_kernel: 2.08us
moe_align_block_size_kernel: 3.87us
count_and_sort_expert_tokens_kernel: 1.25us
unrolled_elementwise_kernel: 1.60us
index_elementwise_kernel: 2.21us
Total, including inter-kernel gaps: 52.1us.
For comparison, fused_moe: 51.1us.
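For reference, a breakdown like the one above can be reproduced with the PyTorch profiler. This is only a minimal sketch; `decode_step` is a hypothetical stand-in for one forward step at concurrency 4:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    decode_step()                 # hypothetical: one decode iteration at concurrency 4
    torch.cuda.synchronize()

# aggregated per-kernel CUDA times
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
# the timeline view is what shows the inter-kernel gaps
prof.export_chrome_trace("decode_step.json")
```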
Where possible, we should fuse these kernels for the low-concurrency case to bring this time down.
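One concrete candidate is fusing the residual add + RMSNorm (triton_red_Fused__to_copy_add_mean_mul_pow_rqsrt_0) with the per-token-group FP8 quant (per_token_group_quant_8bit_kernel) that feeds the first FP8 GEMM. Below is a minimal Triton sketch of that fusion, not vLLM's actual implementation: the function names, one-program-per-token launch, and scale convention (scale = amax / fp8_max, dequant by multiplication) are all assumptions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _fused_add_rmsnorm_group_quant(
    x_ptr,       # [num_tokens, hidden_size] all-reduce output
    res_ptr,     # [num_tokens, hidden_size] residual
    w_ptr,       # [hidden_size] RMSNorm weight
    q_ptr,       # [num_tokens, hidden_size] fp8 output
    scale_ptr,   # [num_tokens, num_groups] per-group scales
    hidden_size, num_groups, eps, fp8_max,
    GROUP_SIZE: tl.constexpr, NUM_GROUPS_POW2: tl.constexpr,
):
    # One program per token; the whole hidden dimension is held in registers.
    row = tl.program_id(0)
    g = tl.arange(0, NUM_GROUPS_POW2)
    e = tl.arange(0, GROUP_SIZE)
    offs = g[:, None] * GROUP_SIZE + e[None, :]          # (groups, group_size)
    mask = offs < hidden_size

    x = tl.load(x_ptr + row * hidden_size + offs, mask=mask, other=0.0).to(tl.float32)
    r = tl.load(res_ptr + row * hidden_size + offs, mask=mask, other=0.0).to(tl.float32)
    x = x + r                                            # residual add
    # write the updated residual back for the next layer (assumed in-place convention)
    tl.store(res_ptr + row * hidden_size + offs, x.to(res_ptr.dtype.element_ty), mask=mask)

    # RMSNorm over the whole hidden dimension
    var = tl.sum(tl.sum(x * x, axis=1), axis=0) / hidden_size
    y = x * (1.0 / tl.sqrt(var + eps))
    w = tl.load(w_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    y = y * w

    # per-token-group quantization: one scale per GROUP_SIZE contiguous elements
    amax = tl.max(tl.abs(y), axis=1)                     # (groups,)
    scale = tl.maximum(amax, 1e-10) / fp8_max
    q = (y / scale[:, None]).to(q_ptr.dtype.element_ty)

    tl.store(q_ptr + row * hidden_size + offs, q, mask=mask)
    tl.store(scale_ptr + row * num_groups + g, scale, mask=g < num_groups)


def fused_add_rmsnorm_group_quant(x, residual, weight, group_size=128, eps=1e-6):
    num_tokens, hidden_size = x.shape
    num_groups = triton.cdiv(hidden_size, group_size)
    q = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scales = torch.empty((num_tokens, num_groups), dtype=torch.float32, device=x.device)
    _fused_add_rmsnorm_group_quant[(num_tokens,)](
        x, residual, weight, q, scales,
        hidden_size, num_groups, eps,
        torch.finfo(torch.float8_e4m3fn).max,
        GROUP_SIZE=group_size,
        NUM_GROUPS_POW2=triton.next_power_of_2(num_groups),
    )
    return q, scales
```

This would replace two kernel launches (plus their gap) with one; similar pairings (e.g. the sigmoid routing epilogue with top-k selection) could be handled the same way.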
Alternatives
TensorRT-LLM launches the equivalent kernels on two separate CUDA streams to better utilize the GPU, and it also fuses some of them; as a result, the same span runs in ~25us.
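For illustration, here is a minimal sketch of the two-stream idea in PyTorch. It is not TensorRT-LLM's actual implementation; `shared_experts_fn` and `router_fn` are hypothetical stand-ins for the shared-expert GEMM path and the sigmoid/top-k/quant routing path.

```python
import torch

def overlapped_moe_prologue(hidden_states, shared_experts_fn, router_fn):
    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()

    # Launch the shared-expert GEMMs on a side stream so they overlap with the
    # small routing kernels issued on the main stream.
    side.wait_stream(main)                 # side stream must see the all-reduce output
    with torch.cuda.stream(side):
        shared_out = shared_experts_fn(hidden_states)
    hidden_states.record_stream(side)      # keep the buffer alive for the side stream

    topk_weights, topk_ids = router_fn(hidden_states)

    # Re-join: main stream must not consume shared_out before the side stream is done.
    main.wait_stream(side)
    return shared_out, topk_weights, topk_ids
```

The overlap only helps while each path leaves enough of the GPU idle for the other, which is exactly the low-concurrency regime described above.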
Additional context
This is mentioned in #24629 as potential fusion 8.