[FEAT] [ROCm]: AITER Fused MOE V1 Support #16752
Description
This PR enables AITER's fused Mixture-of-Experts (MoE) ops, found here, to be used with the V1 engine.
Implementation
The following ops have been added or modified and registered as custom ops (a registration sketch follows the list):
- `rocm_aiter_ck_moe`
- `rocm_aiter_fmoe_fp8_blockscale_g1u1`
- `rocm_aiter_asm_moe`
- `rocm_aiter_topk_softmax`
- `rocm_aiter_shuffle_weight`
- `rocm_aiter_asm_moe_tkw1`
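For context, the sketch below shows one way such a kernel can be registered as a PyTorch custom op so it stays an opaque, traceable node under the V1 engine and torch.compile. This is an illustrative sketch only, not the PR's registration code: the `moe_demo` namespace, the simplified `topk_softmax` signature, and the reference body are assumptions, and the actual AITER kernel call is not reproduced.

```python
import torch


@torch.library.custom_op("moe_demo::topk_softmax", mutates_args=())
def topk_softmax(gating_output: torch.Tensor, topk: int) -> torch.Tensor:
    """Return per-token routing weights for the top-k experts."""
    # On ROCm, the body would dispatch to the AITER kernel instead of this
    # portable reference implementation (the AITER signature is not shown here).
    weights = torch.softmax(gating_output, dim=-1, dtype=torch.float32)
    topk_weights, _ = torch.topk(weights, topk, dim=-1)
    return topk_weights.to(gating_output.dtype)


@topk_softmax.register_fake
def _(gating_output: torch.Tensor, topk: int) -> torch.Tensor:
    # Shape/dtype-only implementation so the op can be traced symbolically.
    return gating_output.new_empty(gating_output.shape[:-1] + (topk,))


# Example: 8 tokens routed over 16 experts, keeping the top 2 per token.
weights = topk_softmax(torch.randn(8, 16), 2)
print(weights.shape)  # torch.Size([8, 2])
```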
Testing
The integration has been verified through:
Accuracy Test GSM8K
The following command was used to run lm_eval on the models listed below:
Additionally, we set some extra environment variables and arguments for certain models, as specified below (an illustrative invocation follows the list):
Llama-4-Maverick-17B-128E-Instruct:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`
- `--quantization=fp8`

Llama-4-Maverick-17B-128E-Instruct-FP8:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`

DeepSeek-V3:
- `VLLM_USE_V1=0`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=1`

Mixtral-8x7B-Instruct-v0.1:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`
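Since the exact lm_eval command is not reproduced above, the invocation below is a hypothetical example only: the model path, `tensor_parallel_size`, and batch size are assumptions, while the environment variables mirror the Mixtral-8x7B-Instruct-v0.1 settings listed above.

```bash
# Hypothetical reproduction command (illustrative; not the exact command used in this PR).
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0

lm_eval --model vllm \
  --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=8 \
  --tasks gsm8k \
  --batch_size auto
```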
*Note*: Setting `VLLM_ROCM_USE_AITER=1` and `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0` effectively dispatches `rocm_aiter_ck_moe` as the fused expert function.
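As a rough illustration of how these flags interact, here is a minimal sketch of flag-driven kernel selection. It is not the PR's actual dispatch code: the function name and the non-AITER fallback are placeholders, and the block-scale branch is inferred from the flag and op names above.

```python
import os


def select_fused_moe_impl() -> str:
    """Illustrative sketch of how the two flags could gate kernel selection."""
    use_aiter = os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"
    use_block_scale = (
        os.environ.get("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE", "0") == "1"
    )
    if not use_aiter:
        return "default_fused_moe"  # non-AITER fallback path (name assumed)
    if use_block_scale:
        # Inferred from the flag name; used e.g. for the DeepSeek-V3 run above.
        return "rocm_aiter_fmoe_fp8_blockscale_g1u1"
    return "rocm_aiter_ck_moe"  # per the note above
```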
The table below shows the lm_eval results:
This PR is part of a larger effort to integrate AITER kernels into vLLM for improved performance on the ROCm platform.