
[FEAT] [ROCm]: AITER Fused MOE V1 Support #16752


Open · wants to merge 9 commits into main

Conversation

vllmellm (Contributor) commented Apr 17, 2025

Description

This PR enables AITER's fused Mixture-of-Experts (MoE) ops, found here, to be used with vLLM V1.

Implementation

The following ops have been added/modified and registered as custom ops (a registration sketch follows the list):

  1. rocm_aiter_ck_moe
  2. rocm_aiter_fmoe_fp8_blockscale_g1u1
  3. rocm_aiter_asm_moe
  4. rocm_aiter_topk_softmax
  5. rocm_aiter_shuffle_weight
  6. rocm_aiter_asm_moe_tkw1
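
For context, below is a minimal sketch of how one of these kernels can be wrapped and registered as a PyTorch custom op so that torch.compile and the V1 engine treat it as an opaque call. It uses the torch.library.custom_op API as one possible mechanism (vLLM also has its own direct_register_custom_op helper); the AITER import path and the fake-impl output shape are assumptions, not this PR's exact code.

```python
import torch

# Sketch: wrap an AITER kernel as a registered custom op so the compiler
# treats it as an opaque call. The import path below is an assumption.
@torch.library.custom_op("vllm::rocm_aiter_shuffle_weight", mutates_args=())
def rocm_aiter_shuffle_weight(w: torch.Tensor) -> torch.Tensor:
    from aiter.ops.shuffle import shuffle_weight  # assumed AITER entry point
    return shuffle_weight(w)

# Fake (meta) implementation so shape inference works without a GPU.
@rocm_aiter_shuffle_weight.register_fake
def _(w: torch.Tensor) -> torch.Tensor:
    return torch.empty_like(w)  # assumes shuffling preserves the shape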

Testing

The integration has been verified through:

  1. High-level integration tests with various models.
  2. Accuracy tests on GSM8K using lm_eval (lm-evaluation-harness).

Accuracy Test (GSM8K)

The following command was used to run lm_eval on the following models:

  • Llama-4-Maverick-17B-128E-Instruct
  • Llama-4-Maverick-17B-128E-Instruct-FP8
  • DeepSeek-V3
  • Mixtral-8x7B-Instruct-v0.1
```bash
VLLM_USE_TRITON_FLASH_ATTN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_ROCM_FP8_PADDING=1 \
VLLM_ROCM_MOE_PADDING=0 \
VLLM_ROCM_USE_AITER=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
SAFETENSORS_FAST_GPU=1 \
lm_eval \
--model vllm \
--model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=8,enforce_eager=False,max_model_len=4096 \
--trust_remote_code \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size auto
```

Additionally, we set some additional env vars/args for some models, as specified below:

Llama-4-Maverick-17B-128E-Instruct:

  • VLLM_USE_V1=1
  • VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0
  • --quantization=fp8

Llama-4-Maverick-17B-128E-Instruct-FP8:

  • VLLM_USE_V1=1
  • VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0

DeepSeek-V3:

  • VLLM_USE_V1=0
  • VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=1

Mixtral-8x7B-Instruct-v0.1:

  • VLLM_USE_V1=1
  • VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0

Note: Setting VLLM_ROCM_USE_AITER=1 and VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0 effectively dispatches rocm_aiter_ck_moe as the fused-experts function (see the sketch below).
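
To make that dispatch concrete, here is a minimal sketch of the selection logic implied by the note; the helper name, the op namespace (torch.ops.vllm.*), and the fallback path are illustrative rather than this PR's actual code.

```python
import torch
import vllm.envs as envs  # vLLM's central registry of env flags


def pick_fused_experts_impl():
    """Sketch of the env-driven MoE dispatch described in the note above."""
    if envs.VLLM_ROCM_USE_AITER:
        if envs.VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE:
            # FP8 block-scaled path (used for DeepSeek-V3 above).
            return torch.ops.vllm.rocm_aiter_fmoe_fp8_blockscale_g1u1
        # AITER enabled, block-scaled MoE disabled -> CK fused MoE.
        return torch.ops.vllm.rocm_aiter_ck_moe
    # Non-AITER fallback: vLLM's Triton fused_experts kernel.
    from vllm.model_executor.layers.fused_moe import fused_experts
    return fused_experts
```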

The table below shows the lm_eval results:

| Model | vLLM version | Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|---|
| Llama-4-Maverick-17B-128E-Instruct-BF16 | V1 | gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9272 | ± 0.0072 |
| | | | | strict-match | 5 | exact_match | 0.9272 | ± 0.0072 |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | V1 | gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9234 | ± 0.0073 |
| | | | | strict-match | 5 | exact_match | 0.9272 | ± 0.0072 |
| DeepSeek-V3 | V0 | gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9454 | ± 0.063 |
| | | | | strict-match | 5 | exact_match | 0.9454 | ± 0.063 |
| Mixtral-8x7B-Instruct-v0.1 | V1 | gsm8k | 3 | flexible-extract | 5 | exact_match | 0.5413 | ± 0.0137 |
| | | | | strict-match | 5 | exact_match | 0.5398 | ± 0.0137 |

This PR is part of a larger effort to integrate AITER kernels into vLLM for improved performance on the ROCm platform.

Co-authored-by: tjtanaa <[email protected]>
Signed-off-by: vllmellm <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

vllmellm marked this pull request as ready for review on April 23, 2025, 11:07