[FEAT] [ROCm]: AITER Fused MOE V1 Support #16752
Description
This PR enables AITER's fused Mixture-of-Experts (MoE) ops, found here, to be used with the V1 engine.
Implementation
The following ops have been added or modified and registered as custom ops (a registration sketch follows the list):
- `rocm_aiter_ck_moe`
- `rocm_aiter_fmoe_fp8_blockscale_g1u1`
- `rocm_aiter_asm_moe`
- `rocm_aiter_topk_softmax`
- `rocm_aiter_shuffle_weight`
- `rocm_aiter_asm_moe_tkw1`
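For context, the sketch below shows one way such a kernel can be registered as a PyTorch custom op so it stays an opaque, traceable node under the V1 engine and torch.compile. This is an illustrative sketch only, not the PR's registration code: the `moe_demo` namespace, the simplified `topk_softmax` signature, and the reference body are assumptions, and the actual AITER kernel call is not reproduced.

```python
import torch


@torch.library.custom_op("moe_demo::topk_softmax", mutates_args=())
def topk_softmax(gating_output: torch.Tensor, topk: int) -> torch.Tensor:
    """Return per-token routing weights for the top-k experts."""
    # On ROCm, the body would dispatch to the AITER kernel instead of this
    # portable reference implementation (the AITER signature is not shown here).
    weights = torch.softmax(gating_output, dim=-1, dtype=torch.float32)
    topk_weights, _ = torch.topk(weights, topk, dim=-1)
    return topk_weights.to(gating_output.dtype)


@topk_softmax.register_fake
def _(gating_output: torch.Tensor, topk: int) -> torch.Tensor:
    # Shape/dtype-only implementation so the op can be traced symbolically.
    return gating_output.new_empty(gating_output.shape[:-1] + (topk,))


# Example: 8 tokens routed over 16 experts, keeping the top 2 per token.
weights = topk_softmax(torch.randn(8, 16), 2)
print(weights.shape)  # torch.Size([8, 2])
```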
Testing
The integration has been verified through:
Accuracy Test GSM8K
The following command was used to run lm_eval on the models listed below:
Additionally, we set some extra environment variables and arguments for certain models, as specified below (an illustrative invocation follows the list):
Llama-4-Maverick-17B-128E-Instruct:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`
- `--quantization=fp8`

Llama-4-Maverick-17B-128E-Instruct-FP8:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`

DeepSeek-V3:
- `VLLM_USE_V1=0`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=1`

Mixtral-8x7B-Instruct-v0.1:
- `VLLM_USE_V1=1`
- `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0`
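Since the exact lm_eval command is not reproduced above, the invocation below is a hypothetical example only: the model path, `tensor_parallel_size`, and batch size are assumptions, while the environment variables mirror the Mixtral-8x7B-Instruct-v0.1 settings listed above.

```bash
# Hypothetical reproduction command (illustrative; not the exact command used in this PR).
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0

lm_eval --model vllm \
  --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=8 \
  --tasks gsm8k \
  --batch_size auto
```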
*Note*: Setting `VLLM_ROCM_USE_AITER=1` and `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=0` effectively dispatches `rocm_aiter_ck_moe` as the fused expert function.
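As a rough illustration of how these flags interact, here is a minimal sketch of flag-driven kernel selection. It is not the PR's actual dispatch code: the function name and the non-AITER fallback are placeholders, and the block-scale branch is inferred from the flag and op names above.

```python
import os


def select_fused_moe_impl() -> str:
    """Illustrative sketch of how the two flags could gate kernel selection."""
    use_aiter = os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"
    use_block_scale = (
        os.environ.get("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE", "0") == "1"
    )
    if not use_aiter:
        return "default_fused_moe"  # non-AITER fallback path (name assumed)
    if use_block_scale:
        # Inferred from the flag name; used e.g. for the DeepSeek-V3 run above.
        return "rocm_aiter_fmoe_fp8_blockscale_g1u1"
    return "rocm_aiter_ck_moe"  # per the note above
```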
The table below shows the lm_eval results:
This PR is part of a larger effort to integrate AITER kernels into vLLM for improved performance on the ROCm platform.