
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 #3389

Merged: 11 commits merged into vllm-project:main from the moe-kernel-tuning branch on Mar 14, 2024

Conversation

youkaichao (Member) opened this pull request.

@youkaichao changed the title from "[Kernel] change benchmark script so that result can be directly used" to "[Kernel][WIP] change benchmark script so that result can be directly used" on Mar 13, 2024
@youkaichao changed the title from "[Kernel][WIP] change benchmark script so that result can be directly used" to "[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8" on Mar 14, 2024
@youkaichao marked this pull request as ready for review on March 14, 2024 at 00:38
pcmoritz (Collaborator) commented on Mar 14, 2024:

Thanks for the refactorings! While you are touching this code, one thing that would be wonderful to do is to keep track of the timings for the best configuration for each batch size. This could e.g. be done by writing them to a separate file. This would allow you to decide if a new configuration is better than the old one.

Also note that running the script as-is will likely not produce optimal results in some settings, since there is a bunch of parameter pruning going on at the moment (e.g. for the batch size). Sometimes it is important to look at the values found and then expand the search space if it runs into the boundaries :)
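
As a rough illustration of the suggestion above, recording the winning config and its measured latency per batch size in a separate file could look something like this (the file name and dict layout are assumptions for the sketch, not something from this PR):

```python
import json

def save_best_timings(best_configs, best_times_us, path="moe_tuning_timings.json"):
    """Persist the winning config and its measured time for each batch size,
    so a later tuning run can be compared against the previous best."""
    record = {
        str(batch_size): {
            "config": best_configs[batch_size],
            "time_us": best_times_us[batch_size],
        }
        for batch_size in best_configs
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=4)

def is_improvement(batch_size, new_time_us, path="moe_tuning_timings.json"):
    """Return True if the new timing beats the previously recorded best."""
    try:
        with open(path) as f:
            old = json.load(f)
    except FileNotFoundError:
        return True  # no previous record, so accept the new config
    prev = old.get(str(batch_size))
    return prev is None or new_time_us < prev["time_us"]
```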

youkaichao (Member, Author):

@pcmoritz This manual kernel tuning is kind of temporary. Going forward, we plan to use triton.autotune to automatically tune these configs. So we don't need to invest too much time here.
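
For context, a minimal sketch of what the triton.autotune route could look like; the kernel name and the candidate config list here are illustrative, not the actual vLLM fused MoE kernel:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
                       "GROUP_SIZE_M": 8}, num_warps=4, num_stages=4),
        triton.Config({"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32,
                       "GROUP_SIZE_M": 16}, num_warps=4, num_stages=4),
        triton.Config({"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32,
                       "GROUP_SIZE_M": 1}, num_warps=8, num_stages=4),
    ],
    key=["M", "N", "K"],  # re-benchmark the candidates whenever the problem shape changes
)
@triton.jit
def moe_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                      BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                      BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr):
    # Kernel body elided; the point is that autotune times each candidate config
    # on the first launch for a given (M, N, K) and reuses the winner afterwards.
    pid = tl.program_id(axis=0)
```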

"2048": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 4},
"3072": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 1, "num_warps": 8, "num_stages": 4},
"4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4}
"1": {
WoosukKwon (Collaborator) commented on Mar 14, 2024:

The configuration here is quite different from the one we have right now. Could you compare the old and new ones by benchmarking the end-to-end performance (e.g., using benchmark_throughput.py on Mixtral)?

Collaborator:

Also, it'd be nice if you could benchmark the other configs as well, if not all of them.

youkaichao (Member, Author):

IIRC, LLaMA models don't use MoE. Do you mean Mixtral models?

Collaborator:

Yep, I mean Mixtral, not LLaMA.

youkaichao (Member, Author):

Do you have a common setting I should use for the throughput test? Otherwise I'm blindly running python benchmarks/benchmark_throughput.py --input-len 100 --output-len 100, and I'm not sure that 100 input tokens and 100 output tokens is the case people care about most.

youkaichao (Member, Author):

I will use python benchmarks/benchmark_throughput.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 from #2293 (comment).

pcmoritz (Collaborator):

> @pcmoritz This manual kernel tuning is kind of temporary. Going forward, we plan to use triton.autotune to automatically tune these configs. So we don't need to invest too much time here.

In my experience triton.autotune is far too slow to be useful (unless the configs have already been run / are cached) :)

youkaichao (Member, Author):

Command: python benchmarks/benchmark_throughput.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 2 (with -tp 4 and -tp 8 for the corresponding rows below)

H100 GPU:

| benchmark | no config | w/o this PR | w/ this PR |
| --- | --- | --- | --- |
| tp=2 | 11.73 requests/s, 12313.00 tokens/s | 14.11 requests/s, 14816.20 tokens/s | 15.69 requests/s, 16477.71 tokens/s |
| tp=4 | 18.07 requests/s, 18974.30 tokens/s | same as no config | 22.06 requests/s, 23166.51 tokens/s |
| tp=8 | 24.80 requests/s, 26039.89 tokens/s | same as no config | 28.17 requests/s, 29577.75 tokens/s |

A100 GPU: TODO (I don't have 8×A100 GPUs at hand right now)

@WoosukKwon the benchmarking results are quite promising!

youkaichao (Member, Author):

> In my experience triton.autotune is far too slow to be useful (unless the configs have already been run / are cached) :)

Will definitely try to cache tuned configs!
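
For illustration, one way the caching could work is to persist the tuned config per (device, batch size) and only fall back to the slow tuning path on a miss; the file name and function names below are assumptions, not this PR's implementation:

```python
import json
import os

import torch

_CACHE_PATH = "tuned_moe_configs.json"  # hypothetical on-disk cache

def get_or_tune_config(num_tokens, tune_fn):
    """Return a cached config for this (GPU, num_tokens) pair if present,
    otherwise run the expensive tuning function once and persist the result."""
    device = torch.cuda.get_device_name().replace(" ", "_")
    key = f"{device}:{num_tokens}"

    cache = {}
    if os.path.exists(_CACHE_PATH):
        with open(_CACHE_PATH) as f:
            cache = json.load(f)

    if key not in cache:
        cache[key] = tune_fn(num_tokens)  # slow path: benchmark candidate configs
        with open(_CACHE_PATH, "w") as f:
            json.dump(cache, f, indent=4)

    return cache[key]
```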

WoosukKwon (Collaborator):

@youkaichao Awesome! Could you 1) update the PR with the current main and 2) fix the lint error by running ./format.sh? You will need to run pip install -r requirements-dev.txt first.

youkaichao (Member, Author):

@WoosukKwon lint is good now 👌

@WoosukKwon enabled auto-merge (squash) on March 14, 2024 at 07:51
WoosukKwon (Collaborator) left a review:

LGTM! Thanks for the PR! Excited about the performance improvement!

@WoosukKwon enabled auto-merge (squash) on March 14, 2024 at 07:52
@WoosukKwon merged commit 8fe8386 into vllm-project:main on Mar 14, 2024
24 checks passed
starmpcc pushed a commit to starmpcc/vllm that referenced this pull request Mar 14, 2024
@youkaichao deleted the moe-kernel-tuning branch on March 14, 2024 at 16:19
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024