tensor parallel MOE implementation #2293

Closed · wants to merge 50 commits

Conversation

scv119 (Contributor) commented Dec 28, 2023

This PR implements tensor parallel MOE by sharding each expert across all ranks.

Concretely, it does the following:

  1. Column-parallelize each expert's w1 and w3 weights.
  2. Row-parallelize each expert's w2 weight.
  3. For each batch, group the per-request hidden states according to the routing decisions.
  4. Apply all expert MLPs using a grouped GEMM.
  5. Apply the routing weights.
  6. All-reduce to collect the results across TP ranks.
  7. Merge the per-request results across the different experts.
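
A rough PyTorch sketch of this flow (illustrative only, not the PR's code: the grouped GEMM in step 4 is replaced by a naive per-expert loop, and tp_all_reduce stands in for the tensor-parallel all-reduce):

import torch
import torch.nn.functional as F

def tensor_parallel_moe(hidden_states,  # [batch, hidden]
                        gate,           # [hidden, num_experts] router weights
                        w1s, w3s,       # [E, hidden, inter_shard], column-parallel shards (step 1)
                        w2s,            # [E, inter_shard, hidden], row-parallel shards (step 2)
                        top_k, tp_all_reduce):
    hidden = hidden_states.shape[-1]
    # Step 3: route each token to top_k experts and group rows by expert id.
    routing_weights, selected = torch.topk(
        F.softmax(hidden_states @ gate, dim=-1), top_k, dim=-1)
    expanded = hidden_states.repeat_interleave(top_k, dim=0)   # [batch * top_k, hidden]
    flat_experts = selected.reshape(-1)
    order = torch.argsort(flat_experts)
    grouped = expanded[order]
    # Step 4: apply every expert MLP (a naive loop standing in for the grouped GEMM).
    out = torch.empty_like(grouped)
    counts = torch.bincount(flat_experts, minlength=w1s.shape[0]).tolist()
    start = 0
    for e, n in enumerate(counts):
        if n == 0:
            continue
        x = grouped[start:start + n]
        out[start:start + n] = (F.silu(x @ w1s[e]) * (x @ w3s[e])) @ w2s[e]
        start += n
    # Step 5: scatter rows back to token order and apply routing weights.
    ungrouped = torch.empty_like(out)
    ungrouped[order] = out
    weighted = ungrouped * routing_weights.reshape(-1, 1)
    # Step 6: all-reduce the partial sums across TP ranks (w2 is row-parallel).
    weighted = tp_all_reduce(weighted)
    # Step 7: merge each token's top_k expert outputs.
    return weighted.reshape(-1, top_k, hidden).sum(dim=1)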

Benchmark results (latency in seconds; percentages are relative to the baseline):

A100 80G * 8, input_len=32, output_len=128

Baseline:
batch_size 1: 2.1385579633448892 seconds
batch_size 8: 2.428515106982862 seconds
batch_size 32: 2.9776507209753618 seconds
batch_size 64: 3.7744668100300864 seconds

This PR:
batch_size 1: 1.6442222506545174 seconds (77%)
batch_size 8: 2.3404843776564426 seconds (96%)
batch_size 32: 3.0149446266586892 seconds (101%)
batch_size 64: 3.878694705994955 seconds (103%)

A100 80G * 4, input_len=32, output_len=128

Baseline:
batch_size 1: 2.9904473346929685 seconds
batch_size 8: 3.2857296433260976 seconds
batch_size 32: 3.917926660312029 seconds
batch_size 64: 4.401127053638144 seconds

This PR:
batch_size 1: 1.6416094843492222 seconds (55%)
batch_size 8: 2.9794040496732728 seconds (91%)
batch_size 32: 3.631852053649103 seconds (93%)
batch_size 64: 4.388253151012274 seconds (100%)

csrc/bincount.cu: review thread resolved (outdated)
scv119 (Contributor, Author) commented Dec 29, 2023

Running into some weird torch.sort issues during CUDA graph capture...

WoosukKwon (Collaborator) commented:

Hi @scv119, thanks for addressing my comments! I haven't actually completed the review yet. Will add more tonight or tomorrow morning.

scv119 (Contributor, Author) commented Jan 4, 2024

@WoosukKwon just to let you know, the Triton grouped matmul returns different results from the torch reference implementation for large matrix multiplications, which is likely caused by triton-lang/triton#1190 (comment), but that's purely my speculation.

We might need to use https://github.com/imoneoi/cutlass_grouped_gemm if it matters.
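
For reference, a cross-check along these lines can surface the mismatch (a sketch: the grouped_matmul call and cum_offsets helper are assumptions about the PR's Triton wrapper, and the shapes are hypothetical Mixtral-like sizes):

import torch

def reference_grouped_matmul(x, group_sizes, weights):
    # Naive per-expert torch matmul, used to cross-check the fused grouped GEMM.
    outs, start = [], 0
    for n, w in zip(group_sizes, weights):
        outs.append(x[start:start + n] @ w)
        start += n
    return torch.cat(outs, dim=0)

sizes = [256] * 8
x = torch.randn(sum(sizes), 4096, dtype=torch.float16, device="cuda")
ws = [torch.randn(4096, 14336, dtype=torch.float16, device="cuda") for _ in sizes]

ref = reference_grouped_matmul(x, sizes, ws)
# triton_out = grouped_matmul(x, cum_offsets(sizes), torch.stack(ws))  # the PR's Triton path
# print((triton_out - ref).abs().max())  # large max-abs differences point at the kernel, not the routing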

@WoosukKwon self-requested a review January 16, 2024 21:47
scv119 (Contributor, Author) commented Jan 17, 2024

In 1089dd8, we delayed the all-reduce until after the results are merged by indices, which reduces the communication by half.
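
Roughly, the idea looks like this (a sketch assuming the expert outputs have already been scattered back to token order; dist.all_reduce stands in for vLLM's tensor-parallel all-reduce helper):

import torch
import torch.distributed as dist

def merge_then_all_reduce(expert_out,       # [batch * top_k, hidden], partial sums on this TP rank
                          routing_weights,  # [batch * top_k, 1]
                          top_k):
    hidden = expert_out.shape[-1]
    # Merge first: apply routing weights and sum each token's top_k expert outputs.
    merged = (expert_out * routing_weights).view(-1, top_k, hidden).sum(dim=1)  # [batch, hidden]
    # One all-reduce on the merged [batch, hidden] tensor instead of the top_k-times-larger
    # per-expert tensor: with Mixtral's top_k = 2, that halves the communication volume.
    dist.all_reduce(merged)
    return merged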

Comment on lines +177 to +180
grouped_w1_out = grouped_matmul(expanded_hidden_states,
                                cum_experts_range, w1s, "silu")
grouped_w3_out = grouped_matmul(expanded_hidden_states,
                                cum_experts_range, w3s)
Collaborator:

Can we merge w1s and w3s just like what we do for LlamaMLP? Merging the two weights will be highly efficient given the cost of the grouped GEMM.
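
One way this could look (a sketch, not the PR's code: each expert's w1 and w3 are concatenated along the output dimension into a hypothetical w13, so a single projection replaces two):

import torch
import torch.nn.functional as F

def fused_w13_expert_mlp(x, w13, w2):
    # x:   [num_tokens, hidden_size]
    # w13: [hidden_size, 2 * intermediate_shard]  (w1 and w3 stacked along the output dim)
    # w2:  [intermediate_shard, hidden_size]
    w1_out, w3_out = (x @ w13).chunk(2, dim=-1)   # one GEMM instead of two
    return (F.silu(w1_out) * w3_out) @ w2

In the grouped setting this would become a single grouped_matmul over the stacked w13s, followed by splitting the output and applying SiLU-and-mul before the w2 grouped GEMM.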

self,
expanded_hidden_states: torch.
Tensor, # [batch_size * top_k_experts, hidden_size]
reverse_indices, # [batch_size * top_k_experts]
Collaborator:

Suggested change:
-    reverse_indices, # [batch_size * top_k_experts]
+    reverse_indices: torch.Tensor, # [batch_size * top_k_experts]

Comment on lines +63 to +74
set_weight_attrs(self.w1s, {
    "weight_loader": self.weight_loader,
    "tp_type": "column"
})
set_weight_attrs(self.w2s, {
    "weight_loader": self.weight_loader,
    "tp_type": "row"
})
set_weight_attrs(self.w3s, {
    "weight_loader": self.weight_loader,
    "tp_type": "column"
})
Collaborator:

nit: Can we make this compatible with other parallel linear layers by tagging input_dim and output_dim instead of tp_type?
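
Following the convention of vLLM's other parallel linear layers, this might look like the following (a sketch; the weight storage layout, and therefore which dimension index gets tagged, is an assumption here):

# Assuming the weights are stored as [num_experts, in_features, out_features] so the
# grouped GEMM can compute x @ w directly (an assumption about this PR's layout):
#   w1s, w3s are column-parallel -> tag their output dim (the last one),
#   w2s is row-parallel          -> tag its input dim.
set_weight_attrs(self.w1s, {"weight_loader": self.weight_loader, "output_dim": 2})
set_weight_attrs(self.w2s, {"weight_loader": self.weight_loader, "input_dim": 1})
set_weight_attrs(self.w3s, {"weight_loader": self.weight_loader, "output_dim": 2})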

Comment on lines +272 to +277
expert_params_mapping = [
    # (param_name, weight_name, expert_id)
    (f"{weight_name}s", f"experts.{expert_id}.{weight_name}.weight",
     expert_id) for expert_id in range(self.config.num_local_experts)
    for weight_name in ["w1", "w2", "w3"]
]
Collaborator:

Here, do we assume that the expert linear layers don't have bias terms?

WoosukKwon (Collaborator) left a comment:

Thanks @scv119 for the updates! The PR looks good to me overall. For the grouped GEMM, I think we can investigate the Cutlass implementation later. I actually spent some time trying to understand it last weekend, but found it a bit difficult to follow. For now, I think the Triton kernel is acceptable, and it is actually needed for AMD GPUs anyway.

chu-tianxiang (Contributor) commented:

Any insights into how quantized models will be handled? There's a challenge regarding the weights: it may not be possible to concatenate them due to differences between experts. For instance, GPTQ might employ a distinct activation order per expert, and AWQ might use varying scales. Thank you.

linear_method=None)

self.w1s = nn.Parameter(
torch.empty(self.num_total_experts,
Collaborator:

If there are many experts, as in DeepSeekMoE, it is easy to OOM in this function. Any ideas to improve memory utilization?

scv119 (Contributor, Author) commented Jan 18, 2024

Thanks @WoosukKwon, will do another pass. Also, we noticed some poor performance on H100; we probably need to tune the kernel parameters a bit.

pcmoritz (Collaborator) commented:

On H100s, changing the number of SMs to 256 brought the best improvement in terms of throughput for me (but still not quite matching current master). It was measured with:

python benchmarks/benchmark_throughput.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 8 --num-prompts 1000

Current master: 28600 tok/s
Current PR: 24900 tok/s
New hyperparameters below: 27600 tok/s

All numbers have an error of about +/- 200 tok/s.

It is quite possible that by tuning more / autotuning we can get even better results here -- I'd love to learn about it if anybody has better parameters :)

diff --git a/vllm/model_executor/layers/moe.py b/vllm/model_executor/layers/moe.py
index 6d37884302..94bb9f0858 100644
--- a/vllm/model_executor/layers/moe.py
+++ b/vllm/model_executor/layers/moe.py
@@ -335,15 +335,13 @@ def grouped_matmul(input: torch.Tensor,
     BLOCK_SIZE_M = 16
     BLOCK_SIZE_N = 64
     BLOCK_SIZE_K = 32
-    num_warps = 2
-    NUM_SM = 128
+    num_warps = 4
+    NUM_SM = 256
     num_stages = 5
     # hand tune the block size for different problem sizes.
     if input.shape[0] >= 8:
-        num_warps = 4
         BLOCK_SIZE_N = 128
     if input.shape[0] >= 32:
-        num_warps = 4
         BLOCK_SIZE_M = 32
         BLOCK_SIZE_N = 128
     # we use a fixed number of CTA, and it's auto-tunable
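
As for autotuning, something along these lines could replace the hand-tuned constants (a sketch using Triton's standard autotuner; the candidate configs and the key are illustrative, not tuned values):

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
                       "NUM_SM": 128}, num_warps=2, num_stages=5),
        triton.Config({"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32,
                       "NUM_SM": 256}, num_warps=4, num_stages=5),
        triton.Config({"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32,
                       "NUM_SM": 256}, num_warps=4, num_stages=4),
    ],
    key=["num_tokens"],  # retune per distinct token count; a real setup would likely bucket this
)
@triton.jit
def grouped_matmul_kernel(
    # ... the PR's existing pointer / stride / group-size arguments would go here ...
    num_tokens,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr, NUM_SM: tl.constexpr,
):
    # Existing kernel body unchanged; the autotuner supplies the constexpr values above.
    pass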

@pcmoritz mentioned this pull request Jan 22, 2024
scv119 (Contributor, Author) commented Jan 30, 2024

I think one source of overhead in this PR is the many small elementwise operations that are not fused, according to my profile.
#2453 should be a better version of this one, so I'm closing this PR.
