[Attention] MLA with chunked prefill #12639
Conversation
This should be addressed by 1c59597. Without this commit I get:
With it I get:
NOTE: @pathorn found a bug when stress testing R1; I will post an update here when it is resolved. Edit: this should be resolved by 920ecc6#diff-00753a3c1f378f8b8c60e9eb10b94c3cbbfcea74fca6e66712e5d4ae360f6741
if attn_metadata.is_profile_run and \
        attn_metadata.chunked_prefill_workspace is not None:
    # During the profile run, simulate the worst-case output size of
    # `self.kv_b_proj(kv_c_normed)` in `_compute_prefill_context`,
    # since this intermediate can be large.
    _ = torch.empty(
        (attn_metadata.chunked_prefill_workspace.shape[0],
         self.num_heads, self.qk_nope_head_dim + self.v_head_dim),
        device=k_c_normed.device,
        dtype=k_c_normed.dtype,
    )
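For context, here is a minimal sketch (not the actual vLLM code) of the projection this dummy allocation stands in for. The function and argument names are illustrative only; the shapes mirror the allocation above.

import torch

def prefill_context_projection_sketch(
        kv_c_normed: torch.Tensor,   # [workspace_tokens, kv_lora_rank]
        kv_b_proj: torch.nn.Linear,  # kv_lora_rank -> num_heads * (qk_nope_head_dim + v_head_dim)
        num_heads: int,
        qk_nope_head_dim: int,
        v_head_dim: int):
    # The output of kv_b_proj scales with the number of workspace tokens,
    # which is why the profile run reserves a tensor of the same shape.
    kv = kv_b_proj(kv_c_normed).view(
        -1, num_heads, qk_nope_head_dim + v_head_dim)
    k_nope, v = kv.split([qk_nope_head_dim, v_head_dim], dim=-1)
    return k_nope, v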
Looks great - definitely feel good about the profile_run now
🎉
vllm/engine/arg_utils.py
-    if model_config.is_multimodal_model and model_config.use_mla:
+    if model_config.is_multimodal_model or model_config.use_mla:
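The change above widens the condition from requiring both properties to requiring either one, so MLA-only models (not just multimodal MLA models) also take this branch. A hypothetical illustration of the difference, assuming only the two config attributes shown in the diff:

class _Cfg:
    def __init__(self, is_multimodal_model: bool, use_mla: bool):
        self.is_multimodal_model = is_multimodal_model
        self.use_mla = use_mla

def hits_branch_old(cfg: _Cfg) -> bool:
    # Old gate: only models that are both multimodal and MLA.
    return cfg.is_multimodal_model and cfg.use_mla

def hits_branch_new(cfg: _Cfg) -> bool:
    # New gate: any model that is multimodal, MLA, or both.
    return cfg.is_multimodal_model or cfg.use_mla

# An MLA-only model is missed by the old gate but caught by the new one.
mla_only = _Cfg(is_multimodal_model=False, use_mla=True)
assert not hits_branch_old(mla_only)
assert hits_branch_new(mla_only)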
OK, yeah, that makes sense for some of the red tests.
We need to do more benchmarking to see whether this should be on by default in V0, but it lays the groundwork for a V1 implementation. (#13111 may help performance.)
Shout-out to @pathorn for assisting with hardening this PR
Future work: account for the `self.kv_b_proj(kv_c_normed)` intermediate in the profile run.