[Attention] MLA with chunked prefill #12639
Conversation
This should be addressed by 1c59597. Without this commit I get:
With it I get:
NOTE: @pathorn found a bug when stress testing R1; I will post an update here when it is resolved. Edit: this should be resolved by 920ecc6#diff-00753a3c1f378f8b8c60e9eb10b94c3cbbfcea74fca6e66712e5d4ae360f6741
if attn_metadata.is_profile_run and \
        attn_metadata.chunked_prefill_workspace is not None:
    # During the profile run, simulate the worst-case output size of
    # `self.kv_b_proj(kv_c_normed)` in `_compute_prefill_context`,
    # since this intermediate can be large.
    _ = torch.empty(
        (attn_metadata.chunked_prefill_workspace.shape[0],
         self.num_heads, self.qk_nope_head_dim + self.v_head_dim),
        device=k_c_normed.device,
        dtype=k_c_normed.dtype,
    )
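For context, here is a minimal sketch (not the actual vLLM code) of the projection this dummy allocation stands in for. The function and argument names are illustrative only; the shapes mirror the allocation above.

import torch

def prefill_context_projection_sketch(
        kv_c_normed: torch.Tensor,   # [workspace_tokens, kv_lora_rank]
        kv_b_proj: torch.nn.Linear,  # kv_lora_rank -> num_heads * (qk_nope_head_dim + v_head_dim)
        num_heads: int,
        qk_nope_head_dim: int,
        v_head_dim: int):
    # The output of kv_b_proj scales with the number of workspace tokens,
    # which is why the profile run reserves a tensor of the same shape.
    kv = kv_b_proj(kv_c_normed).view(
        -1, num_heads, qk_nope_head_dim + v_head_dim)
    k_nope, v = kv.split([qk_nope_head_dim, v_head_dim], dim=-1)
    return k_nope, v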
Looks great - definitely feel good about the profile_run now
🎉
vllm/engine/arg_utils.py
-    if model_config.is_multimodal_model and model_config.use_mla:
+    if model_config.is_multimodal_model or model_config.use_mla:
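The change above widens the condition from requiring both properties to requiring either one, so MLA-only models (not just multimodal MLA models) also take this branch. A hypothetical illustration of the difference, assuming only the two config attributes shown in the diff:

class _Cfg:
    def __init__(self, is_multimodal_model: bool, use_mla: bool):
        self.is_multimodal_model = is_multimodal_model
        self.use_mla = use_mla

def hits_branch_old(cfg: _Cfg) -> bool:
    # Old gate: only models that are both multimodal and MLA.
    return cfg.is_multimodal_model and cfg.use_mla

def hits_branch_new(cfg: _Cfg) -> bool:
    # New gate: any model that is multimodal, MLA, or both.
    return cfg.is_multimodal_model or cfg.use_mla

# An MLA-only model is missed by the old gate but caught by the new one.
mla_only = _Cfg(is_multimodal_model=False, use_mla=True)
assert not hits_branch_old(mla_only)
assert hits_branch_new(mla_only)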
OK, yeah, that makes sense for some of the red tests.
We need to do more benchmarking to see whether this should be on by default in V0, but it lays the groundwork for a V1 implementation. (#13111 may help performance.)
Shout-out to @pathorn for assisting with hardening this PR
Future work: account for the `self.kv_b_proj(kv_c_normed)` intermediate in the profile run.