[Core] Optimizing cross-attention QKVParallelLinear computation #12325

Open
wants to merge 9 commits into
base: main

NickLucche
Contributor

@NickLucche NickLucche commented Jan 22, 2025

TL;DR: Another take on #7448, based on the work on the Whisper model, with some sugar on top in the form of a drop-in replacement module.

Addressing TODOs https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/bart.py#L352 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L750.

The current cross-attention QKV projection is sub-optimal: we waste cycles on bigger-than-necessary matrices, which matters most in the compute-bound stage. This is because QKVParallelLinear layers are used to compute only the q projection and only the kv projection, in two separate sequential calls.

I propose adopting the solution we already use here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/whisper.py#L173, where q and kv are split into a ColumnParallelLinear and a QKVParallelLinear layer, respectively, so that only the matrices we actually use are instantiated and sharded. Tensor-parallelism support should be unaffected.
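To make the wasted work concrete, here is a rough back-of-the-envelope sketch that counts multiply-accumulates for the two approaches. It assumes plain multi-head attention where q, k and v each have the same width as the hidden size, and it ignores biases and tensor-parallel sharding. The function names and the example sizes are made up for illustration and are not taken from the PR.

```python
def fused_twice_macs(dec_tokens: int, enc_tokens: int, hidden: int) -> int:
    """MACs when a full [hidden, 3*hidden] QKV matrix is applied twice:
    once to the decoder states (only q is kept) and once to the
    encoder states (only k and v are kept)."""
    qkv_width = 3 * hidden
    return (dec_tokens + enc_tokens) * hidden * qkv_width


def split_macs(dec_tokens: int, enc_tokens: int, hidden: int) -> int:
    """MACs when q uses a [hidden, hidden] matrix on the decoder states
    and kv uses a [hidden, 2*hidden] matrix on the encoder states."""
    q_macs = dec_tokens * hidden * hidden
    kv_macs = enc_tokens * hidden * (2 * hidden)
    return q_macs + kv_macs


# Hypothetical sizes: one decoder token attending over 64 encoder tokens.
dec_tokens, enc_tokens, hidden = 1, 64, 1024
fused = fused_twice_macs(dec_tokens, enc_tokens, hidden)
split = split_macs(dec_tokens, enc_tokens, hidden)
print(f"fused twice: {fused} MACs")
print(f"split q/kv:  {split} MACs")
print(f"saved:       {1 - split / fused:.1%}")
```

Under these assumptions the split layout never does more projection work than the fused one, and the saving depends on the ratio of decoder to encoder tokens. How much of that shows up end to end depends on how large a share of the runtime the projections represent.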

I also provide a drop-in replacement utility layer, QKVCrossParallelLinear, to use in place of QKVParallelLinear layers so that the loading code stays the same, in particular the usual stacked_params_mapping.
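The idea of keeping the loading code unchanged can be sketched with a toy wrapper that routes each checkpoint shard to the right sub-layer. This is not the PR's implementation: `ToyShardStore`, `ToyCrossQKV` and `load_weights` are hypothetical stand-ins, the checkpoint keys are invented, and the mapping is assumed to consist of the usual (param_name, shard_name, shard_id) triples.

```python
class ToyShardStore:
    """Hypothetical stand-in for a parallel linear layer.
    It only records which shards were loaded into it."""

    def __init__(self, allowed_shards):
        self.allowed_shards = set(allowed_shards)
        self.shards = {}

    def weight_loader(self, loaded_weight, shard_id):
        if shard_id not in self.allowed_shards:
            raise KeyError(f"unexpected shard {shard_id!r}")
        self.shards[shard_id] = loaded_weight


class ToyCrossQKV:
    """Hypothetical wrapper: exposes one weight_loader like a fused QKV
    layer, but keeps q and kv in two separate sub-layers."""

    def __init__(self):
        self.q_proj = ToyShardStore({"q"})
        self.kv_proj = ToyShardStore({"k", "v"})

    def weight_loader(self, loaded_weight, shard_id):
        # Route by shard id so callers never see the split.
        target = self.q_proj if shard_id == "q" else self.kv_proj
        target.weight_loader(loaded_weight, shard_id)


# Same shape of mapping a model would use for a fused QKV layer.
stacked_params_mapping = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]


def load_weights(layer, checkpoint):
    """Loading loop that is unaware of the q/kv split inside `layer`."""
    for name, weight in checkpoint.items():
        for _param_name, shard_name, shard_id in stacked_params_mapping:
            if shard_name in name:
                layer.weight_loader(weight, shard_id)
                break


# Invented checkpoint keys; the lists stand in for weight tensors.
checkpoint = {
    "encoder_attn.q_proj.weight": [1.0],
    "encoder_attn.k_proj.weight": [2.0],
    "encoder_attn.v_proj.weight": [3.0],
}
layer = ToyCrossQKV()
load_weights(layer, checkpoint)
print(layer.q_proj.shards, layer.kv_proj.shards)
```

The point of the design is that the dispatch lives inside the wrapper, so a model's existing loading loop and mapping would not need to change when the fused layer is swapped out.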

Let me know what you think about the utility module's interface/API; otherwise I can just inline its optimized code.

Early benchmarking results (single L4 24 GB, running facebook/bart-large-cnn):

PRE-PR b197a5cc

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.64      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.29      
Output token throughput (tok/s):         102.67    
Total Token throughput (tok/s):          319.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          73.88     
Median TTFT (ms):                        69.12     
P99 TTFT (ms):                           129.42    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.50     
Median TPOT (ms):                        9.94      
P99 TPOT (ms):                           14.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.93      
Median ITL (ms):                         8.70      
P99 ITL (ms):                            10.50     
==================================================

POST-PR

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.62      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.31      
Output token throughput (tok/s):         103.29    
Total Token throughput (tok/s):          321.46    
---------------Time to First Token----------------
Mean TTFT (ms):                          74.35     
Median TTFT (ms):                        69.84     
P99 TTFT (ms):                           129.07    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.37     
Median TPOT (ms):                        9.81      
P99 TPOT (ms):                           14.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.84      
Median ITL (ms):                         8.75      
P99 ITL (ms):                            9.85      
==================================================
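To read the two tables side by side, the following small sketch computes the relative change for a handful of the reported metrics. The numbers are copied from the PRE-PR and POST-PR results above; the selection of metrics and the helper name `relative_change_pct` are arbitrary choices for illustration. With only 12 requests per run, small differences may well be within run-to-run noise.

```python
# (pre, post) pairs copied from the two benchmark tables.
results = {
    "Request throughput (req/s)": (3.29, 3.31),
    "Output token throughput (tok/s)": (102.67, 103.29),
    "Mean TTFT (ms)": (73.88, 74.35),
    "Mean TPOT (ms)": (10.50, 10.37),
    "Mean ITL (ms)": (8.93, 8.84),
    "P99 ITL (ms)": (10.50, 9.85),
}


def relative_change_pct(pre: float, post: float) -> float:
    """Relative change from pre to post, in percent."""
    return (post - pre) / pre * 100.0


changes = {
    name: relative_change_pct(pre, post)
    for name, (pre, post) in results.items()
}

for name, pct in changes.items():
    # Positive is better for throughput, negative is better for latency.
    print(f"{name:<34} {pct:+.2f}%")
```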

TODO:


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 4, 2025
Member

@mgoin mgoin left a comment


This looks like a great improvement to me! Cc @afeldman-nm

@DarkLight1337
Member

Please fix the error in distributed tests.

@NickLucche NickLucche force-pushed the encdec-separate-crossattn branch from 08f284e to 9205c72 Compare February 17, 2025 16:52
Contributor

@afeldman-nm afeldman-nm left a comment


Thanks for the PR! I think this will be a valuable fix for encoder/decoder models. Just had one or two pieces of feedback.

A general observation: the original QKVParallelLinear has pretty limited test coverage; the only unit tests I could find were LoRA-oriented tests in tests/lora. So in #7448 I did not bother adding additional unit tests for QCrossKVParallelLinear, and I see that that is also the case in this PR.

That said, I observe that QKVParallelLinear has pretty complicated weight-loading logic, with a lot of special cases for different weight representation formats, e.g. GGUF, "bitsandbytes_4bit", etc.:

def weight_loader(self,

In fact, there appear to be two different weight loading methods, weight_loader() and weight_loader_v2() (I actually don't know what the difference between these methods is):

def weight_loader_v2(self,

So I am wondering, how many of these different weight-loading scenarios are supported by QKVCrossParallelLinear? All of them, or just a minimal set which covers typical cases? Personally I think it is fine to cover only the most commonly-used configurations in this PR.

@NickLucche
Contributor Author

NickLucche commented Feb 19, 2025

Thanks for reviewing!

how many of these different weight-loading scenarios are supported by QKVCrossParallelLinear?

Given I am only re-using the QKVParallel and ColumnParallel pre-instantiated weight loaders, I would expect it to work with any format already covered by the two layers.
The weight_loader v1/v2 creation happens here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/linear.py#L322, conditioned on the quant_config, which I forward to the two layers here https://github.com/vllm-project/vllm/pull/12325/files#diff-b0ba1095e9881e5c87e33dfd20958d1e1ceafe8a4433aa692f468e61be130b21R676.

I did not bother adding additional unit tests for QCrossKVParallelLinear, and I see that that is also the case in this PR.

I've given some thought to this but I couldn't find a way to add meaningful tests, other than writing a "test model loading" function, which is already covered by all other correctness tests. Let me know if I've overlooked something.
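For illustration only, one property a split layer has to preserve is numerical equivalence with the fused projection. The pure-Python sketch below checks that slicing the outputs of a fused [hidden, 3*hidden] matrix gives the same q and kv as multiplying by the pre-sliced matrices. It uses toy sizes and hypothetical helper names, involves no vLLM code, and assumes the fused weight is laid out as q, then k, then v along the output dimension.

```python
import random


def matmul(x, w):
    """x: [tokens][in_features], w: [in_features][out_features]."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in x]


def slice_cols(m, start, stop):
    """Keep columns start..stop-1 of a row-major matrix."""
    return [row[start:stop] for row in m]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]


def allclose(a, b, tol=1e-9):
    return all(abs(x - y) <= tol
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
hidden = 4
w_qkv = random_matrix(hidden, 3 * hidden, rng)   # fused q|k|v weight
decoder_states = random_matrix(2, hidden, rng)
encoder_states = random_matrix(5, hidden, rng)

# Reference: apply the full fused matrix twice, keep only what is needed.
q_ref = slice_cols(matmul(decoder_states, w_qkv), 0, hidden)
kv_ref = slice_cols(matmul(encoder_states, w_qkv), hidden, 3 * hidden)

# Split: apply only the q columns and only the k|v columns.
q_split = matmul(decoder_states, slice_cols(w_qkv, 0, hidden))
kv_split = matmul(encoder_states, slice_cols(w_qkv, hidden, 3 * hidden))

print(allclose(q_ref, q_split), allclose(kv_ref, kv_split))
```

A check like this exercises only the arithmetic, not weight loading or sharding, so it would complement rather than replace the model-level correctness tests mentioned above.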

NickLucche and others added 9 commits February 19, 2025 10:04
Signed-off-by: NickLucche <[email protected]>
@NickLucche NickLucche force-pushed the encdec-separate-crossattn branch from 12d448a to 75ce6ac Compare February 19, 2025 10:04
@NickLucche NickLucche requested a review from mgoin February 19, 2025 13:43
Labels
ready ONLY add when PR is ready to merge/full CI is needed
4 participants