[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

NickLucche · 2025-01-22T18:01:06Z

TL;DR: Basically another take on #7448, based on the work done for the Whisper model, with some sugar on top to provide a drop-in replacement module.

Addressing TODOs https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/bart.py#L352 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L750.

The current cross-attention QKV projection is sub-optimal: we waste cycles on bigger-than-necessary matrices, which matters most in the compute-bound stage. This is because QKVParallelLinear layers are used to compute only the q and the kv projections, separately, in two sequential calls.
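To make the cost concrete, here is a rough FLOP count. It is a minimal sketch that assumes each of the two sequential calls multiplies its input by the full fused weight of width q_size + 2 * kv_size; the function names and the example sizes are illustrative and not taken from the PR.

```python
def linear_flops(num_tokens: int, in_features: int, out_features: int) -> int:
    """FLOPs of a dense projection: one multiply and one add per weight, per token."""
    return 2 * num_tokens * in_features * out_features


def cross_attn_projection_flops(dec_tokens: int, enc_tokens: int, hidden: int,
                                q_size: int, kv_size: int) -> tuple[int, int]:
    """Return (fused, split) FLOPs for the q and kv projections of one layer."""
    fused_out = q_size + 2 * kv_size
    # Assumed current behaviour: both calls pay for the full fused output width.
    fused = (linear_flops(dec_tokens, hidden, fused_out)
             + linear_flops(enc_tokens, hidden, fused_out))
    # Split behaviour: q only on decoder tokens, k and v only on encoder tokens.
    split = (linear_flops(dec_tokens, hidden, q_size)
             + linear_flops(enc_tokens, hidden, 2 * kv_size))
    return fused, split


if __name__ == "__main__":
    # Illustrative sizes only (hidden = q_size = kv_size = 1024).
    fused, split = cross_attn_projection_flops(
        dec_tokens=8, enc_tokens=64, hidden=1024, q_size=1024, kv_size=1024)
    print(f"fused: {fused} FLOPs, split: {split} FLOPs, "
          f"ratio: {split / fused:.2f}")
```

Under these assumptions the saving depends on the ratio of decoder to encoder tokens, since the split path skips the k/v columns for decoder tokens and the q columns for encoder tokens.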

I propose adopting the solution already in use here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/whisper.py#L173, where q and kv are split into a ColumnParallelLinear and a QKVParallelLinear layer, respectively, so that we instantiate and shard only the matrices we actually use. Tensor parallelism support should be unaffected.
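The equivalence behind this split can be checked with a pure-Python toy. This is only a sketch: it uses no vLLM layers, ignores tensor parallelism, and assumes a fused weight whose columns are laid out as [q | k | v]; all names are hypothetical.

```python
import random


def matmul(x, w):
    """Multiply x [tokens][in_features] by w [in_features][out_features]."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]


def random_matrix(rows, cols, rng):
    """Small integer matrix, so that the comparison below is exact."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden, q_size, kv_size = 4, 4, 4

# Fused weight with columns laid out as [q | k | v] (assumed layout).
w_qkv = random_matrix(hidden, q_size + 2 * kv_size, rng)
# Split weights hold only the columns each input actually needs.
w_q = [row[:q_size] for row in w_qkv]
w_kv = [row[q_size:] for row in w_qkv]

decoder_states = random_matrix(2, hidden, rng)  # queries come from the decoder
encoder_states = random_matrix(5, hidden, rng)  # keys/values come from the encoder

# Fused path: compute every column, then discard the unused ones.
q_fused = [row[:q_size] for row in matmul(decoder_states, w_qkv)]
kv_fused = [row[q_size:] for row in matmul(encoder_states, w_qkv)]

# Split path: compute only what is used.
q_split = matmul(decoder_states, w_q)
kv_split = matmul(encoder_states, w_kv)

assert q_fused == q_split and kv_fused == kv_split
```

Both paths produce the same q and kv values; the split path simply never materializes the columns that would be thrown away.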

I also provide a drop-in replacement utility layer, QKVCrossParallelLinear, to use in place of QKVParallelLinear layers so that the loading code remains the same, in particular the usual stacked_params_mapping.
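As a rough illustration of what "drop-in" means for the loading code, the toy below routes stacked shards to two sub-layers through a single weight_loader entry point. The classes are hypothetical stand-ins and not the PR's implementation; only the (param_name, shard_name, shard_id) shape of the mapping is modeled on the usual stacked_params_mapping.

```python
class ToyProjection:
    """Hypothetical stand-in for a parallel linear layer; it only records shards."""

    def __init__(self, shard_ids):
        self.shards = {shard_id: None for shard_id in shard_ids}

    def weight_loader(self, loaded_weight, shard_id):
        if shard_id not in self.shards:
            raise KeyError(f"unexpected shard id {shard_id!r}")
        self.shards[shard_id] = loaded_weight


class ToyQKVCrossLinear:
    """Hypothetical wrapper: one loader entry point, two underlying layers."""

    def __init__(self):
        self.q_proj = ToyProjection(["q"])        # query-only projection
        self.kv_proj = ToyProjection(["k", "v"])  # key/value-only projection

    def weight_loader(self, loaded_weight, shard_id):
        # Dispatch to the sub-layer that owns this shard.
        target = self.q_proj if shard_id == "q" else self.kv_proj
        target.weight_loader(loaded_weight, shard_id)


# (param_name, shard_name, shard_id), same shape as a stacked_params_mapping.
STACKED_PARAMS_MAPPING = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]


def load_weights(layer, checkpoint):
    """Loading loop that is unaware of the q/kv split inside the layer."""
    for name, tensor in checkpoint.items():
        for _param_name, shard_name, shard_id in STACKED_PARAMS_MAPPING:
            if shard_name in name:
                layer.weight_loader(tensor, shard_id)
                break
        else:
            raise KeyError(f"no stacked mapping for {name!r}")


layer = ToyQKVCrossLinear()
load_weights(layer, {
    "encoder_attn.q_proj.weight": "Wq",
    "encoder_attn.k_proj.weight": "Wk",
    "encoder_attn.v_proj.weight": "Wv",
})
```

The point of the sketch is that the loop in load_weights does not change when the fused layer is swapped for the wrapper; the split is hidden behind weight_loader.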

==> Let me know what you think about the util module's interface/API; otherwise I can just inline its optimized code.

Early benchmarking results (single L4 24GB, running facebook/bart-large-cnn):

PRE-PR b197a5cc

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.64      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.29      
Output token throughput (tok/s):         102.67    
Total Token throughput (tok/s):          319.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          73.88     
Median TTFT (ms):                        69.12     
P99 TTFT (ms):                           129.42    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.50     
Median TPOT (ms):                        9.94      
P99 TPOT (ms):                           14.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.93      
Median ITL (ms):                         8.70      
P99 ITL (ms):                            10.50     
==================================================

POST-PR

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.62      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.31      
Output token throughput (tok/s):         103.29    
Total Token throughput (tok/s):          321.46    
---------------Time to First Token----------------
Mean TTFT (ms):                          74.35     
Median TTFT (ms):                        69.84     
P99 TTFT (ms):                           129.07    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.37     
Median TPOT (ms):                        9.81      
P99 TPOT (ms):                           14.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.84      
Median ITL (ms):                         8.75      
P99 ITL (ms):                            9.85      
==================================================
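For a quick side-by-side reading of the two runs, a small helper can express the latency deltas as percentages. The values are copied from the tables above; the helper name is made up for this sketch.

```python
# Values copied from the two benchmark tables above (lower is better).
pre = {
    "Mean TTFT (ms)": 73.88,
    "Mean TPOT (ms)": 10.50,
    "Mean ITL (ms)": 8.93,
    "P99 ITL (ms)": 10.50,
}
post = {
    "Mean TTFT (ms)": 74.35,
    "Mean TPOT (ms)": 10.37,
    "Mean ITL (ms)": 8.84,
    "P99 ITL (ms)": 9.85,
}


def percent_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the post-PR run was faster."""
    return (after - before) / before * 100.0


changes = {name: percent_change(pre[name], post[name]) for name in pre}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```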

TODO:

Document QKVCrossParallelLinear both in code and in the docs under "how to add model"
Replace in other encoder-decoder models ([Model] Add T5 model (2/2) #11901?)

github-actions · 2025-01-22T18:01:17Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mgoin

This looks like a great improvement to me! Cc @afeldman-nm

DarkLight1337 · 2025-02-08T15:15:35Z

Please fix the error in distributed tests.

afeldman-nm

Thanks for the PR! I think this will be a valuable fix for encoder/decoder models. Just had one or two pieces of feedback.

A general observation - the original QKVParallelLinear has pretty limited test coverage; the only unit tests I could find were LoRA-oriented tests in tests/lora. So in #7448 I did not bother adding additional unit tests for QCrossKVParallelLinear, and I see that that is also the case in this PR.

That said, I observe that QKVParallelLinear has pretty complicated weight loading logic, with a lot of special cases for different weight representation formats, e.g. GGUF, "bitsandbytes_4bit", etc.:

vllm/vllm/model_executor/layers/linear.py

Line 824 in 4c82229

def weight_loader(self,

In fact, there appear to be two different weight loading methods, weight_loader() and weight_loader_v2() (I actually don't know what the difference between these methods is):

vllm/vllm/model_executor/layers/linear.py

Line 798 in 4c82229

def weight_loader_v2(self,

So I am wondering, how many of these different weight-loading scenarios are supported by QKVCrossParallelLinear? All of them, or just a minimal set which covers typical cases? Personally I think it is fine to cover only the most commonly-used configurations in this PR.

vllm/model_executor/models/bart.py

vllm/model_executor/models/utils.py

NickLucche · 2025-02-19T09:50:14Z

Thanks for reviewing!

> how many of these different weight-loading scenarios are supported by QKVCrossParallelLinear?

Given that I am only re-using the pre-instantiated weight loaders of the QKVParallelLinear and ColumnParallelLinear layers, I would expect it to work with any format already covered by those two layers.
The weight_loader v1/v2 creation happens here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/linear.py#L322, conditioned on the quant_config, which I forward to the two layers here https://github.com/vllm-project/vllm/pull/12325/files#diff-b0ba1095e9881e5c87e33dfd20958d1e1ceafe8a4433aa692f468e61be130b21R676.

> I did not bother adding additional unit tests for QCrossKVParallelLinear, and I see that that is also the case in this PR.

I've given some thought to this but I couldn't find a way to add meaningful tests, other than writing a "test model loading" function, which is already covered by all other correctness tests. Let me know if I've overlooked something.

Signed-off-by: NickLucche <[email protected]>

mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Feb 4, 2025

mgoin approved these changes Feb 4, 2025

NickLucche force-pushed the encdec-separate-crossattn branch from 08f284e to 9205c72 Compare February 17, 2025 16:52

afeldman-nm suggested changes Feb 18, 2025

vllm/model_executor/models/bart.py (resolved)

vllm/model_executor/models/utils.py (outdated, resolved)

NickLucche and others added 9 commits February 19, 2025 10:04

first draft

2778a1c

Signed-off-by: NickLucche <[email protected]>

cleanup

872b7fa

Signed-off-by: NickLucche <[email protected]>

submodules in dict to avoid param registration

f57620c

Signed-off-by: NickLucche <[email protected]>

mllama test

f5e661a

Signed-off-by: NickLucche <[email protected]>

format

3f2fb99

Signed-off-by: NickLucche <[email protected]>

clean up comments

edf9f5c

Signed-off-by: NickLucche <[email protected]>

fix distributed use

75784b7

Signed-off-by: NickLucche <[email protected]>

format

8393b98

Signed-off-by: NickLucche <[email protected]>

address review

75ce6ac

Signed-off-by: NickLucche <[email protected]>

NickLucche force-pushed the encdec-separate-crossattn branch from 12d448a to 75ce6ac Compare February 19, 2025 10:04

NickLucche requested a review from mgoin February 19, 2025 13:43
