[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This looks like a great improvement to me! Cc @afeldman-nm
Please fix the error in distributed tests.
08f284e to 9205c72
Thanks for the PR! I think this will be a valuable fix for encoder/decoder models. Just had one or two pieces of feedback.
A general observation: the original `QKVParallelLinear` has pretty limited test coverage; the only unit tests I could find were LoRA-oriented tests in `tests/lora`. So in #7448 I did not bother adding additional unit tests for `QCrossKVParallelLinear`, and I see that that is also the case in this PR.
That said, I observe that `QKVParallelLinear` has pretty complicated weight-loading logic, with a lot of special cases for different weight representation formats, e.g. GGUF, "bitsandbytes_4bit", etc.:
vllm/vllm/model_executor/layers/linear.py
Line 824 in 4c82229
def weight_loader(self,
In fact, there appear to be two different weight-loading methods, `weight_loader()` and `weight_loader_v2()` (I actually don't know what the difference between these methods is):
vllm/vllm/model_executor/layers/linear.py
Line 798 in 4c82229
def weight_loader_v2(self,
So I am wondering, how many of these different weight-loading scenarios are supported by `QKVCrossParallelLinear`? All of them, or just a minimal set which covers typical cases? Personally I think it is fine to cover only the most commonly used configurations in this PR.
Thanks for reviewing!
Given I am only re-using the
I've given some thought to this but I couldn't find a way to add meaningful tests, other than writing a "test model loading" function, which is already covered by all other correctness tests. Let me know if I've overlooked something.
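For readers following the thread, the property at stake can be illustrated with a toy check: projecting with a q-only weight and a kv-only weight should give the same values as running the fused QKV weight twice and discarding the unused slices. The sketch below is plain Python with made-up sizes and random weights; the helper names (`matmul`, `rand_matrix`, `all_close`) are hypothetical, it stores weights as (in, out) for readability, and it does not touch vLLM, tensor parallelism, or quantized formats.

```python
import random


def matmul(x, w):
    """Multiply x (tokens x d_in) by w (d_in x d_out); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def all_close(a, b, tol=1e-12):
    """Element-wise comparison of two equally shaped matrices."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
hidden, q_size, kv_size = 4, 4, 2  # hypothetical toy sizes

# Fused weight whose output columns are laid out as [q | k | v].
w_qkv = rand_matrix(hidden, q_size + 2 * kv_size, rng)
# The split layers hold only the columns each call actually needs.
w_q = [row[:q_size] for row in w_qkv]
w_kv = [row[q_size:] for row in w_qkv]

decoder_states = rand_matrix(3, hidden, rng)  # 3 decoder tokens
encoder_states = rand_matrix(5, hidden, rng)  # 5 encoder tokens

# Fused path: two full QKV projections, each discarding part of its output.
q_fused = [row[:q_size] for row in matmul(decoder_states, w_qkv)]
kv_fused = [row[q_size:] for row in matmul(encoder_states, w_qkv)]

# Split path: one q-only projection and one kv-only projection.
q_split = matmul(decoder_states, w_q)
kv_split = matmul(encoder_states, w_kv)

assert all_close(q_fused, q_split)
assert all_close(kv_fused, kv_split)
```

A check like this only exercises the arithmetic; it says nothing about the weight-loading paths discussed above, which is where the special cases live.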
Signed-off-by: NickLucche <[email protected]>
12d448a to 75ce6ac
TL;DR: Basically another take on #7448, based on the work on the Whisper model, with sugar on top to provide a drop-in replacement module.
Addressing TODOs https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/bart.py#L352 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L750.
The current cross-attention QKV projection is sub-optimal: we're wasting cycles on bigger-than-necessary matrices, which matters most in the compute-bound stage. That is because `QKVParallelLinear` layers are being used to compute only the `q` and `kv` projections, separately, in two sequential calls.
I propose adopting the solution we make use of here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/whisper.py#L173, where q/kv are split into a `ColumnParallelLinear` and a `QKVParallelLinear` layer, respectively, instantiating and sharding only the matrices we actually make use of. Support for tensor parallelism should be unaffected.
I also provide a drop-in replacement util layer, `QKVCrossParallelLinear`, to use in place of `QKVParallelLinear` layers so that the loading code remains the same, especially the usual `stacked_params_mapping`.
==> Let me know what you think about the util Module interface/API; otherwise I can just substitute its optimized code inline.
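To make the "bigger-than-necessary matrices" point concrete, here is a rough back-of-the-envelope FLOP count for the two approaches. It is a sketch under stated assumptions, not a measurement: the function names and the sizes are hypothetical, a matmul is counted as 2 * tokens * d_in * d_out, and bias, tensor-parallel sharding, and memory traffic are ignored.

```python
def matmul_flops(tokens, d_in, d_out):
    """Approximate FLOPs for a (tokens x d_in) @ (d_in x d_out) product."""
    return 2 * tokens * d_in * d_out


def cross_attn_projection_flops(dec_tokens, enc_tokens, hidden, q_size, kv_size):
    """Return (fused, split) FLOP counts for the cross-attention projections.

    fused: the full QKV projection is run once on the decoder states (only q
           is kept) and once on the encoder states (only k and v are kept).
    split: a q-only projection on the decoder states plus a kv-only
           projection on the encoder states.
    """
    fused_out = q_size + 2 * kv_size
    fused = (matmul_flops(dec_tokens, hidden, fused_out)
             + matmul_flops(enc_tokens, hidden, fused_out))
    split = (matmul_flops(dec_tokens, hidden, q_size)
             + matmul_flops(enc_tokens, hidden, 2 * kv_size))
    return fused, split


# Hypothetical sizes chosen for illustration only.
fused, split = cross_attn_projection_flops(
    dec_tokens=1, enc_tokens=1024, hidden=1024, q_size=1024, kv_size=1024)
print(f"fused: {fused} FLOPs, split: {split} FLOPs, ratio: {split / fused:.3f}")
```

With equal q and kv sizes, the decoder-side call drops to one third of the fused cost and the encoder-side call to two thirds, so the overall saving depends on how the token counts are distributed between the two sides.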
Early benchmarking results (single L4 24 GB, running `facebook/bart-large-cnn`):
PRE-PR
b197a5cc
POST-PR
TODO:
`QKVCrossParallelLinear` both in code and docs in "how to add model"