[Feature] Add async tensor parallelism using compilation pass #17882
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: cascade812 <[email protected]>
This is what I'm seeing on a 4xH200 system: vLLM main:
This PR:
It also shows a ~10% latency reduction on 4x A100 (40 GB) for an 8B LLM with async TP enabled.
Hi! Is it necessary to always set sequence_parallelism to true?
Great PR! Concise and effective. I only had a few cleanup comments.
No need. Sequence parallelism is enabled by default when async TP is enabled.
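For reference, both passes can also be enabled explicitly in the pass config. The sketch below assumes the `enable_sequence_parallelism` field name from vLLM's PassConfig, and the model path is a placeholder:

```shell
# Sketch: explicitly enabling both passes (redundant if async TP already
# implies sequence parallelism). <model> is a placeholder.
vllm serve <model> \
  -O '{"level": 3, "pass_config": {"enable_async_tp": true, "enable_sequence_parallelism": true}}'
```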
Very nice work
This PR adds torch async TP using a compilation pass.
It requires the config below to run.
If using `vllm serve`, add -O '{"level": 3, "compile_sizes": [4, 8, 16], "pass_config": {"enable_async_tp": true}}'
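Putting the flag together with a serve command, an end-to-end launch might look like the following. The model name and `--tensor-parallel-size` value are placeholders for illustration, not taken from the PR; async TP only applies when tensor parallelism spans multiple GPUs:

```shell
# Hypothetical launch: model and TP size are placeholders.
# -O passes a JSON compilation config enabling the async-TP pass.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  -O '{"level": 3, "compile_sizes": [4, 8, 16], "pass_config": {"enable_async_tp": true}}'
```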
Some benchmark results on 2 A100 GPUs:
Latency is slightly higher with async TP enabled at an input length of 2048.
Latency is almost the same with async TP enabled at an input length of 8192.
I think we can test this feature on a more demanding workload, like a 70B model across 4 GPUs.