[feat] ability to set max_num_seqs #87
For context: when running an 11B model on an L40S, throughput is okay, but the GPU barely gets used (1-2%) because CPU-side PyTorch is the bottleneck. Throughput would be significantly higher if this value were lowered so that the KV cache fits in VRAM.
The memory usage of vLLM's KV cache is directly proportional to the maximum batch size (`max_num_seqs`). vLLM's default is 256, but many users don't need nearly that many concurrent sequences. For example, someone running a personal model (one request at a time) only needs a cache sized for 1. Unfortunately, the default is tuned for very large parallel inference, which makes it prohibitive to run models fast on anything but the largest cards. Being able to adjust this value would be an easy win for the performance and usefulness of this repo.
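A minimal sketch of how the worker could expose this: read a hypothetical `MAX_NUM_SEQS` environment variable (the variable name and helper are assumptions, not part of this repo) and pass it through to vLLM's `max_num_seqs` engine argument.

```python
import os

def engine_args_from_env(env=None):
    """Collect optional vLLM engine-arg overrides from environment variables.

    MAX_NUM_SEQS is a hypothetical variable name; the worker would forward
    the resulting dict into vLLM's engine args, where `max_num_seqs`
    (default 256) caps the number of sequences batched per step.
    """
    if env is None:
        env = os.environ
    overrides = {}
    raw = env.get("MAX_NUM_SEQS")
    if raw is not None:
        n = int(raw)
        if n < 1:
            raise ValueError("MAX_NUM_SEQS must be >= 1")
        overrides["max_num_seqs"] = n
    return overrides

# A single-request personal deployment only needs a batch of 1:
print(engine_args_from_env({"MAX_NUM_SEQS": "1"}))  # {'max_num_seqs': 1}
```

With nothing set, the dict stays empty and vLLM keeps its default of 256, so the change would be backward compatible.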
I can write up a PR for this if that works better; I think I know what needs to be done. I'm just not very familiar with RunPod serverless right now.