[feat] ability to set max_num_seqs #87

Open
kalocide opened this issue Jul 30, 2024 · 1 comment

@kalocide

The memory usage of vLLM's KV cache is directly proportional to the model's maximum batch size (max_num_seqs). vLLM's default is 256, but many users need far fewer concurrent sequences; for example, someone running a personal model that serves one request at a time only needs a value of 1. Unfortunately, the default is tuned for very large-scale parallel inference, which makes it prohibitive to run models fast on anything but the largest cards. Being able to adjust this value (sketched below) would be an easy win for the performance and usefulness of this repo.

I can write up a PR for this if that works better; I think I know what needs to be done, I'm just not very familiar with RunPod serverless right now.
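For reference, vLLM already exposes this knob as the max_num_seqs engine argument, so the worker would mostly need to pass it through. Here is a minimal sketch of the vLLM side; the model name is just an illustrative placeholder, not something from this repo:

```python
from vllm import LLM, SamplingParams

# max_num_seqs caps how many sequences the scheduler will run in one batch.
# For a personal, one-request-at-a-time deployment, 1 is enough.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_num_seqs=1,
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```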

@kalocide (Author)

For context: when running an 11B model on an L40S, throughput is okay, but GPU utilization barely reaches 1-2% because CPU-side PyTorch work is the bottleneck. Throughput would be significantly higher if this value were set low enough that the KV cache fits in VRAM.
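If it helps, here is a rough, hypothetical sketch of what the worker-side change might look like, assuming the worker builds its engine arguments from environment variables; the MODEL_NAME and MAX_NUM_SEQS variable names are my own placeholders, not existing settings of this repo:

```python
import os

from vllm.engine.arg_utils import AsyncEngineArgs

# Hypothetical wiring: read the cap from an environment variable and
# forward it to vLLM's engine arguments, falling back to vLLM's default.
engine_args = AsyncEngineArgs(
    model=os.environ["MODEL_NAME"],                           # placeholder env var
    max_num_seqs=int(os.environ.get("MAX_NUM_SEQS", "256")),  # 256 is vLLM's default
)
```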
