[feat] ability to set max_num_seqs #87
For context: when running an 11B model on an L40S, throughput is okay, but the GPU barely gets used (1-2%) because CPU-side PyTorch is the bottleneck. Throughput would be significantly higher if this value were lowered so that the KV cache fits in VRAM.
The memory usage of vLLM's KV cache is directly proportional to the maximum batch size (`max_num_seqs`). vLLM's default is 256, but many users don't need nearly that many concurrent sequences. For example, someone running a personal model (one request at a time) only needs a cache sized for 1. Unfortunately, the default is tuned for very large parallel inference, which makes it prohibitive to run models fast on anything but the largest cards. Being able to adjust this value would be an easy win for the performance and usefulness of this repo.
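A minimal sketch of how the worker could expose this: read a hypothetical `MAX_NUM_SEQS` environment variable (the variable name and helper are assumptions, not part of this repo) and pass it through to vLLM's `max_num_seqs` engine argument.

```python
import os

def engine_args_from_env(env=None):
    """Collect optional vLLM engine-arg overrides from environment variables.

    MAX_NUM_SEQS is a hypothetical variable name; the worker would forward
    the resulting dict into vLLM's engine args, where `max_num_seqs`
    (default 256) caps the number of sequences batched per step.
    """
    if env is None:
        env = os.environ
    overrides = {}
    raw = env.get("MAX_NUM_SEQS")
    if raw is not None:
        n = int(raw)
        if n < 1:
            raise ValueError("MAX_NUM_SEQS must be >= 1")
        overrides["max_num_seqs"] = n
    return overrides

# A single-request personal deployment only needs a batch of 1:
print(engine_args_from_env({"MAX_NUM_SEQS": "1"}))  # {'max_num_seqs': 1}
```

With nothing set, the dict stays empty and vLLM keeps its default of 256, so the change would be backward compatible.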
I can write up a PR for this if that works better; I think I know what needs to be done. I'm just not very familiar with RunPod serverless right now.