Hi everyone! 👋

I'm trying to run the model `Qwen/Qwen2.5-14B-Instruct-1M` on an NVIDIA RTX A6000 (49.1 GB VRAM) using `vllm serve` with the `--dtype auto` option. However, I'm getting the following error:
```
ValueError: The model's max seq len (1010000) is larger than the maximum number of tokens
that can be stored in KV cache (73792). Try increasing gpu_memory_utilization or decreasing
max_model_len when initializing the engine.
```
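For reference, the launch command is basically just the model plus `--dtype auto`, as described above. The two knobs the error mentions map to these CLI flags (the values shown are only placeholders, not recommendations):

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct-1M --dtype auto
# Flags the error suggests tuning (placeholder values):
#   --gpu-memory-utilization 0.95
#   --max-model-len 65536
```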
From my understanding:

- The model has ~14B parameters, so in FP16 it should use ~28 GB of VRAM (please correct me if that's inaccurate).
- Running in auto mode (which defaults to FP16), the weights should take ~14 GB.
- That leaves ~36 GB, which I thought would be enough for the KV cache at a 1M context length. But apparently not? (My rough attempt at the arithmetic is sketched below.)
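Here is my rough attempt at the math, using the usual per-token KV-cache formula (2 × layers × KV heads × head_dim × bytes per element) and config values I'm assuming for the 14B model (48 layers, 8 KV heads with GQA, head_dim 128); please correct anything that's off:

```python
# Back-of-the-envelope VRAM estimate for Qwen2.5-14B-Instruct-1M on a 49.1 GB GPU.
# Model config values below are my assumptions -- please double-check against config.json.

num_layers = 48          # num_hidden_layers (assumed)
num_kv_heads = 8         # num_key_value_heads (assumed, GQA)
head_dim = 128           # hidden_size 5120 / 40 attention heads (assumed)
bytes_per_elem = 2       # FP16 / BF16
params = 14.7e9          # ~14B parameters
gpu_mem = 49.1e9         # reported VRAM, in bytes
gpu_util = 0.9           # vLLM's default gpu_memory_utilization

weights = params * bytes_per_elem                                          # ~29 GB
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem   # K and V
kv_budget = gpu_mem * gpu_util - weights                                   # left for KV cache

print(f"weights             ~ {weights / 1e9:.1f} GB")
print(f"KV cache per token  ~ {kv_per_token / 1e3:.0f} kB")
print(f"tokens that fit     ~ {kv_budget / kv_per_token:,.0f}")            # close to the 73792 in the error
print(f"KV cache for 1M ctx ~ {1_010_000 * kv_per_token / 1e9:.0f} GB")    # far more than the GPU has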
So I have two questions:

1. Is my assumption about the memory breakdown (weights vs. KV cache) correct?
2. How can we estimate the required VRAM for a given number of context tokens? Is there a formula or rule of thumb to calculate the KV cache size needed based on context length and model size?
More broadly, I'd really appreciate it if someone could explain how to estimate the total VRAM usage of `vllm serve`, including weights, KV cache, context window, etc.

Thanks in advance!