Hi everyone! 👋

I'm trying to run the model `Qwen/Qwen2.5-14B-Instruct-1M` on an NVIDIA RTX A6000 (49.1 GB VRAM) using `vllm serve` with the `--dtype auto` option. However, I'm getting the following error:
```
ValueError: The model's max seq len (1010000) is larger than the maximum number of tokens
that can be stored in KV cache (73792). Try increasing gpu_memory_utilization or decreasing
max_model_len when initializing the engine.
```
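For reference, the launch command is basically just the model plus `--dtype auto`, as described above. The two knobs the error mentions map to these CLI flags (the values shown are only placeholders, not recommendations):

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct-1M --dtype auto
# Flags the error suggests tuning (placeholder values):
#   --gpu-memory-utilization 0.95
#   --max-model-len 65536
```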
From my understanding:

- The model has ~14B parameters, so in FP16 it should use ~28 GB of VRAM (please correct me if that's inaccurate).
- Running in auto mode (which defaults to FP16), the weights should take ~14 GB.
- That leaves ~36 GB, which I thought would be enough for the KV cache at a 1M context length. But apparently not? (My rough attempt at the arithmetic is sketched below.)
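Here is my rough attempt at the math, using the usual per-token KV-cache formula (2 × layers × KV heads × head_dim × bytes per element) and config values I'm assuming for the 14B model (48 layers, 8 KV heads with GQA, head_dim 128); please correct anything that's off:

```python
# Back-of-the-envelope VRAM estimate for Qwen2.5-14B-Instruct-1M on a 49.1 GB GPU.
# Model config values below are my assumptions -- please double-check against config.json.

num_layers = 48          # num_hidden_layers (assumed)
num_kv_heads = 8         # num_key_value_heads (assumed, GQA)
head_dim = 128           # hidden_size 5120 / 40 attention heads (assumed)
bytes_per_elem = 2       # FP16 / BF16
params = 14.7e9          # ~14B parameters
gpu_mem = 49.1e9         # reported VRAM, in bytes
gpu_util = 0.9           # vLLM's default gpu_memory_utilization

weights = params * bytes_per_elem                                          # ~29 GB
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem   # K and V
kv_budget = gpu_mem * gpu_util - weights                                   # left for KV cache

print(f"weights             ~ {weights / 1e9:.1f} GB")
print(f"KV cache per token  ~ {kv_per_token / 1e3:.0f} kB")
print(f"tokens that fit     ~ {kv_budget / kv_per_token:,.0f}")            # close to the 73792 in the error
print(f"KV cache for 1M ctx ~ {1_010_000 * kv_per_token / 1e9:.0f} GB")    # far more than the GPU has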
So I have two questions:

1. Is my assumption about the memory breakdown (weights vs. KV cache) correct?
2. How can we estimate the required VRAM for a given number of context tokens? Is there a formula or rule of thumb to calculate the KV cache size needed based on context length and model size?
More broadly, I'd really appreciate it if someone could explain how to estimate the total VRAM usage of `vllm serve`, including weights, KV cache, context window, etc.

Thanks in advance!