I see that there are many PRs about StaticCache, but I couldn't find clear documentation on how to use it.
What I want
- To stop Transformers from allocating memory dynamically for the KV cache when using `model.generate()`, since dynamic allocation leads to increased memory usage (garbage collection does not happen fast/often enough) and worse performance.
- To use a static cache by default, for every model and every supported quantization backend (AutoAWQ, AutoGPTQ, AQLM, bitsandbytes, etc.).
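For reference, the closest thing I've found so far is the `cache_implementation="static"` argument to `generate()` (available in recent transformers versions). A minimal sketch of what I'm currently doing; the model name below is just a placeholder:

```python
# Sketch, assuming a recent transformers release where generate()
# accepts cache_implementation="static".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("Hello", return_tensors="pt")
# This is supposed to pre-allocate the KV cache up to the maximum length
# instead of growing it dynamically on each decoding step.
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

But it's unclear to me whether this is the intended API, and whether it works with the quantization backends listed above.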
Who can help?
Maybe @gante