I see that there are many PRs about StaticCache, but I couldn't find clear documentation on how to use it.
What I want
- To stop Transformers from allocating memory dynamically for the KV cache when using `model.generate()`, since dynamic allocation leads to increased memory usage (garbage collection does not happen fast/often enough) and worse performance.
- To use a static cache by default, for every model and every supported quantization backend (AutoAWQ, AutoGPTQ, AQLM, bitsandbytes, etc.).
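For reference, the closest thing I've found so far is the `cache_implementation="static"` argument to `generate()` (available in recent transformers versions). A minimal sketch of what I'm currently doing; the model name below is just a placeholder:

```python
# Sketch, assuming a recent transformers release where generate()
# accepts cache_implementation="static".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("Hello", return_tensors="pt")
# This is supposed to pre-allocate the KV cache up to the maximum length
# instead of growing it dynamically on each decoding step.
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

But it's unclear to me whether this is the intended API, and whether it works with the quantization backends listed above.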
Who can help?
Maybe @gante