Static KV cache status: How to use it? Does it work for all models? #33270

@oobabooga

I see that there are many PRs about StaticCache, but I couldn't find clear documentation on how to use it.

What I want

  • For Transformers not to allocate memory dynamically for the KV cache when using model.generate(), since that leads to increased memory usage (garbage collection does not happen fast or often enough) and worse performance.

  • To have the static cache used by default, always, for every model and every supported quantization backend (AutoAWQ, AutoGPTQ, AQLM, bitsandbytes, etc.).
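
To make the distinction concrete, here is a toy sketch in plain Python of the two allocation patterns. It is not Transformers code and says nothing about how the library implements its caches: `ToyDynamicCache` and `ToyStaticCache` are hypothetical names, Python lists stand in for tensors, and the `allocations` counter is only an illustrative proxy for buffer allocations.

```python
class ToyDynamicCache:
    """Grows by building a new buffer on every decoding step."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.allocations = 0  # number of times a fresh buffer was built

    def update(self, key, value):
        # `list + list` creates a brand-new list each step; the old one
        # becomes garbage that must be reclaimed later.
        self.keys = self.keys + [key]
        self.values = self.values + [value]
        self.allocations += 1


class ToyStaticCache:
    """Allocates once up front and writes each step into a fixed slot."""

    def __init__(self, max_len):
        self.keys = [None] * max_len
        self.values = [None] * max_len
        self.length = 0       # number of slots filled so far
        self.allocations = 1  # the single up-front allocation

    def update(self, key, value):
        if self.length >= len(self.keys):
            raise IndexError("static cache is full")
        # In-place write: no new buffer is created.
        self.keys[self.length] = key
        self.values[self.length] = value
        self.length += 1


# Simulate 8 decoding steps with placeholder key/value entries.
dynamic = ToyDynamicCache()
static = ToyStaticCache(max_len=8)
for step in range(8):
    dynamic.update(f"k{step}", f"v{step}")
    static.update(f"k{step}", f"v{step}")
```

Both caches end up holding the same entries, but the dynamic one built a new buffer at every step while the static one allocated once. The trade-off is that the static variant needs the maximum length known up front.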

Who can help?

Maybe @gante
