Enable Quantize KV Cache for Mistral Model #35041

@Bojun-Feng

Description

Feature request

Enable quantized KV cache for the Mistral model, as described in #30483.

Motivation

KV cache quantization has emerged as a crucial optimization, particularly in high-throughput, multi-user scenarios, where efficiency is paramount.

While Hugging Face currently leads in supporting KV cache quantization across a wide range of models (thanks to the efforts of @zucchini-nlp), the widely used Mistral model remains unsupported. This gap presents an opportunity to extend quantization support to Mistral and address a significant need in the community.

In addition to this change, I am also interested in enabling KV cache quantization for more models, such as Qwen and Phi. Understanding the detailed requirements and best practices for these types of contributions would be incredibly helpful.

Your contribution

I am interested in submitting a pull request but am uncertain about the best way to verify that the behavior aligns with expectations.

In short, I tried a naive modification: adding the `_supports_quantized_cache = True` flag to the MistralPreTrainedModel class in modeling_mistral.py (see the sketch below).
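To make that concrete, here is the change expressed as a runtime patch so the snippet is self-contained; the actual PR would simply add the attribute to the class body in src/transformers/models/mistral/modeling_mistral.py. I believe the relevant attribute is `_supports_quantized_cache`, mirroring the models that already support quantized caches, but please correct me if the name differs:

```python
# One-line change, shown here as a runtime patch so it can be tried without
# editing the installed package. The actual PR would add
# `_supports_quantized_cache = True` to the MistralPreTrainedModel class body.
from transformers.models.mistral.modeling_mistral import MistralPreTrainedModel

MistralPreTrainedModel._supports_quantized_cache = True
```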

When I tested this change informally, it seemed to work as intended: under greedy (zero-temperature) decoding it produced slightly different, but still coherent, outputs compared to the full-precision cache, which I believe is expected.

I want to ensure that any modifications I make meet the expected standards and don't inadvertently cause issues. If my informal testing is sufficient for this case, I am happy to proceed with submitting the PR. If more rigorous validation is needed, any guidance or resources would be appreciated.
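For reference, my informal check was along the lines of the sketch below: greedy decoding with the default cache and with a quantized cache, then comparing the two outputs by eye. The checkpoint, prompt, and cache settings are just what I happened to use, and the quanto backend is assumed to be installed; I am happy to replace this with whatever the maintainers consider a proper test.

```python
# Informal sanity check: run greedy decoding with the default cache and with a
# quantized cache, then compare the outputs. Checkpoint, prompt, and cache
# settings are illustrative; the quanto backend must be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any Mistral checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain why KV cache quantization helps with long-context generation."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding ("zero temperature") so any divergence comes from the cache,
# not from sampling noise.
baseline_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
quantized_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)

print("=== default (fp16) cache ===")
print(tokenizer.decode(baseline_ids[0], skip_special_tokens=True))
print("=== quantized cache ===")
print(tokenizer.decode(quantized_ids[0], skip_special_tokens=True))
```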
