Enable Quantize KV Cache for Mistral Model #35041

@Bojun-Feng

Description

Feature request

Enable quantized KV cache for the Mistral model, as described in #30483.

Motivation

KV cache quantization has emerged as a crucial optimization, particularly in high-throughput, multi-user scenarios, where efficiency is paramount.

While Hugging Face currently leads in supporting KV cache quantization across a wide range of models (thanks to the efforts of @zucchini-nlp), the widely used Mistral model remains unsupported. This gap presents an opportunity to extend quantization support to Mistral and address a significant need in the community.

In addition to this change, I am also interested in enabling KV cache quantization for more models, such as Qwen and Phi. Understanding the detailed requirements and best practices for these types of contributions would be incredibly helpful.

Your contribution

I am interested in submitting a pull request but am uncertain about the best way to verify that the behavior aligns with expectations.

In short, I tried a naive modification: adding the `_supports_quantized_cache = True` flag to the MistralPreTrainedModel class in modeling_mistral.py (see the sketch below).
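To make that concrete, here is the change expressed as a runtime patch so the snippet is self-contained; the actual PR would simply add the attribute to the class body in src/transformers/models/mistral/modeling_mistral.py. I believe the relevant attribute is `_supports_quantized_cache`, mirroring the models that already support quantized caches, but please correct me if the name differs:

```python
# One-line change, shown here as a runtime patch so it can be tried without
# editing the installed package. The actual PR would add
# `_supports_quantized_cache = True` to the MistralPreTrainedModel class body.
from transformers.models.mistral.modeling_mistral import MistralPreTrainedModel

MistralPreTrainedModel._supports_quantized_cache = True
```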

When I tested this change informally, it seemed to work as intended: under greedy (zero-temperature) decoding it produced slightly different, but still coherent, outputs compared to the full-precision cache, which I believe is expected.

I want to ensure that any modifications I make meet the expected standards and don't inadvertently cause issues. If my informal testing is sufficient for this case, I am happy to proceed with submitting the PR. If more rigorous validation is needed, any guidance or resources would be appreciated.
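For reference, my informal check was along the lines of the sketch below: greedy decoding with the default cache and with a quantized cache, then comparing the two outputs by eye. The checkpoint, prompt, and cache settings are just what I happened to use, and the quanto backend is assumed to be installed; I am happy to replace this with whatever the maintainers consider a proper test.

```python
# Informal sanity check: run greedy decoding with the default cache and with a
# quantized cache, then compare the outputs. Checkpoint, prompt, and cache
# settings are illustrative; the quanto backend must be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any Mistral checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain why KV cache quantization helps with long-context generation."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding ("zero temperature") so any divergence comes from the cache,
# not from sampling noise.
baseline_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
quantized_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)

print("=== default (fp16) cache ===")
print(tokenizer.decode(baseline_ids[0], skip_special_tokens=True))
print("=== quantized cache ===")
print(tokenizer.decode(quantized_ids[0], skip_special_tokens=True))
```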
