How is Llava quantized? #621

Open · Abhranta opened this issue Sep 22, 2024 · 3 comments

Comments
@Abhranta

In AutoAWQ, do we only quantize the LLM part of LLaVA, or do we also quantize the ViT? Can we add support for quantizing vision models like ViT or SigLIP?
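
For context, here is a rough sketch of the standard AutoAWQ flow I have in mind (the model path and output directory are just placeholders):

# Rough sketch of the usual AutoAWQ quantization flow; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "llava-hf/llava-1.5-7b-hf"  # placeholder checkpoint, for illustration only
quant_path = "llava-1.5-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# quantize the weights (calibration data is handled internally by default)
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)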

@sailfish009

@Abhranta Hi, one option is AutoGPTQ:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

"""
Download https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview to local
Make following edits to the config.json
LlavaLlamaForCausalLM -> LlamaForCausalLM
"model_type": "llava" -> "llama"
"""
pretrained_model_dir = "./checkpoints/llava-llama-2-13b-chat-lightning-preview"

quantized_model_dir = "llava-llama-2-13b-chat-lightning-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # 128 is the recommended group size
    desc_act=False,  # False speeds up inference significantly, at a small cost in perplexity
)

# load the un-quantized model; by default it is loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; `examples` must be a list of dicts whose only keys are "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
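
If it helps, here is a minimal sketch of loading the quantized checkpoint back for inference afterwards (standard AutoGPTQ usage; the device string and prompt are just placeholders):

# Minimal sketch: reload the quantized checkpoint for inference (assumes a CUDA device).
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

pretrained_model_dir = "./checkpoints/llava-llama-2-13b-chat-lightning-preview"
quantized_model_dir = "llava-llama-2-13b-chat-lightning-4bit-128g"

# the tokenizer is not written by save_quantized, so load it from the original checkpoint
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])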

@Abhranta (Author)

Does this quantize only the LLM, or the ViT too?

@pratyush0599

Hi @sailfish009, is there no native support for LLaVA-based models? The solution you suggested seems very hacky :( I was also wondering whether the quantization applies to the vision encoder too?
