
Continually pretrained Llama2-7B-hf model inference is not working on 16GB GPU machine #1423

Open
karkeranikitha opened this issue May 16, 2024 · 5 comments
Labels
question Further information is requested

Comments


karkeranikitha commented May 16, 2024

Hi

I am trying to load my continually pretrained Llama-2-7B model on a 16GB GPU machine. Since we cannot load the model directly using AutoModelForCausalLM.from_pretrained, I am using the approach below, which is mentioned in the repo:

import torch
from transformers import AutoModel
state_dict = torch.load("output_dir/model.pth")
model = AutoModel.from_pretrained(
    "output_dir/", state_dict=state_dict
)

I am getting out-of-memory errors when I try to load it on the GPU as well as on the CPU. I have also applied quantization, which should work on a 16GB machine, but the process gets killed abruptly.

Please find below the scripts for both.

GPU

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "output_dir/"  # directory containing the converted HF config
state_dict = torch.load("output_dir/model.pth", map_location=torch.device("cuda:0"))

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    torch_dtype=torch.float16,
    state_dict=state_dict,
    quantization_config=quantization_config,
)

CPU

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "output_dir/"  # directory containing the converted HF config
state_dict = torch.load("output_dir/model.pth")

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    state_dict=state_dict,
    quantization_config=quantization_config,
)

Is there a way to load this model on a 16GB GPU machine with 64GB of RAM?
Please suggest a solution.

rasbt (Collaborator) commented May 16, 2024

Sorry, I am not super familiar with HF and this might be more of a question for the HF forum. But in the line device_map={"": 0} for GPU, should this perhaps be device_map={"cuda": 0}?

@rasbt rasbt added the question Further information is requested label May 16, 2024
@karkeranikitha (Author)

@rasbt device_map={"cuda": 0} gives the same result. I have tried that as well.

@karkeranikitha (Author)

  1. My first question: can I run inference with a pretrained Llama-2-7B model on a 16GB GPU machine using any quantization approach?
  2. Second question: since a model converted from litgpt to the HF format cannot be loaded directly with AutoModelForCausalLM.from_pretrained, system memory is used twice. Loading the state_dict takes ~23GB, and loading the model with AutoModelForCausalLM then takes additional memory on top of that. Is there a way to do it all at once, without the redundancy (see the sketch below)? Ref
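One possible way to avoid materializing the state dict twice is to memory-map the checkpoint file. A minimal sketch, assuming PyTorch >= 2.1 (where torch.load accepts mmap=True), a checkpoint saved with PyTorch's default zip serialization, and that output_dir/ holds the converted HF config; whether mmap interacts cleanly with state_dict plus device_map plus a 4-bit config here is untested:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# mmap=True keeps the tensors on disk and pages them in on demand,
# so the ~23GB state dict is not fully copied into RAM up front
# (requires PyTorch >= 2.1 and a zip-format checkpoint).
state_dict = torch.load("output_dir/model.pth", map_location="cpu", mmap=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "output_dir/",
    state_dict=state_dict,
    device_map={"": 0},
    quantization_config=quantization_config,
)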

rasbt (Collaborator) commented May 16, 2024

As a workaround, would you be able to load the model on CPU using the approach above, save it via model.save_pretrained(save_directory) and then load it directly via AutoModelForCausalLM.from_pretrained in GPU memory?
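A minimal sketch of this workaround, assuming the CPU pass is done in plain fp16 (no quantization config) and using a hypothetical hf_checkpoint/ directory for the re-saved model; the 4-bit load on GPU afterwards mirrors the scripts above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Step 1 (CPU, system RAM only): load once with the converted state dict
# and re-save in the standard Hugging Face format.
state_dict = torch.load("output_dir/model.pth", map_location="cpu")
model = AutoModelForCausalLM.from_pretrained(
    "output_dir/", state_dict=state_dict, torch_dtype=torch.float16
)
model.save_pretrained("hf_checkpoint/")
del model, state_dict

# Step 2 (GPU): load the re-saved checkpoint directly with 4-bit
# quantization so it can fit into 16GB of GPU memory.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "hf_checkpoint/",
    device_map={"": 0},
    quantization_config=quantization_config,
)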

@karkeranikitha (Author)

I tried this approach, but the model size drops from 26GB to 3GB and the results are not as expected; it returns blank output.
