
Continually pretrained Llama2-7B-hf model inference is not working on 16GB GPU machine #1423

Open
karkeranikitha opened this issue May 16, 2024 · 5 comments
Labels
question Further information is requested

Comments


karkeranikitha commented May 16, 2024

Hi

I am trying to load my continually pretrained Llama-2-7B model on a 16GB GPU machine. Since we cannot load the model directly using AutoModelForCausalLM.from_pretrained, I am using the approach below, which is mentioned in the repo:

import torch
from transformers import AutoModel
state_dict = torch.load("output_dir/model.pth")
model = AutoModel.from_pretrained(
    "output_dir/", state_dict=state_dict
)

I am getting out-of-memory errors when I try to load it on the GPU as well as on the CPU. I have also applied quantization, which should work on a 16GB machine, but the process gets killed abruptly.

Please find below the scripts for both.

GPU

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "output_dir/"  # directory containing the converted HF config
state_dict = torch.load("output_dir/model.pth", map_location=torch.device("cuda:0"))

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    torch_dtype=torch.float16,
    state_dict=state_dict,
    quantization_config=quantization_config,
)

CPU

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "output_dir/"  # directory containing the converted HF config
state_dict = torch.load("output_dir/model.pth")

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    state_dict=state_dict,
    quantization_config=quantization_config,
)

Is there a way to load this model on a 16GB GPU machine with 64GB of RAM?
Please suggest a solution.

rasbt (Collaborator) commented May 16, 2024

Sorry, I am not super familiar with HF and this might be more of a question for the HF forum. But in the line device_map={"": 0} for GPU, should this perhaps be device_map={"cuda": 0}?

@rasbt rasbt added the question Further information is requested label May 16, 2024
@karkeranikitha (Author)

@rasbt device_map={"cuda": 0} gives the same result. I have tried that as well.

@karkeranikitha (Author)

  1. My first question: can I run inference with a pretrained Llama-2-7B model on a 16GB GPU machine using any quantization approach?
  2. Second question: since a model converted from litgpt to the HF format cannot be loaded directly with AutoModelForCausalLM.from_pretrained, system memory is used twice. Loading the state_dict takes ~23GB, and loading the model with AutoModelForCausalLM then takes additional memory on top of that. Is there a way to do it all at once, without the redundancy (see the sketch below)? Ref
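One possible way to avoid materializing the state dict twice is to memory-map the checkpoint file. A minimal sketch, assuming PyTorch >= 2.1 (where torch.load accepts mmap=True), a checkpoint saved with PyTorch's default zip serialization, and that output_dir/ holds the converted HF config; whether mmap interacts cleanly with state_dict plus device_map plus a 4-bit config here is untested:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# mmap=True keeps the tensors on disk and pages them in on demand,
# so the ~23GB state dict is not fully copied into RAM up front
# (requires PyTorch >= 2.1 and a zip-format checkpoint).
state_dict = torch.load("output_dir/model.pth", map_location="cpu", mmap=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "output_dir/",
    state_dict=state_dict,
    device_map={"": 0},
    quantization_config=quantization_config,
)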

rasbt (Collaborator) commented May 16, 2024

As a workaround, would you be able to load the model on CPU using the approach above, save it via model.save_pretrained(save_directory) and then load it directly via AutoModelForCausalLM.from_pretrained in GPU memory?
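A minimal sketch of this workaround, assuming the CPU pass is done in plain fp16 (no quantization config) and using a hypothetical hf_checkpoint/ directory for the re-saved model; the 4-bit load on GPU afterwards mirrors the scripts above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Step 1 (CPU, system RAM only): load once with the converted state dict
# and re-save in the standard Hugging Face format.
state_dict = torch.load("output_dir/model.pth", map_location="cpu")
model = AutoModelForCausalLM.from_pretrained(
    "output_dir/", state_dict=state_dict, torch_dtype=torch.float16
)
model.save_pretrained("hf_checkpoint/")
del model, state_dict

# Step 2 (GPU): load the re-saved checkpoint directly with 4-bit
# quantization so it can fit into 16GB of GPU memory.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "hf_checkpoint/",
    device_map={"": 0},
    quantization_config=quantization_config,
)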

@karkeranikitha (Author)

I tried this approach, but the model size drops from 26GB to 3GB and the results are not as expected; it returns blank output.
