Multi GPU inference on RTX 3060 fails with RuntimeError: RuntimeError: CUDA error: device-side assert triggered (Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.) #28284
Comments
Hi @levidehaan, thanks for raising this issue! Could you provide a checkpoint for a public model which replicates this issue? Does this happen for any llama model?

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import logging
import os
import json

print(torch.version.cuda)

modellocation = "/home/levi/projects/text-generation-webui/models/Upstage_SOLAR-10.7B-Instruct-v1.0"

tokenizer = AutoTokenizer.from_pretrained(modellocation)
model = AutoModelForCausalLM.from_pretrained(
    modellocation,
    device_map="auto"
)

start_time = time.time()

# Placeholder payload; the original snippet only said "fill in data here".
data = {"messages": [{"role": "user", "content": "Hello!"}]}
conversation = data['messages']

prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, use_cache=True, max_length=4096)
output_text = tokenizer.decode(outputs[0])

# The original snippet used Flask's jsonify and request here; plain json keeps it
# runnable outside an app context.
json_output = json.dumps(
    {'choices': [{'message': {'role': 'assistant', 'content': output_text}}]}
)

request_duration = time.time() - start_time
print(f"Request took {request_duration} seconds")
print(json_output)
```

Based on the error message, I'd be willing to bet it's an indexing issue.

cc @gante as this covers rope, cuda and generate :) |
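To make the indexing hunch above concrete, a quick sanity check is to confirm no token id falls outside the embedding table, since that is the classic trigger for this particular device-side assert. The snippet below is an editorial sketch, not something posted in the thread, and it assumes the `model` and `inputs` objects from the script above are already in scope.

```python
# Editorial sketch: an out-of-range input id is the most common cause of the
# "index out of bounds" assert inside the embedding lookup.
vocab_size = model.get_input_embeddings().num_embeddings
input_ids = inputs["input_ids"]
bad = ((input_ids < 0) | (input_ids >= vocab_size)).sum().item()
print(f"vocab size: {vocab_size}, out-of-range token ids: {bad}")
```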
@levidehaan Is there any follow-up on this issue? I'm hitting exactly the same error, at the same line, during LLaMA-2 inference. |
Hi @BiEchi 👋 we can look further into the issue as soon as we have a short stand-alone reproducible script. Otherwise, it will be next to impossible to reproduce the bug and, consequently, find out what's wrong :) |
Hi @gante, my code is:

[code snippet — the reproducer re-posted by @gante below]

and the error is:

[error log — the device-side assert from the issue title] |
@muellerzr this seems accelerate-related (multi-device) -- do you know what might be wrong? |
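On the multi-device angle, it can help to see where accelerate actually placed each module when `device_map="auto"` is used, since a mismatch between input placement and the embedding layer is a common way to hit a device-side index assert. The snippet below is an editorial sketch, not something posted in the thread; it uses the public Llama 2 checkpoint from the reproducer further down.

```python
# Editorial sketch: print the per-module device placement chosen by accelerate,
# plus the device transformers reports for moving inputs to.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", device_map="auto"
)
print(model.hf_device_map)  # e.g. {"model.embed_tokens": 0, ..., "lm_head": 1}
print(model.device)         # where the inputs are expected to live
```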
Gentle ping @muellerzr |
IIRC @gante was working on this, did you find the fix? |
@muellerzr I have not -- I can't even reproduce the issue, as I don't have a multi-gpu setting 😅 Would you be able to take a look? Reproducer shared above (that works fine on a single GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch

print(torch.version.cuda)

llama_path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(llama_path)
model = AutoModelForCausalLM.from_pretrained(llama_path, device_map='auto')

prompt = "Hello World!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(inputs)

with torch.inference_mode():
    logits = model(**inputs).logits[0]
print(logits)
```
|
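Because device-side asserts are reported asynchronously, the Python traceback often points at an unrelated line. A common way to localize the failing kernel is to re-run with blocking CUDA launches; the snippet below is an editorial sketch along those lines (not something proposed in the thread), wrapping the same reproducer.

```python
# Editorial sketch: force synchronous kernel launches so the device-side assert
# is raised at the Python line that actually triggered it.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llama_path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(llama_path)
model = AutoModelForCausalLM.from_pretrained(llama_path, device_map="auto")

inputs = tokenizer("Hello World!", return_tensors="pt").to(model.device)
with torch.inference_mode():
    print(model(**inputs).logits[0])
```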
Gentle ping @muellerzr |
Will give this a look today |
I think this issue might be too machine-specific, so the same code may well work on your end. It would probably be safe to close this issue. |
Sadly would have to agree as I can't reproduce it! |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Hi! Sadly I'm also running into the same problem :( The code works on a single GPU, but fails with `CUDA error: device-side assert triggered` when I use it on 2 GPUs. Have you fixed this problem yet? |
System Info
Version: transformers-4.36.2
I have two RTX 3060s and I am able to run LLMs on one GPU, but it won't work when I try to run them on two GPUs; it fails with the error from the title (`RuntimeError: CUDA error: device-side assert triggered`, with the assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed`).
Running Arch Linux.
It runs fine on a single GPU, but fails whenever I try to run on multiple GPUs. I have tried oobabooga and vLLM and neither makes a difference; it always fails. I have tried many models of many sizes and types (7B, 8x7B, 13B, 33B, 34B); AWQ doesn't work, and GPTQ doesn't work either.
My motherboard is an MSI MEG X570 with an AMD APU, and it has no option to disable ACS in the BIOS.
The GPU comms test came back good (I think).
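Since ACS cannot be disabled in the BIOS, peer-to-peer access between the two cards is worth double-checking from PyTorch's side. The snippet below is an editorial sketch, not part of the original report.

```python
# Editorial sketch: check whether PyTorch can use peer-to-peer access between
# the GPUs; ACS being forced on often disables it.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j} peer access: {ok}")
```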
Let me know what else you need me to do/run/compile or download to help get this fixed.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
My code: the script quoted in the first comment above.
Run it
python project.py
Expected behavior
It should run on 2 GPUs.