Running DeepSeek-VL2 with multiple cards #8

Open
robinren03 opened this issue Dec 16, 2024 · 5 comments

@robinren03

robinren03 commented Dec 16, 2024

I have 3 A6000 GPUs with 48GB of memory each, and I need to use TP to load the DeepSeek-VL2 model (not the tiny/small one) across the GPUs.

Here is my code.

import torch
from accelerate import infer_auto_device_map, dispatch_model
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2ForCausalLM

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
device_map = infer_auto_device_map(vl_gpt, max_memory={0: "45GiB", 1: "45GiB", 2: "45GiB"}, no_split_module_classes=["DeepseekV2DecoderLayer"])
vl_gpt = vl_gpt.to(torch.bfloat16)
vl_gpt = dispatch_model(vl_gpt, device_map=device_map).eval()

And it runs into the following problem:

Traceback (most recent call last):
File "/data3/ryanyu/llm-img/baseline/deepseek.py", line 148, in
query_one_question(os.path.join(data_dir, filename), img_ext, use_image=False)
File "/data3/ryanyu/llm-img/baseline/deepseek.py", line 54, in query_one_question
outputs = vl_gpt.language.generate(
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
result = self._sample(
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/transformers/generation/utils.py", line 3303, in _sample
next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Are there any official guidelines for using TP (i.e., MP) or PP to run the larger models? Or could you please point out my mistake? Thanks a lot!

robinren03 changed the title from "Running DeepSeek-VL2 with TP" to "Running DeepSeek-VL2 with multiple cards" on Dec 17, 2024
@robinren03
Author

Well, the code works if I make the following change:

outputs = vl_gpt.language.generate(
    input_ids=prepare_inputs["input_ids"].to(vl_gpt.device),
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

But it just prints some meaningless words, such as:

:ighb���������������� alter alter alter� alter iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod督办 iPod iPod کننده督办ospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialogtospatialogtospatialospatialospatialogtogtogtospatialogtospatialospatialospatial.mainospatial.mainospatial Gest Gest Gest Gest Gest Gest Gest Gest Gest Gest Gest'ag Gest Gest Gest Gest Gest Gest Gest'ag Gest'ag'ag'ag'ag'ag'ag909909909 Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity815815815815815815最关键815最关键815最关键 flavor flavor flavor最关键 flavor最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键 Majesty Majesty alter Majesty Majesty Majesty alter Majesty Majesty Majesty Majesty Majesty Majesty Majesty Dinner Dinner Majesty Majesty smallest smallest Majesty Majesty Osborne Osborne Osborne Osborne OsborneORES Majesty Majesty alter Majesty Majesty Majesty Majesty Majesty讷 Majesty讷� Majesty Majesty Majestyakang Majesty Majesty Majestyすること Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty została Majesty Majesty­Født alter纪念馆Født纪念馆 مطال مطال مطال Majesty Majesty替换替换 Majesty Majestytox Majesty Majesty Majesty Majesty Majesty Majesty выс Majesty­­ Majesty召开了 alter alter Majesty Majesty望着望着望着望着­望着望着望着JB Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Baseline Majesty Majesty Majesty Majesty Majesty MajestyJB/openährungährungährung NNW­­scaler­­­scaler芊­芊 Majesty â Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty对我们对我们­­­ washes Majesty進一步 Majesty粳 Majestyurope Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty pledged pledged Majesty Majesty Majesty Majesty Majesty Majestyalter którego Majesty Majesty Majesty création Majesty源性源性三部 Majesty Majesty Majestyraga Majestyraga Majesty Majesty inoculation Majestyотоотоото Majesty św آسیب Majesty آسیب Majesty Majesty Majesty Majesty Majestyews Majestyнтинтинти Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty withd Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty المرض duidelijk duidelijk duidelijk Majesty Majesty duidelijk Majesty Majesty elif Majesty elif elif elif elif elif elif elif elif

robinren03 added a commit to robinren03/DeepSeek-VL2 that referenced this issue Dec 17, 2024
@HubHop
Collaborator

HubHop commented Dec 23, 2024

Hi @robinren03, please consider this minimal code for model sharding. Based on my preliminary tests, it has successfully run our 16B and 27B models. However, I have not extensively tested which sharding strategy performs the best. You may want to experiment with tuning these parameters later based on your specific requirements.

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

def split_model(model_name):
    device_map = {}
    model_splits = {
        'deepseek-ai/deepseek-vl2-small': [13, 14], # 2 GPU for 16b
        'deepseek-ai/deepseek-vl2': [10, 10, 10], # 3 GPU for 27b
    }
    num_layers_per_gpu = model_splits[model_name]
    num_layers =  sum(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    device_map[f'language.model.layers.{num_layers - 1}'] = 0
    return device_map


# specify the path to the model
model_path = 'deepseek-ai/deepseek-vl2'
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

device_map = split_model(model_path)
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
).eval()

## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)

@DeadLining

After the update, deepseek-vl2 appears to have an issue with the KV cache:

File "/home/getui/kongsz/gt-bi-lab-mllm-research/DeepSeek-VL2-main/deepseek_vl2/models/modeling_deepseek.py", line 1292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/getui/kongsz/gt-bi-lab-mllm-research/DeepSeek-VL2-main/deepseek_vl2/models/modeling_deepseek.py", line 885, in forward
k_pe, compressed_kv = past_key_value.update(k_pe, compressed_kv, self.layer_idx, cache_kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/transformers/cache_utils.py", line 449, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

@dprokhorov17

It doesn't work for me either. I am on two H100 cards. I followed #8 (comment) but I get garbage output. Here is my code:

(I downloaded the model beforehand to local-dir)

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

def split_model():
    device_map = {}
    num_layers_per_gpu = [15, 15]
    num_layers =  sum(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    device_map[f'language.model.layers.{num_layers - 1}'] = 0
    return device_map


# specify the path to the model
model_path = 'deepseek'
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained("deepseek-ai/deepseek-vl2")
tokenizer = vl_chat_processor.tokenizer

device_map = split_model()
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
).eval()


## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>All Apps Button<|/ref|>.",
        "images": ["android_home_screen.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    input_ids = prepare_inputs["input_ids"],
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)
[two screenshots attached]

@geekchen007

geekchen007 commented Jan 21, 2025

I used three 16GB NVIDIA V100 GPUs to load the DeepSeek-VL2-small model, and it failed.

# auto loading (device_map='auto')
# vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, 
#                                                                        torch_dtype=torch.float16, 
#                                                                        device_map='auto',
#                                                                        trust_remote_code=True).eval()
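
For reference, here is a rough sketch of an explicit 3-GPU map for the small model, adapted from the split_model snippet above. This is untested: the [9, 9, 9] split (27 decoder layers total, matching the [13, 14] split mentioned earlier), the float16 dtype (V100 has no native bfloat16), and model_path pointing to the local deepseek-vl2-small weights are my assumptions.

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2ForCausalLM

def split_model_small():
    # spread the 27 decoder layers over 3 GPUs; keep everything else on GPU 0
    device_map = {}
    num_layers_per_gpu = [9, 9, 9]  # assumed split, tune for your memory budget
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    # pin the last decoder layer to GPU 0, as in the snippet above
    device_map[f'language.model.layers.{sum(num_layers_per_gpu) - 1}'] = 0
    return device_map

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # V100 has no native bfloat16 support
    device_map=split_model_small()
).eval()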
