Running DeepSeek-VL2 with multiple cards #8

Open
robinren03 opened this issue Dec 16, 2024 · 5 comments

@robinren03

robinren03 commented Dec 16, 2024

I have 3 A6000 GPUs with 48GB of memory each, and I need to use TP to load the DeepSeek-VL2 model (not the tiny/small one) across the GPUs.

Here is my code.

import torch
from accelerate import infer_auto_device_map, dispatch_model
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2ForCausalLM

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
device_map = infer_auto_device_map(vl_gpt, max_memory={0: "45GiB", 1: "45GiB", 2: "45GiB"}, no_split_module_classes=["DeepseekV2DecoderLayer"])
vl_gpt = vl_gpt.to(torch.bfloat16)
vl_gpt = dispatch_model(vl_gpt, device_map=device_map).eval()

And it runs into the following problem:

Traceback (most recent call last):
File "/data3/ryanyu/llm-img/baseline/deepseek.py", line 148, in
query_one_question(os.path.join(data_dir, filename), img_ext, use_image=False)
File "/data3/ryanyu/llm-img/baseline/deepseek.py", line 54, in query_one_question
outputs = vl_gpt.language.generate(
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
result = self._sample(
File "/home/ryanyu/anaconda3/envs/deepseek/lib/python3.10/site-packages/transformers/generation/utils.py", line 3303, in _sample
next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Are there any official guidelines for using TP (i.e., MP) or PP to run the larger models? Or could you please point out my mistake? Thanks a lot!

robinren03 changed the title from "Running DeepSeek-VL2 with TP" to "Running DeepSeek-VL2 with multiple cards" on Dec 17, 2024
@robinren03
Author

Well, the code works if I make the following change:

outputs = vl_gpt.language.generate(
    input_ids=prepare_inputs["input_ids"].to(vl_gpt.device),
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

But it just prints some meaningless words, such as:

:ighb���������������� alter alter alter� alter iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod iPod督办 iPod iPod کننده督办ospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialospatialogtospatialogtospatialospatialospatialogtogtogtospatialogtospatialospatialospatial.mainospatial.mainospatial Gest Gest Gest Gest Gest Gest Gest Gest Gest Gest Gest'ag Gest Gest Gest Gest Gest Gest Gest'ag Gest'ag'ag'ag'ag'ag'ag909909909 Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity Intensity815815815815815815最关键815最关键815最关键 flavor flavor flavor最关键 flavor最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键最关键 Majesty Majesty alter Majesty Majesty Majesty alter Majesty Majesty Majesty Majesty Majesty Majesty Majesty Dinner Dinner Majesty Majesty smallest smallest Majesty Majesty Osborne Osborne Osborne Osborne OsborneORES Majesty Majesty alter Majesty Majesty Majesty Majesty Majesty讷 Majesty讷� Majesty Majesty Majestyakang Majesty Majesty Majestyすること Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty została Majesty Majesty­Født alter纪念馆Født纪念馆 مطال مطال مطال Majesty Majesty替换替换 Majesty Majestytox Majesty Majesty Majesty Majesty Majesty Majesty выс Majesty­­ Majesty召开了 alter alter Majesty Majesty望着望着望着望着­望着望着望着JB Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Baseline Majesty Majesty Majesty Majesty Majesty MajestyJB/openährungährungährung NNW­­scaler­­­scaler芊­芊 Majesty â Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty对我们对我们­­­ washes Majesty進一步 Majesty粳 Majestyurope Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty pledged pledged Majesty Majesty Majesty Majesty Majesty Majestyalter którego Majesty Majesty Majesty création Majesty源性源性三部 Majesty Majesty Majestyraga Majestyraga Majesty Majesty inoculation Majestyотоотоото Majesty św آسیب Majesty آسیب Majesty Majesty Majesty Majesty Majestyews Majestyнтинтинти Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty withd Majesty Majesty Majesty Majesty Majesty Majesty Majesty Majesty المرض duidelijk duidelijk duidelijk Majesty Majesty duidelijk Majesty Majesty elif Majesty elif elif elif elif elif elif elif elif

robinren03 added a commit to robinren03/DeepSeek-VL2 that referenced this issue Dec 17, 2024
@HubHop
Collaborator

HubHop commented Dec 23, 2024

Hi @robinren03, please consider this minimal code for model sharding. Based on my preliminary tests, it has successfully run our 16B and 27B models. However, I have not extensively tested which sharding strategy performs the best. You may want to experiment with tuning these parameters later based on your specific requirements.

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

def split_model(model_name):
    device_map = {}
    model_splits = {
        'deepseek-ai/deepseek-vl2-small': [13, 14], # 2 GPU for 16b
        'deepseek-ai/deepseek-vl2': [10, 10, 10], # 3 GPU for 27b
    }
    num_layers_per_gpu = model_splits[model_name]
    num_layers =  sum(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    device_map[f'language.model.layers.{num_layers - 1}'] = 0
    return device_map


# specify the path to the model
model_path = 'deepseek-ai/deepseek-vl2'
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

device_map = split_model(model_path)
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
).eval()

## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)

@DeadLining

After the update, deepseek-vl2 appears to have an issue with the KV cache:

File "/home/getui/kongsz/gt-bi-lab-mllm-research/DeepSeek-VL2-main/deepseek_vl2/models/modeling_deepseek.py", line 1292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/getui/kongsz/gt-bi-lab-mllm-research/DeepSeek-VL2-main/deepseek_vl2/models/modeling_deepseek.py", line 885, in forward
k_pe, compressed_kv = past_key_value.update(k_pe, compressed_kv, self.layer_idx, cache_kwargs)
File "/opt/anaconda3/envs/deepseek-vl/lib/python3.10/site-packages/transformers/cache_utils.py", line 449, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

@dprokhorov17

It doesn't work for me either. I am on two H100 cards. I followed #8 (comment) but I get garbage output. Here is my code:

(I downloaded the model beforehand to local-dir)

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

def split_model():
    device_map = {}
    num_layers_per_gpu = [15, 15]
    num_layers =  sum(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    device_map[f'language.model.layers.{num_layers - 1}'] = 0
    return device_map


# specify the path to the model
model_path = 'deepseek'
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained("deepseek-ai/deepseek-vl2")
tokenizer = vl_chat_processor.tokenizer

device_map = split_model()
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
).eval()


## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>All Apps Button<|/ref|>.",
        "images": ["android_home_screen.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    input_ids = prepare_inputs["input_ids"],
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)
[two screenshots attached]

@geekchen007

geekchen007 commented Jan 21, 2025

I used three 16GB NVIDIA V100 GPUs to load the DeepSeek-VL2-small model, and it failed.

# auto loading (device_map='auto')
# vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, 
#                                                                        torch_dtype=torch.float16, 
#                                                                        device_map='auto',
#                                                                        trust_remote_code=True).eval()
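
For reference, here is a rough sketch of an explicit 3-GPU map for the small model, adapted from the split_model snippet above. This is untested: the [9, 9, 9] split (27 decoder layers total, matching the [13, 14] split mentioned earlier), the float16 dtype (V100 has no native bfloat16), and model_path pointing to the local deepseek-vl2-small weights are my assumptions.

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2ForCausalLM

def split_model_small():
    # spread the 27 decoder layers over 3 GPUs; keep everything else on GPU 0
    device_map = {}
    num_layers_per_gpu = [9, 9, 9]  # assumed split, tune for your memory budget
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision'] = 0
    device_map['projector'] = 0
    device_map['image_newline'] = 0
    device_map['view_seperator'] = 0
    device_map['language.model.embed_tokens'] = 0
    device_map['language.model.norm'] = 0
    device_map['language.lm_head'] = 0
    # pin the last decoder layer to GPU 0, as in the snippet above
    device_map[f'language.model.layers.{sum(num_layers_per_gpu) - 1}'] = 0
    return device_map

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # V100 has no native bfloat16 support
    device_map=split_model_small()
).eval()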
