
Loading DeepSeek R1 model takes an extremely long time #37160

Closed
@Neo9061

Description


System Info

Following the recently merged PR and the release notes, I tried to load the DeepSeek R1 model with the code snippet below on a single P5EN instance (8x H200 GPUs).

  1. The first issue is that loading takes a very long time, estimated at ~10 hours.
  2. I then modified config.json and model.safetensors.index.json so that only the first 10 layers (plus the embed_tokens and lm_head modules) are loaded. However, this hit the error below. The error goes away if I first use DeepSeek's conversion script to convert the checkpoint from FP8 to BF16 (see the minimal dtype-mismatch sketch after the traceback).
Some parameters are on the meta device because they were offloaded to the cpu.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Traceback (most recent call last):
  File "/iofsx/sds3/models/DeepSeekV3/test.py", line 18, in <module>
    outputs = model.generate(inputs, max_new_tokens=50)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/generation/utils.py", line 2370, in generate
    result = self._sample(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/generation/utils.py", line 3331, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 1025, in forward
    outputs = self.model(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 773, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 513, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 423, in forward
    q_states = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(query_shape).transpose(1, 2)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Float8_e4m3fn
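
For context, the dtype mismatch above can be reproduced without the checkpoint: the original DeepSeek R1 weights are stored in FP8 (float8_e4m3fn), and if a linear layer's weight is left in FP8 while the activations are BF16, F.linear fails with the same error. A minimal standalone sketch, assuming a PyTorch build with float8_e4m3fn support (2.1+):

# Standalone sketch of the dtype mismatch; does not touch the checkpoint.
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, dtype=torch.bfloat16)                          # BF16 activations
w = torch.randn(8, 4, dtype=torch.float32).to(torch.float8_e4m3fn)   # FP8 weight, as stored in the checkpoint

try:
    F.linear(x, w)
except RuntimeError as e:
    # expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Float8_e4m3fn
    print(e)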

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code snippet used to load the model and generate text:

# `run_deepseek_v1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)
model_path = "MYMODEL_PATH"
tokenizer = AutoTokenizer.from_pretrained(model_path)

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]


model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
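
Since loading the full model takes hours, a quicker way to confirm whether the shards are still FP8 is to inspect a single safetensors shard directly. This is a hypothetical diagnostic, not part of the repro; the shard filename below is a placeholder:

from safetensors import safe_open

shard = "MYMODEL_PATH/model-00001-of-XXXXX.safetensors"  # placeholder shard path
with safe_open(shard, framework="pt", device="cpu") as f:
    for key in list(f.keys())[:5]:
        # FP8 shards will report torch.float8_e4m3fn here
        print(key, f.get_tensor(key).dtype)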

Expected behavior

NA
