
Loading DeepSeek R1 model takes an extremely long time #37160

Closed
@Neo9061

Description


System Info

Following the recently merged PR and the release notes, I tried to load the DeepSeek R1 model with the code snippet below on a single P5EN instance (8x H200 GPUs).

  1. The first issue is that loading takes a very long time, estimated at ~10 hours.
  2. I then modified config.json and model.safetensors.index.json so that only the first 10 layers (plus the embed_tokens and lm_head modules) are loaded. However, this hit the error below. The error goes away if I first use DeepSeek's conversion script to convert the checkpoint from FP8 to BF16 (see the minimal dtype-mismatch sketch after the traceback).
Some parameters are on the meta device because they were offloaded to the cpu.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Traceback (most recent call last):
  File "/iofsx/sds3/models/DeepSeekV3/test.py", line 18, in <module>
    outputs = model.generate(inputs, max_new_tokens=50)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/generation/utils.py", line 2370, in generate
    result = self._sample(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/generation/utils.py", line 3331, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 1025, in forward
    outputs = self.model(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 773, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 513, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/transformers/models/deepseek_v3/modeling_deepseek_v3.py", line 423, in forward
    q_states = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(query_shape).transpose(1, 2)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/fix/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Float8_e4m3fn
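
For context, the dtype mismatch above can be reproduced without the checkpoint: the original DeepSeek R1 weights are stored in FP8 (float8_e4m3fn), and if a linear layer's weight is left in FP8 while the activations are BF16, F.linear fails with the same error. A minimal standalone sketch, assuming a PyTorch build with float8_e4m3fn support (2.1+):

# Standalone sketch of the dtype mismatch; does not touch the checkpoint.
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, dtype=torch.bfloat16)                          # BF16 activations
w = torch.randn(8, 4, dtype=torch.float32).to(torch.float8_e4m3fn)   # FP8 weight, as stored in the checkpoint

try:
    F.linear(x, w)
except RuntimeError as e:
    # expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Float8_e4m3fn
    print(e)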

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code snippet used to load the model and generate text:

# `run_deepseek_v1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)
model_path = "MYMODEL_PATH"
tokenizer = AutoTokenizer.from_pretrained(model_path)

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]


model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
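
Since loading the full model takes hours, a quicker way to confirm whether the shards are still FP8 is to inspect a single safetensors shard directly. This is a hypothetical diagnostic, not part of the repro; the shard filename below is a placeholder:

from safetensors import safe_open

shard = "MYMODEL_PATH/model-00001-of-XXXXX.safetensors"  # placeholder shard path
with safe_open(shard, framework="pt", device="cpu") as f:
    for key in list(f.keys())[:5]:
        # FP8 shards will report torch.float8_e4m3fn here
        print(key, f.get_tensor(key).dtype)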

Expected behavior

NA
