
Error when using trainer with default data parallelism enabled: RuntimeError: chunk expects at least a 1-dimensional tensor #37151

@Mekadrom

Description


System Info

transformers-cli env output:

  • transformers version: 4.50.3
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.29.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.5.2
  • Accelerate config: not found
  • DeepSpeed version: 0.16.4
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes, implicitly
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4090

Other pertinent versions:

(venv) user@vm:~/dev/projects/project$ python3 -c 'import torch; print(torch.version.cuda)'
12.4
(venv) user@vm:~/dev/projects/project$ nvidia-smi
...
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
...

It only shows in the full nvidia-smi output (truncated above), but there are two 4090s in this machine, and I can tell the Trainer is using both by default from the spike in memory usage and GPU utilization on both cards when the script starts.
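
For context, as far as I can tell this multi-GPU behavior is implicit: when more than one CUDA device is visible and the script is not started with a distributed launcher, the Trainer falls back to torch.nn.DataParallel. A minimal sketch of that fallback (the Linear module is just a stand-in for illustration):

import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # hypothetical stand-in for the real model
if torch.cuda.device_count() > 1:
    # Roughly what Trainer does when it sees multiple GPUs and no
    # distributed launcher: wrap the model in DataParallel, which
    # scatters every tensor in the forward kwargs across device_ids.
    model = nn.DataParallel(model.cuda())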

Who can help?

@zach-huggingface @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

First off, the docs bot is not working as of the creation of this issue, so I did not have that as a resource.

Full traceback:

Traceback (most recent call last):
  File "/home/user/dev/projects/project/min_repro.py", line 56, in <module>
    trainer.train()
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3783, in compute_loss
    outputs = model(**inputs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
    inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 207, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 89, in scatter_kwargs
    scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 75, in scatter
    res = scatter_map(inputs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 66, in scatter_map
    return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 58, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 103, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 205, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: chunk expects at least a 1-dimensional tensor
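
For what it's worth, the error itself is trivial to trigger in isolation: DataParallel's scatter calls torch.chunk on every tensor it finds in the forward kwargs, and chunk rejects 0-dimensional tensors. My guess (unverified) is that a scalar tensor such as the num_items_in_batch value the Trainer injects into the model inputs is what gets scattered here:

import torch

t = torch.tensor(42)  # 0-dim scalar tensor
print(t.dim())        # prints 0
t.chunk(2)            # RuntimeError: chunk expects at least a 1-dimensional tensor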

Here is the minimal reproduction code that I used to produce this output:

from datasets import load_dataset
from transformers import AutoConfig, GPT2LMHeadModel, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=512,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

training_args = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="steps",  # renamed from evaluation_strategy in recent transformers releases
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
)
model = GPT2LMHeadModel(config)

dataset = load_dataset("wikitext", "wikitext-2-v1", streaming=False)
def tokenize_mapping(examples):
    # Tokenize with overflow so long documents are split into 512-token chunks,
    # then keep only the chunks that are exactly 512 tokens long.
    outputs = tokenizer(examples["text"], truncation=True, max_length=512, return_overflowing_tokens=True, return_length=True)
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == 512:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
dataset = dataset.map(tokenize_mapping, batched=True, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)

print(f"Starting training with {sum(p.numel() for p in model.parameters()):,} parameters")
print(f"Model: {model}")
print(f"Tokenizer: {tokenizer}")

trainer.train()

Using this official doc as a guide.

Expected behavior

I would expect this very slightly modified training script to create and train a GPT-2 model on the wikitext-2-v1 dataset. Instead, an exception is raised that points at the data-parallelism integration. Also potentially of note: I tried loading the pretrained gpt2 checkpoint and finetuning it with a modified version of this script (both with and without device_map='balanced' in the call to from_pretrained), and it produced the same chunking error.
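
In case it helps with triage: a workaround that should sidestep the DataParallel path entirely (assuming that path is the trigger) is to pin the script to a single GPU before torch initializes CUDA, or to launch with torchrun/accelerate so that DDP is used instead of DataParallel. A sketch of the single-GPU pin:

import os
# Must be set before torch initializes CUDA; hides the second 4090 so
# the Trainer sees one GPU and never wraps the model in DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expect 1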
