
Error when using trainer with default data parallelism enabled: RuntimeError: chunk expects at least a 1-dimensional tensor #37151

@Mekadrom

Description


System Info

transformers-cli env output:

  • transformers version: 4.50.3
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.29.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.5.2
  • Accelerate config: not found
  • DeepSpeed version: 0.16.4
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes, implicitly
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4090

Other pertinent versions:

(venv) user@vm:~/dev/projects/project$ python3 -c 'import torch; print(torch.version.cuda)'
12.4
(venv) user@vm:~/dev/projects/project$ nvidia-smi
...
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
...

It only shows in the full nvidia-smi output (truncated above), but there are two 4090s in this machine, and I can tell the Trainer is using both by default from the spike in memory usage and GPU utilization on both cards when the script starts.
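
For context, as far as I can tell this multi-GPU behavior is implicit: when more than one CUDA device is visible and the script is not started with a distributed launcher, the Trainer falls back to torch.nn.DataParallel. A minimal sketch of that fallback (the Linear module is just a stand-in for illustration):

import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # hypothetical stand-in for the real model
if torch.cuda.device_count() > 1:
    # Roughly what Trainer does when it sees multiple GPUs and no
    # distributed launcher: wrap the model in DataParallel, which
    # scatters every tensor in the forward kwargs across device_ids.
    model = nn.DataParallel(model.cuda())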

Who can help?

@zach-huggingface @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

First off, the docs bot is not working as of the creation of this issue, so I did not have that as a resource.

Full traceback:

Traceback (most recent call last):
  File "/home/user/dev/projects/project/min_repro.py", line 56, in <module>
    trainer.train()
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3783, in compute_loss
    outputs = model(**inputs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
    inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 207, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 89, in scatter_kwargs
    scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 75, in scatter
    res = scatter_map(inputs)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 66, in scatter_map
    return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 58, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 103, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 205, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: chunk expects at least a 1-dimensional tensor
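
For what it's worth, the error itself is trivial to trigger in isolation: DataParallel's scatter calls torch.chunk on every tensor it finds in the forward kwargs, and chunk rejects 0-dimensional tensors. My guess (unverified) is that a scalar tensor such as the num_items_in_batch value the Trainer injects into the model inputs is what gets scattered here:

import torch

t = torch.tensor(42)  # 0-dim scalar tensor
print(t.dim())        # prints 0
t.chunk(2)            # RuntimeError: chunk expects at least a 1-dimensional tensor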

Here is the minimal reproduction code that I used to produce this output:

from datasets import load_dataset
from transformers import AutoConfig, GPT2LMHeadModel, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=512,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

training_args = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="steps",  # renamed from evaluation_strategy in recent transformers releases
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
)
model = GPT2LMHeadModel(config)

dataset = load_dataset("wikitext", "wikitext-2-v1", streaming=False)
def tokenize_mapping(examples):
    # Tokenize with overflow so long documents are split into 512-token chunks,
    # then keep only the chunks that are exactly 512 tokens long.
    outputs = tokenizer(examples["text"], truncation=True, max_length=512, return_overflowing_tokens=True, return_length=True)
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == 512:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
dataset = dataset.map(tokenize_mapping, batched=True, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)

print(f"Starting training with {sum(p.numel() for p in model.parameters()):,} parameters")
print(f"Model: {model}")
print(f"Tokenizer: {tokenizer}")

trainer.train()

Using this official doc as a guide.

Expected behavior

I would expect this very slightly modified training script to create and train a GPT-2 model on the wikitext-2-v1 dataset. Instead, an exception is raised that points at the data-parallelism integration. Also potentially of note: I tried loading the pretrained gpt2 checkpoint and finetuning it with a modified version of this script (both with and without device_map='balanced' in the call to from_pretrained), and it produced the same chunking error.
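
In case it helps with triage: a workaround that should sidestep the DataParallel path entirely (assuming that path is the trigger) is to pin the script to a single GPU before torch initializes CUDA, or to launch with torchrun/accelerate so that DDP is used instead of DataParallel. A sketch of the single-GPU pin:

import os
# Must be set before torch initializes CUDA; hides the second 4090 so
# the Trainer sees one GPU and never wraps the model in DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expect 1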
