System Info
transformers-cli env output:
- transformers version: 4.50.3
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.29.3
- Safetensors version: 0.5.3
- Accelerate version: 1.5.2
- Accelerate config: not found
- DeepSpeed version: 0.16.4
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: yes, implicitly
- Using GPU in script?: yes
- GPU type: NVIDIA GeForce RTX 4090
Other pertinent versions:
(venv) user@vm:~/dev/projects/project$ python3 -c 'import torch; print(torch.version.cuda)'
12.4
(venv) user@vm:~/dev/projects/project$ nvidia-smi
...
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
...
The truncated nvidia-smi output above doesn't show it, but there are two 4090s in this machine, and I can tell the Trainer is using both by default from the spike in memory usage and GPU utilization on both cards when the script starts.
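For reference, here is how I checked what the process sees; my understanding is that with two visible GPUs and no distributed launcher, the Trainer falls back to torch.nn.DataParallel:

import torch

# Both 4090s are visible to this process, which (as far as I can tell)
# is why the Trainer wraps the model in torch.nn.DataParallel.
print(torch.cuda.device_count())      # 2 on this machine
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4090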
Who can help?
@zach-huggingface @ArthurZucker
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
First off, the docs bot was not working at the time I filed this issue, so I could not use it as a resource.
Full traceback:
Traceback (most recent call last):
File "/home/user/dev/projects/project/min_repro.py", line 56, in <module>
trainer.train()
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
return inner_training_loop(
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3783, in compute_loss
outputs = model(**inputs)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 207, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 89, in scatter_kwargs
scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 75, in scatter
res = scatter_map(inputs)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 66, in scatter_map
return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 58, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 103, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/user/dev/projects/project/venv/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 205, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: chunk expects at least a 1-dimensional tensor
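The RuntimeError itself is easy to reproduce in isolation: DataParallel's scatter chunks every tensor it is handed along dim 0, and a zero-dimensional tensor cannot be chunked. My guess (untested) is that the offending value is the scalar num_items_in_batch that compute_loss forwards to the model as a kwarg, but I'm only inferring that from the traceback:

import torch

# Minimal illustration of the underlying PyTorch error (not the Trainer itself):
# chunking/scattering requires at least one dimension to split along.
scalar = torch.tensor(42)      # zero-dimensional tensor
torch.chunk(scalar, 2, dim=0)  # RuntimeError: chunk expects at least a 1-dimensional tensor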
Here is the minimal reproduction code that I used to produce this output:
from datasets import load_dataset
from transformers import AutoConfig, GPT2LMHeadModel, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=512,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

training_args = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
)

model = GPT2LMHeadModel(config)

dataset = load_dataset("wikitext", "wikitext-2-v1", streaming=False)

def tokenize_mapping(examples):
    # Keep only sequences that fill the full 512-token context window.
    outputs = tokenizer(examples["text"], truncation=True, max_length=512, return_overflowing_tokens=True, return_length=True)
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == 512:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

dataset = dataset.map(tokenize_mapping, batched=True, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=training_args,
    # Causal language modeling, so no masked-LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)

print(f"Starting training with {sum(p.numel() for p in model.parameters()):,} parameters")
print(f"Model: {model}")
print(f"Tokenizer: {tokenizer}")
trainer.train()

I used this official doc as a guide.
Expected behavior
I would expect this lightly modified training script to create and train a GPT-2 model on the wikitext-2-v1 dataset. Instead, an exception is raised in the data-parallel (nn.DataParallel) scatter step. Possibly also of note: I tried loading the pretrained gpt2 checkpoint and fine-tuning it with a modified version of this script (both with and without device_map='balanced' in the call to from_pretrained), and it produces the same chunking error.
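If it helps with triage: I expect that hiding one of the GPUs would sidestep the nn.DataParallel path entirely, though that is an untested assumption on my part and a workaround rather than a fix:

import os

# Untested workaround idea: expose a single GPU before torch/transformers
# are imported, so the Trainer never wraps the model in nn.DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

The rest of min_repro.py would stay unchanged.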