
DataLoader2 with FullSyncIterDataPipe throws error during initialization #1190

Open
chenxingyu-cs opened this issue Jun 19, 2023 · 3 comments

🐛 Describe the bug

Hi, we found some strange behavior while using DataLoader2. Here are some details about the issue.

  • We are running a long training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
  • During HuggingFace training, evaluation is called every training_args.eval_steps training steps.
  • I overrode the HF Trainer to use DataLoader2 for training, evaluation, and test dataset loading. On the dataset side, I'm using an IterableDataPipe with ShardingFilterIterDataPipe (a rough sketch of this setup is shown after this list).
  • The issue shown in the log below happens randomly, and most of the time it happens after the job has been running for a long time (e.g. 20+ hours).
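
For reference, a simplified, hypothetical sketch of what the override looks like (not the exact production code; MyTrainer and build_eval_datapipe are placeholder names, and I'm assuming DistributedReadingService is what attaches the FullSyncIterDataPipe with timeout=1800 that appears in the error below):

```python
# Simplified sketch, not the exact production code.
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper
from transformers import Trainer


def build_eval_datapipe(samples, batch_size=8):
    # ShardingFilterIterDataPipe and CollatorIterDataPipe via their functional
    # forms; the CollatorIterDataPipe matches the one named in the traceback.
    return IterableWrapper(samples).sharding_filter().batch(batch_size).collate()


class MyTrainer(Trainer):
    def get_eval_dataloader(self, eval_dataset=None):
        dp = build_eval_datapipe(eval_dataset or self.eval_dataset)
        # DistributedReadingService adds a fullsync step (default timeout 1800 s),
        # which appears to be the FullSyncIterDataPipe in the error message.
        return DataLoader2(dp, reading_service=DistributedReadingService())
```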

Can you help provide some context on what could be the root cause and how to fix this? Thanks!

Log (2023-06-08T08:51:15.973-07:00):

  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
    next_val = next(self.dataloader._datapipe_iter)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
    response = gen.send(None)
  File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
    self._process_group = dist.new_group(backend="gloo")
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
    pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
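
For what it's worth, here is a rough, hypothetical sketch of the pattern I suspect is involved (placeholder data and loop counts, not our production code): every evaluation round rebuilds the eval DataLoader2, and iterating it reaches the dist.new_group(backend="gloo") call shown in the traceback, so if the eval dataloader is rebuilt every round, a new gloo group is requested each time over a very long run.

```python
# Rough, hypothetical sketch of the suspected pattern; the data and loop
# counts are placeholders, not our production code.
import torch.distributed as dist
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper

assert dist.is_initialized()  # the DDP job itself is initialized elsewhere (torchx)

for eval_round in range(1000):  # stands in for periodic evaluation over 20+ hours
    dp = IterableWrapper(range(64)).sharding_filter().batch(8).collate()
    dl = DataLoader2(dp, reading_service=DistributedReadingService())
    for batch in dl:  # first next() reaches dist.new_group(backend="gloo"),
        pass          # per the FullSyncIterDataPipe.__iter__ frame in the traceback
    dl.shutdown()     # finalize the reading service before the next round
```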

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchsnapshot-nightly     2023.3.15                pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] torchx-nightly            2023.5.25                pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
chenxingyu-cs (Author) commented

@ejguan Hi, can you share any insights you have? Thanks a lot!

ejguan (Contributor) commented Jun 20, 2023

Are you running multiple DDP jobs at the same time?

chenxingyu-cs (Author) commented

@ejguan I'm only running one DDP job, which is initialized by torchx. I got these errors while running the job on AWS Batch and on SageMaker, where I believe all the instances are isolated and no other job should be running.
