torch.distributed.elastic.multiprocessing.errors.ChildFailedError #1084

Qiqing-Fu · 2024-03-28T08:41:53Z


Describe the bug

I have only one GPU. When I run the test code below, the following error appears, and I don't know how to turn off distributed training.

#torchrun --nnodes 1 --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir /home/liuyunhe/Project_Unsupervised_clustering/Methods/LLMA/llma2_7b/llama-2-7b/ \
    --tokenizer_path /home/liuyunhe/Project_Unsupervised_clustering/Methods/LLMA/llma2_7b/tokenizer.model \
    --max_seq_len 128 --max_batch_size 1

Output

<ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 65253) of binary: /home/liuyunhe/anaconda3/envs/LLMA/bin/py
Traceback (most recent call last):
  File "/home/liuyunhe/anaconda3/envs/LLMA/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapp
    return f(*args, **kwargs)
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-28_16:01:51
  host      : liulab-rtx8k
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 65253)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
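
The summary above shows `error_file: <N/A>`, so the child process's own exception is not included in it. The documentation page linked on the last line describes the `record` decorator for this case. Below is a minimal sketch of that approach, not the actual contents of example_chat_completion.py: the function name `main` and its body are placeholders, and the `ImportError` fallback is only there so the snippet also runs on a machine without PyTorch.

```python
# Sketch only: `main` and its body are placeholders, not code taken from
# example_chat_completion.py.
try:
    # Real PyTorch API: records an uncaught exception from the decorated
    # function so the torchrun failure summary can report it.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration: a no-op stand-in when PyTorch is absent.
    def record(fn):
        return fn


@record
def main():
    # Placeholder for the script's real entry point (model loading,
    # generation, and so on). Any exception raised here is captured
    # by `record` before the process exits.
    return "ok"


if __name__ == "__main__":
    main()
```

With the entry point decorated this way, the `error_file` and `traceback` fields of the summary should carry the child's exception instead of `<N/A>`. Independently of that, the child process usually prints its own Python traceback to the terminal before the torchrun summary, so the lines above `ERROR:torch.distributed.elastic.multiprocessing.api:failed` are worth checking too.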


Runtime Environment

Model: [eg: llama-2-7b-chat]
Using via huggingface?: [no]
OS: [eg. Linux]
GPU VRAM: 40G
Number of GPUs: 1
GPU Make: Nvidia


armitamani · 2024-04-11T17:18:16Z

I am getting the same error. Have you solved it?
