Before submitting a bug, please make sure the issue hasn't already been addressed by searching through the FAQs and existing/past issues.

Describe the bug

I only have 1 GPU. When I run the example command below, it fails with the error shown under Output, and I don't know how to turn off the distributed training.

torchrun --nnodes 1 --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir /home/liuyunhe/Project_Unsupervised_clustering/Methods/LLMA/llma2_7b/llama-2-7b/ \
    --tokenizer_path /home/liuyunhe/Project_Unsupervised_clustering/Methods/LLMA/llma2_7b/tokenizer.model \
    --max_seq_len 128 --max_batch_size 1

Output

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 65253) of binary: /home/liuyunhe/anaconda3/envs/LLMA/bin/py
Traceback (most recent call last):
  File "/home/liuyunhe/anaconda3/envs/LLMA/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuyunhe/anaconda3/envs/LLMA/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-03-28_16:01:51
  host       : liulab-rtx8k
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 65253)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Runtime Environment

Model: llama-2-7b-chat

Additional context

Add any other context about the problem or environment here.
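The "error_file : <N/A>" line above means elastic never captured the child process's real exception, so the actual root cause stays hidden. Following the PyTorch docs linked in the output, wrapping the script's entry point with the @record decorator makes torchrun surface that traceback. A minimal sketch of the pattern (the file name and placeholder failure below are illustrative, not from this repo; the same decorator would go on main() in example_chat_completion.py):

# enable_traceback.py — sketch of the @record pattern from
# https://pytorch.org/docs/stable/elastic/errors.html
from torch.distributed.elastic.multiprocessing.errors import record


@record  # torchrun now records this function's traceback instead of "error_file: <N/A>"
def main() -> None:
    # ... the existing body of example_chat_completion.py's main() would go here ...
    raise RuntimeError("placeholder failure so the captured traceback is visible")


if __name__ == "__main__":
    main()

Launching this with torchrun --nproc_per_node 1 enable_traceback.py should print the RuntimeError's full traceback in the "Root Cause" section, which is the information missing from the output above.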
I am getting the same error. Have you solved it?