UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled #200

Closed
stromyu520 opened this issue May 8, 2024 · 9 comments

stromyu520 commented May 8, 2024

W0509 01:09:39.797000 8201419456 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.

UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled

warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
Traceback (most recent call last):

fbnav self-assigned this May 8, 2024
fbnav (Contributor) commented May 8, 2024

Hi, could you provide more information on what platform/OS you are trying to run this on? Also, please try reinstalling PyTorch and running it again. You can find the install instructions here: https://pytorch.org/get-started/locally/

fbnav added the needs-more-information label May 8, 2024
tungts1101 commented May 10, 2024

I got the same error. My OS is Windows 11. Here is the output of pip show torch:

Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: d:\anaconda3\envs\system\lib\site-packages
Requires: filelock, fsspec, jinja2, mkl, networkx, sympy, typing-extensions
Required-by: fairscale, llama3, torchaudio, torchvision

tungts1101 commented:

It seems that Windows doesn't support the NCCL backend. Does that mean I can only run llama3 on a Linux-based machine?
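For anyone hitting this, you can check which backends your installed build actually supports before launching anything — a minimal sketch:

```python
import torch.distributed as dist

# NCCL only ships with the Linux CUDA builds; Windows wheels are built without it.
print("NCCL available:", dist.is_nccl_available())
# Gloo works on both Windows and Linux (CPU, with limited GPU support).
print("Gloo available:", dist.is_gloo_available())
```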

tungts1101 commented:

I tried again with Ubuntu 22.04 installed under WSL. The NCCL error disappeared, but I still get this error when trying to run the example:

E0510 15:01:32.269000 139785519843136 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 401) of binary: /home/tran/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/tran/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

fbnav (Contributor) commented May 10, 2024

Could you please provide the complete error message and your hardware specs, along with the code you tried to run?

NCCL isn't supported on Windows. If you are running on Windows, could you please try torch.distributed.init_process_group(backend='gloo') and see if that works?
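Roughly like this — a minimal single-process smoke test (under torchrun the rank/world-size and master env vars are set for you; the defaults below are only for running it standalone):

```python
import os
import torch.distributed as dist

# For a standalone smoke test; torchrun normally provides these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("initialized:", dist.is_initialized())
dist.destroy_process_group()
```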

tungts1101 commented:

Above is the complete error message, from running the example_chat_completion.py example in the README. The OS is Ubuntu 22.04 with an Intel Core i5 and an RTX 3050 Laptop GPU.

tungts1101 commented:

I think the root cause is that the hardware doesn't meet the minimum requirements to run the llama3 8B model.
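That would fit the log: exitcode -9 is SIGKILL, which on Linux usually means the kernel's OOM killer stopped the process while the checkpoint was loading. A rough sanity check of host memory (standard library only; the sysconf names are Linux-specific):

```python
import os

# Total physical RAM on this machine (Linux-only sysconf names).
total_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"Total RAM: {total_gib:.1f} GiB")
# The 8B checkpoint is ~16 GB in fp16; if total RAM (plus VRAM) is well
# below that, a -9 (SIGKILL) during loading is the expected failure mode.
```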

fbnav (Contributor) commented May 13, 2024

Yes, it might be that. You will need a minimum of ~16 GB of VRAM to run the 8B model in fp16 precision.
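That figure is just parameter count times bytes per parameter, before activations and the KV cache:

```python
params = 8e9          # Llama 3 8B
bytes_per_param = 2   # fp16
print(f"~{params * bytes_per_param / 1024**3:.0f} GiB for the weights alone")  # ~15 GiB
```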

fbnav (Contributor) commented May 21, 2024

Closing this issue. Feel free to re-open if the issue persists.

fbnav closed this as completed May 21, 2024