UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled #200

Closed
stromyu520 opened this issue May 8, 2024 · 9 comments

stromyu520 commented May 8, 2024

W0509 01:09:39.797000 8201419456 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.

UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled

warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
Traceback (most recent call last):

fbnav self-assigned this May 8, 2024
fbnav (Contributor) commented May 8, 2024

Hi, could you provide more information on what platform/OS you are trying to run this on? Also, please try reinstalling PyTorch and running it again. You can find the install instructions here: https://pytorch.org/get-started/locally/

fbnav added the needs-more-information label May 8, 2024
tungts1101 commented May 10, 2024

I got the same error. My OS is Windows 11. Here is the output of pip show torch:

Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: d:\anaconda3\envs\system\lib\site-packages
Requires: filelock, fsspec, jinja2, mkl, networkx, sympy, typing-extensions
Required-by: fairscale, llama3, torchaudio, torchvision

tungts1101 commented:

It seems that Windows doesn't support the NCCL backend. Does that mean I can only run llama3 on a Linux-based machine?
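For anyone hitting this, you can check which backends your installed build actually supports before launching anything — a minimal sketch:

```python
import torch.distributed as dist

# NCCL only ships with the Linux CUDA builds; Windows wheels are built without it.
print("NCCL available:", dist.is_nccl_available())
# Gloo works on both Windows and Linux (CPU, with limited GPU support).
print("Gloo available:", dist.is_gloo_available())
```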

tungts1101 commented:

I tried again with Ubuntu 22.04 installed under WSL. The NCCL error disappeared, but I still get this error when trying to run the example:

E0510 15:01:32.269000 139785519843136 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 401) of binary: /home/tran/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/tran/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

fbnav (Contributor) commented May 10, 2024

Could you please provide the complete error message and your hardware specs, along with the code you tried to run?

NCCL isn't supported on Windows. If you are running on Windows, could you please try torch.distributed.init_process_group(backend='gloo') and see if that works?
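Roughly like this — a minimal single-process smoke test (under torchrun the rank/world-size and master env vars are set for you; the defaults below are only for running it standalone):

```python
import os
import torch.distributed as dist

# For a standalone smoke test; torchrun normally provides these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("initialized:", dist.is_initialized())
dist.destroy_process_group()
```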

tungts1101 commented:

Above is the complete error message, from running the example_chat_completion.py example in the README. The OS is Ubuntu 22.04 with an Intel Core i5 and an RTX 3050 Laptop GPU.

tungts1101 commented:

I think the root cause is that the hardware doesn't meet the minimum requirements to run the llama3 8B model.
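That would fit the log: exitcode -9 is SIGKILL, which on Linux usually means the kernel's OOM killer stopped the process while the checkpoint was loading. A rough sanity check of host memory (standard library only; the sysconf names are Linux-specific):

```python
import os

# Total physical RAM on this machine (Linux-only sysconf names).
total_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"Total RAM: {total_gib:.1f} GiB")
# The 8B checkpoint is ~16 GB in fp16; if total RAM (plus VRAM) is well
# below that, a -9 (SIGKILL) during loading is the expected failure mode.
```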

fbnav (Contributor) commented May 13, 2024

Yes, it might be that. You will need a minimum of ~16 GB of VRAM to run the 8B model in fp16 precision.
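That figure is just parameter count times bytes per parameter, before activations and the KV cache:

```python
params = 8e9          # Llama 3 8B
bytes_per_param = 2   # fp16
print(f"~{params * bytes_per_param / 1024**3:.0f} GiB for the weights alone")  # ~15 GiB
```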

fbnav (Contributor) commented May 21, 2024

Closing this issue. Feel free to re-open if the issue persists.

fbnav closed this as completed May 21, 2024