
Fine-tuning llama3.1-instruction on Windows: error after starting training, fine-tuning fails to launch #6725

Open · 1 task done

LJXCMQ opened this issue Jan 21, 2025 · 1 comment
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
LJXCMQ commented Jan 21, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

(llama_factory) PS D:\Ljx\Llama-finetuning> llamafactory-cli env
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).

  • llamafactory version: 0.9.2.dev0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.16
  • PyTorch version: 2.2.2+cu121 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
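
The [c10d] warning in the env output above already points at both halves of the failure: PyTorch's Windows wheels are built without NCCL, and with Docker Desktop installed the loopback address can reverse-resolve to kubernetes.docker.internal through the hosts file. A minimal diagnostic sketch, assuming nothing beyond stock PyTorch:

import socket

import torch.distributed as dist

# NCCL is Linux-only; Windows wheels are expected to report False here.
print("NCCL available:", dist.is_nccl_available())
# Gloo is the backend that does ship in the Windows builds.
print("Gloo available:", dist.is_gloo_available())
# With Docker Desktop installed, this reverse lookup often returns
# "kubernetes.docker.internal" (a hosts-file entry), which is where the
# [c10d] socket warning gets its hostname from.
print("127.0.0.1 reverse-resolves to:", socket.gethostbyaddr("127.0.0.1")[0])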

Reproduction

(llama_factory) D:\L\LLaMA-Factory>llamafactory-cli webui
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set share=True in launch().
# --------------- After configuring the dataset in the web UI and clicking Start, the following error occurred ---------------
[INFO|2025-01-20 22:34:25] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:27432
[2025-01-20 22:34:26,079] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING]
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 23, in <module>
    launch()
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 19, in launch
    run_exp()
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 92, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 52, in _training_function
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 182, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 162, in _parse_train_args
    return _parse_args(parser, args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 74, in _parse_args
    return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\hf_argparser.py", line 387, in parse_dict
    obj = dtype(**inputs)
  File "<string>", line 142, in __init__
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\training_args.py", line 47, in __post_init__
    Seq2SeqTrainingArguments.__post_init__(self)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 1764, in __post_init__
    self.device
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2277, in device
    return self._setup_devices
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\utils\generic.py", line 60, in __get__
    cached = self.fget(obj)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2207, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\accelerate\state.py", line 212, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

[2025-01-20 22:34:31,115] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15908) of binary: D:\Anaconda3\envs\llama_factory\python.exe
Traceback (most recent call last):
  File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda3\envs\llama_factory\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 812, in main
    run(args)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
D:\L\LLaMA-Factory\src\llamafactory\launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-20_22:34:31
  host      : DESKTOP-K8BKR7S
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 16304)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-20_22:34:31
  host      : DESKTOP-K8BKR7S
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15908)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Others

Windows environment, single machine with two RTX 4090s. The web UI opens normally, but after configuring the parameters and starting fine-tuning it fails with the error above. Any help would be appreciated; a workaround sketch follows below.
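
The root cause is the last line of the traceback: PyTorch's Windows builds do not include NCCL, so the two-GPU torchrun launch dies as soon as Accelerate tries to create an NCCL process group; gloo is the backend the Windows wheels do ship. A hedged workaround sketch, assuming stock PyTorch (this is not an officially documented LLaMA-Factory fix), showing the two usual options:

import os

import torch.distributed as dist

# Option 1: expose a single GPU so no multi-process launch (and hence no
# NCCL process group) is needed at all.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Option 2: for an actual two-process run, pin the rendezvous to loopback
# (sidesteps the kubernetes.docker.internal lookup) and request gloo
# explicitly instead of the NCCL default.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("backend in use:", dist.get_backend())  # -> "gloo"
dist.destroy_process_group()

Gloo gives slower gradient synchronization than NCCL would on Linux; alternatively, running the same setup under WSL2 with a Linux build of PyTorch restores NCCL support.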

LJXCMQ added the bug and pending labels on Jan 21, 2025
pie2cookie commented

Have you tried to solve this?
raise RuntimeError("Distributed package doesn't have NCCL built in")
