Reminder
I have read the above rules and searched the existing issues.
System Info
(llama_factory) PS D:\Ljx\Llama-finetuning> llamafactory-cli env
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
llamafactory version: 0.9.2.dev0
Platform: Windows-10-10.0.22631-SP0
Python version: 3.10.16
PyTorch version: 2.2.2+cu121 (GPU)
Transformers version: 4.46.1
Datasets version: 3.1.0
Accelerate version: 1.0.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA GeForce RTX 4090
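As a supplementary check (a minimal sketch of my own, not output from llamafactory-cli env), the following snippet reports which torch.distributed backends this PyTorch build ships with; on Windows wheels NCCL is generally not compiled in, while gloo is:

import platform

import torch
import torch.distributed as dist

# Report the parts of the environment relevant to the error below.
print("platform:", platform.platform())
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available(),
      "device count:", torch.cuda.device_count())
# On Windows builds of PyTorch, NCCL is expected to report False and gloo True.
print("nccl available:", dist.is_nccl_available())
print("gloo available:", dist.is_gloo_available())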
Reproduction
(llama_factory) D:\L\LLaMA-Factory>llamafactory-cli webui
* Running on local URL: http://0.0.0.0:7860
To create a public link, set share=True in launch().
# --------------- After configuring the dataset in the WebUI and clicking Start, the following error occurred ---------------
[INFO|2025-01-20 22:34:25] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:27432
[2025-01-20 22:34:26,079] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING]
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 23, in <module>
    launch()
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 19, in launch
    run_exp()
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 92, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 52, in _training_function
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 182, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 162, in _parse_train_args
    return _parse_args(parser, args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 74, in _parse_args
    return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\hf_argparser.py", line 387, in parse_dict
    obj = dtype(**inputs)
  File "<string>", line 142, in __init__
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\training_args.py", line 47, in __post_init__
    Seq2SeqTrainingArguments.__post_init__(self)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 1764, in __post_init__
    self.device
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2277, in device
    return self._setup_devices
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\utils\generic.py", line 60, in __get__
    cached = self.fget(obj)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2207, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\accelerate\state.py", line 212, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2025-01-20 22:34:31,115] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15908) of binary: D:\Anaconda3\envs\llama_factory\python.exe
Traceback (most recent call last):
File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Anaconda3\envs\llama_factory\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
D:\L\LLaMA-Factory\src\llamafactory\launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-20_22:34:31
host : DESKTOP-K8BKR7S
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 16304)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-20_22:34:31
host : DESKTOP-K8BKR7S
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15908)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
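The call that fails in the traceback above is torch.distributed.init_process_group(backend="nccl", ...). A minimal single-process sketch (hypothetical, not taken from LLaMA-Factory; the MASTER_ADDR/MASTER_PORT values are placeholders) that should reproduce the same RuntimeError on a Windows PyTorch build:

import os

import torch.distributed as dist

# Rendezvous settings for a single local process (placeholder values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Windows wheels of PyTorch do not ship NCCL, so this is expected to raise:
#   RuntimeError: Distributed package doesn't have NCCL built in
dist.init_process_group(backend="nccl", rank=0, world_size=1)

Swapping backend="nccl" for backend="gloo" should initialize without error on the same build, which suggests the failure is specific to the NCCL backend being selected for the multi-GPU launch on Windows.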
Others
On Windows, single machine with two RTX 4090 GPUs: the WebUI opens normally, but after configuring the parameters and starting fine-tuning, the error above is thrown. Any help would be appreciated.