
Two errors: "The server socket has failed to listen on any local network address." and "subprocess.CalledProcessError: Command '['wget', '-P', '/root/.cache/vbench/amt_model', 'https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth']' returned non-zero exit status 4." #97

Closed
@EmmaThompson123

Description

First I ran vbench evaluate --dimension motion_smoothness --videos_path /user-fs/dataset/videos_mini/ --mode=custom_input, and it failed with:

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 153, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 109, in main
    dist_init()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/distributed.py", line 37, in dist_init
    torch.distributed.init_process_group(backend=backend, init_method='env://')
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 73806) of binary: /opt/conda/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_11:56:55
  host      : 47vevg2begmd2-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 73806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
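Errno 98 suggests another process was already bound to port 50001 on this host (possibly a leftover torchrun from a previous run). For reference, something like the following should show which process is holding the port (standard Linux tools, nothing VBench-specific):

ss -ltnp | grep 50001    # list listening TCP sockets together with the owning process
lsof -i :50001           # alternative, if lsof is installed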

Then I prepended MASTER_PORT=29501 to the command: MASTER_PORT=29501 vbench evaluate --dimension motion_smoothness --videos_path /user-fs/dataset/videos_mini/ --mode=custom_input. The port error went away, but it still failed, this time with:

2025-01-07 11:57:19,779 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2025-01-07 11:57:19,779 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
args: Namespace(category=None, dimension=['motion_smoothness'], full_json_dir='/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../VBench_full_info.json', imaging_quality_preprocessing_mode='longer', load_ckpt_from_local=None, mode='custom_input', output_path='./evaluation_results/', prompt='None', prompt_file=None, read_frame=None, videos_path='/user-fs/dataset/videos_mini/')
start evaluation
File /root/.cache/vbench/amt_model/amt-s.pth does not exist. Downloading...
--2025-01-07 11:57:19--  https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth
Resolving huggingface.co (huggingface.co)... 157.240.15.8, 2a03:2880:f127:283:face:b00c:0:25de
Connecting to huggingface.co (huggingface.co)|157.240.15.8|:443... failed: Connection timed out.
Connecting to huggingface.co (huggingface.co)|2a03:2880:f127:283:face:b00c:0:25de|:443... failed: Network is unreachable.
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 153, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 139, in main
    my_VBench.evaluate(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/__init__.py", line 141, in evaluate
    submodules_dict = init_submodules(dimension_list, local=local, read_frame=read_frame)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/utils.py", line 275, in init_submodules
    subprocess.run(wget_command, check=True)
  File "/opt/conda/envs/openmmlab/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wget', '-P', '/root/.cache/vbench/amt_model', 'https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth']' returned non-zero exit status 4.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 74247) of binary: /opt/conda/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_11:59:28
  host      : 47vevg2begmd2-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 74247)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I tried executing export HF_ENDPOINT=https://hf-mirror.com, but I still get the errors above.
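From the traceback, init_submodules in vbench/utils.py appears to call wget with the full huggingface.co URL, so HF_ENDPOINT presumably has no effect on that download. As a workaround I am considering fetching the checkpoint manually into the cache path shown in the log; the hf-mirror.com URL below is my assumption that the mirror keeps the same repo layout as huggingface.co:

mkdir -p /root/.cache/vbench/amt_model
# assumption: hf-mirror.com exposes the same path as huggingface.co for this repo
wget -P /root/.cache/vbench/amt_model https://hf-mirror.com/lalala125/AMT/resolve/main/amt-s.pth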
