
Two errors: "The server socket has failed to listen on any local network address." and "subprocess.CalledProcessError: Command '['wget', '-P', '/root/.cache/vbench/amt_model', 'https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth']' returned non-zero exit status 4." #97

Open
EmmaThompson123 opened this issue Jan 7, 2025 · 1 comment


@EmmaThompson123

First, I ran vbench evaluate --dimension motion_smoothness --videos_path /user-fs/dataset/videos_mini/ --mode=custom_input,
and it failed with the following error:

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 153, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 109, in main
    dist_init()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/distributed.py", line 37, in dist_init
    torch.distributed.init_process_group(backend=backend, init_method='env://')
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 73806) of binary: /opt/conda/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_11:56:55
  host      : 47vevg2begmd2-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 73806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
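As a side note, the process already holding the default rendezvous port 50001 can be identified with either of the following commands, assuming ss (from iproute2) or lsof is available in the container:

ss -ltnp | grep 50001
lsof -i :50001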

Then I added MASTER_PORT=29501 in front of the command: MASTER_PORT=29501 vbench evaluate --dimension motion_smoothness --videos_path /user-fs/dataset/videos_mini/ --mode=custom_input,
but it still throws an error:

2025-01-07 11:57:19,779 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2025-01-07 11:57:19,779 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
args: Namespace(category=None, dimension=['motion_smoothness'], full_json_dir='/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../VBench_full_info.json', imaging_quality_preprocessing_mode='longer', load_ckpt_from_local=None, mode='custom_input', output_path='./evaluation_results/', prompt='None', prompt_file=None, read_frame=None, videos_path='/user-fs/dataset/videos_mini/')
start evaluation
File /root/.cache/vbench/amt_model/amt-s.pth does not exist. Downloading...
--2025-01-07 11:57:19--  https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth
Resolving huggingface.co (huggingface.co)... 157.240.15.8, 2a03:2880:f127:283:face:b00c:0:25de
Connecting to huggingface.co (huggingface.co)|157.240.15.8|:443... failed: Connection timed out.
Connecting to huggingface.co (huggingface.co)|2a03:2880:f127:283:face:b00c:0:25de|:443... failed: Network is unreachable.
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 153, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py", line 139, in main
    my_VBench.evaluate(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/__init__.py", line 141, in evaluate
    submodules_dict = init_submodules(dimension_list, local=local, read_frame=read_frame)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/utils.py", line 275, in init_submodules
    subprocess.run(wget_command, check=True)
  File "/opt/conda/envs/openmmlab/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wget', '-P', '/root/.cache/vbench/amt_model', 'https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth']' returned non-zero exit status 4.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 74247) of binary: /opt/conda/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/openmmlab/lib/python3.8/site-packages/vbench/cli/../launch/evaluate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_11:59:28
  host      : 47vevg2begmd2-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 74247)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I also tried executing export HF_ENDPOINT=https://hf-mirror.com, but I still got the errors above.

@NattapolChan
Collaborator

It looks like it failed while downloading the AMT model. Can you try downloading the model separately and then running the command again?

wget https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth -P /root/.cache/vbench/amt_model
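If huggingface.co itself is unreachable from your machine, note that HF_ENDPOINT only affects downloads that go through the huggingface_hub library, not this hard-coded wget URL, so pointing wget at the mirror directly may work (assuming hf-mirror.com mirrors that repository at the same path):

wget https://hf-mirror.com/lalala125/AMT/resolve/main/amt-s.pth -P /root/.cache/vbench/amt_model

Once amt-s.pth exists under /root/.cache/vbench/amt_model, the existence check in init_submodules should skip the download step.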
