ValueError: Default process group has not been initialized in maisi diffusion train #1850

KumoLiu opened this issue Sep 30, 2024
KumoLiu commented Sep 30, 2024

Executing Cell 19--------------------------------------
INFO:notebook:Training the model...


INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
INFO:training:[config] data_root -> ./temp_work_dir/./embeddings.
INFO:training:[config] data_list -> ./temp_work_dir/sim_datalist.json.
INFO:training:[config] lr -> 0.0001.
INFO:training:[config] num_epochs -> 2.
INFO:training:[config] num_train_timesteps -> 1000.
INFO:training:num_files_train: 2
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/toolkit/tutorials/monai/generation/maisi/scripts/diff_model_train.py", line 434, in <module>
    diff_model_train(args.env_config, args.model_config, args.model_def)
  File "/opt/toolkit/tutorials/monai/generation/maisi/scripts/diff_model_train.py", line 355, in diff_model_train
    data=train_files, shuffle=True, num_partitions=dist.get_world_size(), even_divisible=True
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2002, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 987, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1151, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
E0930 01:33:38.255000 140127265727104 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: 1) local_rank: 0 (pid: 383) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.diff_model_train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-30_01:33:38
  host      : ipp2-0112.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 383)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
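The traceback shows that `diff_model_train.py` calls `dist.get_world_size()` while partitioning the training files, but the default process group was never initialized in this single-GPU notebook run. As a minimal sketch (illustrative helper names, not necessarily the fix adopted in the tutorial), the distributed calls could be guarded so the script also works without an initialized process group:

```python
# Sketch only: fall back to single-process defaults when no default
# process group exists, and initialize one only under torchrun.
import os

import torch.distributed as dist


def get_world_size_safe() -> int:
    """Return the world size, or 1 if no process group is initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1


def maybe_init_process_group() -> None:
    """Initialize the default process group only when launched via torchrun."""
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ and not dist.is_initialized():
        # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for env:// init
        dist.init_process_group(backend="nccl")
```

With such a guard, `partition_dataset(..., num_partitions=get_world_size_safe(), ...)` would degrade gracefully to a single partition when the script is run outside a distributed launch.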
KumoLiu added a commit to KumoLiu/tutorials that referenced this issue Sep 30, 2024
Signed-off-by: YunLiu <[email protected]>