```
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...
INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
INFO:training:[config] data_root -> ./temp_work_dir/./embeddings.
INFO:training:[config] data_list -> ./temp_work_dir/sim_datalist.json.
INFO:training:[config] lr -> 0.0001.
INFO:training:[config] num_epochs -> 2.
INFO:training:[config] num_train_timesteps -> 1000.
INFO:training:num_files_train: 2
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/toolkit/tutorials/monai/generation/maisi/scripts/diff_model_train.py", line 434, in <module>
    diff_model_train(args.env_config, args.model_config, args.model_def)
  File "/opt/toolkit/tutorials/monai/generation/maisi/scripts/diff_model_train.py", line 355, in diff_model_train
    data=train_files, shuffle=True, num_partitions=dist.get_world_size(), even_divisible=True
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2002, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 987, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1151, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
E0930 01:33:38.255000 140127265727104 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: 1) local_rank: 0 (pid: 383) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts.diff_model_train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-09-30_01:33:38
  host       : ipp2-0112.ipp2u1.colossus.nvidia.com
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 383)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
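The root cause is visible in the first traceback: `scripts/diff_model_train.py` (line 355) passes `num_partitions=dist.get_world_size()` into the data-partitioning call, but no default process group has been initialized at that point, so `torch.distributed` raises the `ValueError`. One way to make such a query safe for single-process runs is to fall back to a world size of 1 when no process group exists. The sketch below is an illustration only; the helper names `get_world_size_safe` and `get_rank_safe` are hypothetical and not taken from the repository or the linked fix.

```python
import torch.distributed as dist


def get_world_size_safe() -> int:
    """World size if a default process group exists, else 1 (single-process run)."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1


def get_rank_safe() -> int:
    """Global rank if a default process group exists, else 0."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


if __name__ == "__main__":
    # Without init_process_group this prints "1 0" instead of raising
    # "Default process group has not been initialized".
    print(get_world_size_safe(), get_rank_safe())
```

With guards like these, the partitioning call can use `num_partitions=get_world_size_safe()` and index the result with `get_rank_safe()`, which behaves identically under multi-GPU launches and degrades gracefully when only one process is running.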
Referenced commit 405291f: "fix Project-MONAI#1850" (Signed-off-by: YunLiu <[email protected]>)
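The complementary approach is to initialize the default process group before any `torch.distributed` query when the script is launched under torchrun. The sketch below assumes torchrun's standard environment variables (RANK, WORLD_SIZE, LOCAL_RANK); the helper name `maybe_init_distributed` is hypothetical, and this is not necessarily what commit 405291f changes.

```python
import os

import torch
import torch.distributed as dist


def maybe_init_distributed() -> int:
    """Initialize the default process group when launched via torchrun.

    torchrun sets RANK/WORLD_SIZE/LOCAL_RANK in the environment; if they are
    absent, assume a plain single-process run and skip initialization.
    Returns the local rank to use for device placement.
    """
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        backend = "nccl" if torch.cuda.is_available() else "gloo"
        dist.init_process_group(backend=backend, init_method="env://")
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)
        return local_rank
    return 0
```

Calling such a helper at the top of the training entry point ensures later calls to `dist.get_world_size()` succeed under `torchrun --nproc_per_node=1` as well as multi-GPU launches.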