
maisi_diff_unet_training_tutorial.ipynb hit random Segmentation fault on H100 #1858

Open
KumoLiu opened this issue Oct 9, 2024 · 1 comment

KumoLiu (Contributor) commented Oct 9, 2024

Executing Cell 19--------------------------------------
INFO:notebook:Training the model...


INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
INFO:training:[config] data_root -> ./temp_work_dir/./embeddings.
INFO:training:[config] data_list -> ./temp_work_dir/sim_datalist.json.
INFO:training:[config] lr -> 0.0001.
INFO:training:[config] num_epochs -> 2.
INFO:training:[config] num_train_timesteps -> 1000.
INFO:training:num_files_train: 2
INFO:training:Training from scratch.
INFO:training:Scaling factor set to 1.159390926361084.
INFO:training:scale_factor -> 1.159390926361084.
INFO:training:torch.set_float32_matmul_precision -> highest.
INFO:training:Epoch 1, lr 0.0001.
WARNING:py.warnings:cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/cudnn/MHA.cpp:667.)

INFO:training:[2024-09-30 10:58:31] epoch 1, iter 1/2, loss: 0.7987, lr: 0.000100000000.
INFO:training:[2024-09-30 10:58:31] epoch 1, iter 2/2, loss: 0.7931, lr: 0.000056250000.
INFO:training:epoch 1 average loss: 0.7959.
INFO:training:Epoch 2, lr 2.5e-05.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 1/2, loss: 0.7952, lr: 0.000025000000.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 2/2, loss: 0.7875, lr: 0.000006250000.
INFO:training:epoch 2 average loss: 0.7913.
[ipp2-0112:385  :0:987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x64616f74)
==== backtrace (tid:    987) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000459e0 __cxa_finalize()  ???:0
 2 0x000000000030ee76 ???()  /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12:0
=================================
E0930 11:15:58.328000 140660921045632 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -11) local_rank: 0 (pid: 385) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
scripts.diff_model_train FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-30_11:15:58
  host      : ipp2-0112.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 385)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 385
=====================================================
KumoLiu added a commit to KumoLiu/tutorials that referenced this issue Oct 9, 2024
Signed-off-by: YunLiu <[email protected]>
KumoLiu (Contributor, Author) commented Oct 9, 2024

After investigating, I found that setting amp=False avoids the issue. I added an amp argument in #1857 as a workaround; a minimal sketch of the idea is below.
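For context, here is a minimal sketch of how an amp flag can gate mixed precision in a PyTorch training step. This is illustrative only, not the tutorial's actual training loop; `model`, `dataloader`, `loss_fn`, `optimizer`, `device`, and the batch keys are placeholders:

```python
import torch

# Hypothetical flag exposed by the notebook/config; amp=False runs in full float32.
amp = False

# GradScaler(enabled=False) becomes a pass-through, so scale()/step()/update()
# behave like a plain full-precision optimizer step.
scaler = torch.cuda.amp.GradScaler(enabled=amp)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # autocast is disabled entirely when amp=False, so the forward and
    # backward passes stay in float32.
    with torch.autocast(device_type="cuda", enabled=amp):
        pred = model(batch["image"].to(device))
        loss = loss_fn(pred, batch["target"].to(device))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```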

cc @dongyang0122

KumoLiu added a commit that referenced this issue Oct 9, 2024
Enhance readme for ddp cases in ldm tutorials
Add amp argument in maisi diffusion training notebook as a workaround
for #1858.

### Checks
- [x] Avoid including large-size files in the PR.
- [ ] Clean up long text outputs from code cells in the notebook.
- [ ] For security purposes, please check the contents and remove any
sensitive info such as user names and private key.
- [ ] Ensure (1) hyperlinks and markdown anchors are working (2) use
relative paths for tutorial repo files (3) put figure and graphs in the
`./figure` folder
- [ ] Notebook runs automatically `./runner.sh -t <path to .ipynb file>`

---------

Signed-off-by: YunLiu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>