
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED when loss.backward() in train.py #1

Open
chiendoanngoc opened this issue Jan 8, 2022 · 3 comments

Comments

@chiendoanngoc

Thanks for your great work — your code is so clean that I could easily understand it.
I just hit an error in train.py at loss.backward(): RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED.
Have you seen this before, and do you have any suggestions for fixing it? Thanks a lot!
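A quick way to narrow this down is a sanity check like the following sketch (standard PyTorch introspection calls; the values in the comments are illustrative, not taken from this report) — it shows whether the GPU and the cuDNN library are visible to PyTorch at all:

```python
import torch

# Environment sanity check before debugging loss.backward():
print(torch.__version__)                    # installed PyTorch build, e.g. 1.9.0+cu111
print(torch.version.cuda)                   # CUDA version the wheel was built against (None on CPU builds)
print(torch.cuda.is_available())            # True if a GPU is visible
print(torch.backends.cudnn.is_available())  # True if the cuDNN library can be loaded
```

If `torch.backends.cudnn.is_available()` is False while a GPU is visible, the installed wheel and the system's CUDA/cuDNN libraries are likely mismatched.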

@amlarraz
Owner

amlarraz commented Jan 10, 2022

Hi @chiendoanngoc! You're welcome! I've faced the same issue and fixed it by using another version of PyTorch. I'm currently using version 1.9.0+cu111, but the right build depends on your CUDA version. You can find all previous PyTorch versions here.

I just updated the README file to avoid confusion about the PyTorch version.
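For reference, the matching install command for that build (as listed on the PyTorch previous-versions page; adjust the cu111 tag to your CUDA version) looks like:

```shell
# Install PyTorch 1.9.0 built against CUDA 11.1 (swap cu111 for your CUDA
# version; see the PyTorch previous-versions page for the full list of builds)
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
```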

@mk-hassan

mk-hassan commented Jul 2, 2022

Hello @amlarraz @chiendoanngoc, I changed the torch version to 1.9.0+cu111 but I still get the same error. I'm using Colab as my working environment.

```
Logdir: ./logs/combination-2_7_2022-18h40m33s
Train epoch: 1:   0%|          | 0/1113 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-13d8a8766d4b> in <module>()
     60         loss = criterion(pred_3, pred_canny, pred_1, pred_2, msk, canny_label)
     61         loss = loss/accumulation_steps
---> 62         loss.backward()
     63         # accumulative gradient
     64         if (i + 1) % accumulation_steps == 0:  # Wait for several backward steps

1 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable._execution_engine.run_backward(
    148         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150
    151

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
```

```python
import torch
print(torch.__version__)  # 1.9.0+cu111
```

@amlarraz
Owner

amlarraz commented Jul 4, 2022

Hi @Twixii99, which CUDA version are you using? Remember that the PyTorch version depends on the CUDA version you're using. If you're using this PyTorch version and the Colab environment is using a CUDA version other than 11.1, PyTorch will give you some errors. To find out which CUDA version you're using, you can run the command !nvidia-smi in a cell. To choose the correct PyTorch version according to your CUDA version, you can visit this page.
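One way to make that comparison concrete (a hypothetical helper, not part of the repo): parse the cuXXX tag out of `torch.__version__` and compare it to the "CUDA Version" that !nvidia-smi reports.

```python
from typing import Optional

def wheel_cuda_version(torch_version: str) -> Optional[str]:
    """Extract the CUDA version from a PyTorch wheel tag,
    e.g. '1.9.0+cu111' -> '11.1'. Returns None for CPU-only builds."""
    _, sep, tag = torch_version.partition("+")
    if not sep or not tag.startswith("cu"):
        return None  # no local tag, or not a CUDA build
    digits = tag[2:]                        # 'cu111' -> '111'
    return f"{digits[:-1]}.{digits[-1]}"    # '111' -> '11.1'
```

Then `wheel_cuda_version(torch.__version__)` should match (or at least be supported by) the CUDA version shown in the nvidia-smi header; for example, a 1.9.0+cu111 wheel expects CUDA 11.1, not 10.x.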
