RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 22) #11

Open
yqwu94 opened this issue Nov 25, 2020 · 3 comments


yqwu94 commented Nov 25, 2020

Hi, I ran into the following CUDA runtime error:
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 22)
I have recently been studying normalizing flows such as Glow, and this SVD problem appears when I try to train Glow from scratch. My guess is that, because Glow calls "tensor.slogdet()" in the affine coupling layer, an SVD decomposition may be involved, which causes the error above.
Specifically, I start with a small learning rate such as 1e-6, and the training loss falls slowly. However, once the learning rate reaches 0.0004, the training loss suddenly jumps to inf and the error above is raised.
How can I avoid this error when training Glow?
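
One option sometimes used to avoid calling a determinant routine on the weight matrix at all is the LU parametrization described in the Glow paper for the invertible 1x1 convolution: W = P L (U + diag(s)), for which log|det W| is simply the sum of log|s_i|. The sketch below only checks that identity numerically against `torch.slogdet`. The helper names `build_lu_weight` and `lu_logabsdet` are hypothetical and are not taken from this repository, and whether this removes the error reported here is an assumption, not something confirmed in the thread.

```python
import torch

def build_lu_weight(perm, lower, upper, s):
    """Assemble W = P @ L @ (U + diag(s)) from its factors (hypothetical helper)."""
    c = s.numel()
    eye = torch.eye(c, dtype=s.dtype)
    P = eye[perm]                              # fixed permutation matrix, det = +/-1
    L = torch.tril(lower, diagonal=-1) + eye   # unit lower-triangular, det = 1
    U = torch.triu(upper, diagonal=1)          # strictly upper-triangular part
    return P @ L @ (U + torch.diag(s))

def lu_logabsdet(s):
    """log|det W| for the parametrization above: sum of log|s_i|."""
    return s.abs().log().sum()

torch.manual_seed(0)
c = 4  # number of channels, chosen only for illustration
perm = torch.randperm(c)
lower = torch.randn(c, c, dtype=torch.float64)
upper = torch.randn(c, c, dtype=torch.float64)
s = torch.randn(c, dtype=torch.float64)

W = build_lu_weight(perm, lower, upper, s)
reference = torch.slogdet(W)[1]   # log|det W| computed from the dense matrix
cheap = lu_logabsdet(s)           # same quantity, read off the diagonal
```

In Glow this term is additionally scaled by the number of spatial positions. The entries of `s` still have to stay away from zero, so this is a change of parametrization and not a guarantee of stability.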

kamenbliznashki (Owner) commented Dec 31, 2020 via email


Naagar commented Jan 5, 2021

Hi,
I'm also facing a similar problem.
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 11)

Dataset: mnist
torchvision 0.8.2
python 3.8.5
PyTorch 1.6.0
module load cudnn/7-cuda-10.0
model: Glow

python -m torch.distributed.launch --nproc_per_node=3 \
    flow_main.py --train \
    --distributed \
    --dataset=mnist \
    --n_levels=3 \
    --depth=32 \
    --width=512 \
    --batch_size=16 \
    --generate \
    --n_epochs=10

Error

  File "flow_main.py", line 489, in train_epoch
    loss.backward()
  File "/home/sandeep.nagar/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/sandeep.nagar/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/sandeep.nagar/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/sandeep.nagar/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 11)
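
Since the traceback shows the failure inside `loss.backward()`, and the first report mentions the loss jumping to inf just before the error, one possible mitigation is to skip the backward pass when the loss is not finite and to clip gradients otherwise. This is a minimal sketch, not the repository's training loop: `guarded_step` and the `max_grad_norm=50.0` default are hypothetical, and the tiny linear model stands in for Glow only to keep the example self-contained.

```python
import torch

def guarded_step(model, optimizer, loss, max_grad_norm=50.0):
    """Run backward and step only when the loss is finite.

    Returns True if the optimizer stepped, False if the batch was skipped.
    """
    optimizer.zero_grad()
    if not torch.isfinite(loss).all():
        return False  # never call backward() on an inf/NaN loss
    loss.backward()
    # Bound the update size; the threshold is an illustrative assumption.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)  # stand-in model, not Glow
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3)

good_loss = model(x).pow(2).mean()
stepped = guarded_step(model, optimizer, good_loss)

bad_loss = model(x).pow(2).mean() * float("inf")  # simulate a diverged loss
skipped = not guarded_step(model, optimizer, bad_loss)
```

With `--distributed`, gradients are synchronized across processes during the backward pass, so a skip decision like this would have to be made identically on every rank; otherwise the remaining ranks can block waiting for the one that skipped.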

pandya6988 commented

Any updates on this issue?
