adam optim ERROR: If capturable=False, state_steps should not be CUDA tensors. #31

Open
yqi19 opened this issue Jul 3, 2022 · 4 comments

yqi19 commented Jul 3, 2022

Hi, congratulations on your excellent work!
I would really appreciate it if you could help me through this.
So I run

PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_0_0 --common.override-kwargs scheduler.cosine.max_lr=0.0075 scheduler.cosine.min_lr=0.00075 optim.weight_decay=0.013 model.classification.mitv2.width_multiplier=1.00 --common.tensorboard-logging --common.accum-freq 4 --common.auto-resume 

which triggers the auto-resume mode to continue my last training, and this error occurs:

2022-07-03 06:06:18 - LOGS    - Exception occurred that interrupted the training. If capturable=False, state_steps should not be CUDA tensors.
If capturable=False, state_steps should not be CUDA tensors.

Traceback (most recent call last):                                                                           
  File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 682, in run
    train_loss, train_ckpt_metric = self.train_epoch(epoch)
  File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 353, in train_epoch
    self.gradient_scalar.step(optimizer=self.optimizer)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285, in _may
be_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorat
e_context
    return func(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
    adamw(params_with_grad,
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
    func(params,
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tenso
r_adamw
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."

And I am 100% sure that cuDNN is enabled and all GPUs are available; nothing went wrong when I first started training.
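
For reference, the resume path restores the optimizer state roughly like this (a minimal sketch, not the exact ml-cvnets code; the file name and checkpoint key are illustrative):

import torch

# Toy model/optimizer so the sketch is self-contained.
model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0075, weight_decay=0.013)

# Conceptually what auto-resume does: load the checkpoint onto the GPU and
# restore the optimizer state from it.
checkpoint = torch.load("checkpoint.pt", map_location="cuda")
optimizer.load_state_dict(checkpoint["optim_state_dict"])

# In PyTorch 1.12.0 the restored per-parameter "step" entries can come back as
# CUDA tensors, which is exactly what the assertion in torch/optim/adamw.py
# complains about when capturable=False (the default).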

And here's another problem: do you have any idea why the training process is slow?
Thanks so much!

yqi19 changed the title from "A problem encountered when loading checkpoint.pt to continue training" to "adam optim ERROR: If capturable=False, state_steps should not be CUDA tensors." on Jul 3, 2022

yqi19 (Author) commented Jul 3, 2022

And here are my versions:

PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.12 (main, Jun  1 2022, 11:38:51)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.4.0-210-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA TITAN Xp
GPU 1: NVIDIA TITAN Xp
GPU 2: NVIDIA TITAN Xp
GPU 3: NVIDIA TITAN Xp

Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.0
[pip3] pytorchvideo==0.1.5
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] numpy                     1.23.0                   pypi_0    pypi
[conda] pytorchvideo              0.1.5                    pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi
[conda] torchvision               0.13.0                   pypi_0    pypi

yqi19 (Author) commented Jul 3, 2022

Now I have updated my CUDA to 11.3, but the result doesn't change.

sacmehta (Collaborator) commented Jul 6, 2022

@yqi19 It seems that the training fails when trying to load the optimizer state. Could you set the capturable=True flag in the AdamW optimizer and see if that resolves the issue?
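
Something along these lines (a minimal sketch of the suggestion, not the exact cvnets code; the model here is just a placeholder):

import torch

model = torch.nn.Linear(4, 4).cuda()  # placeholder model

# Either construct AdamW with the flag directly (available in PyTorch >= 1.12)...
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0075,
    weight_decay=0.013,
    capturable=True,
)

# ...or, if the optimizer has already been built and its state restored from a
# checkpoint, flip the flag on the existing param groups:
for group in optimizer.param_groups:
    group["capturable"] = True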

prsbsvrn commented May 9, 2023

I have the same problem. I tried setting the capturable=True flag in the AdamW optimizer, but nothing changed; I still receive the error: "AssertionError: If capturable=False, state_steps should not be CUDA tensors."
