Hi, congratulations on your excellent work!
I would really appreciate it if you could help me with this.
I ran
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_0_0 --common.override-kwargs scheduler.cosine.max_lr=0.0075 scheduler.cosine.min_lr=0.00075 optim.weight_decay=0.013 model.classification.mitv2.width_multiplier=1.00 --common.tensorboard-logging --common.accum-freq 4 --common.auto-resume
to trigger auto-resume mode and continue my previous training run, and this error occurred:
2022-07-03 06:06:18 - LOGS - Exception occurred that interrupted the training. If capturable=False, state_steps should not be CUDA tensors.
If capturable=False, state_steps should not be CUDA tensors.
Traceback (most recent call last):
File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 682, in run
train_loss, train_ckpt_metric = self.train_epoch(epoch)
File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 353, in train_epoch
self.gradient_scalar.step(optimizer=self.optimizer)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
adamw(params_with_grad,
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
func(params,
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
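In case it helps with diagnosis: the assertion seems to fire because the per-parameter "step" entries in the resumed optimizer state live on the GPU. Below is a minimal sketch of a check-and-move helper I could call after the checkpoint is loaded. The function name is hypothetical and not part of cvnets or PyTorch, and it assumes the argument looks like `optimizer.state` (a mapping from parameter to a dict with a "step" entry). I have not confirmed that this is the right fix.

```python
def move_cuda_steps_to_cpu(per_param_state):
    """Move every CUDA-resident "step" entry to CPU; return how many were moved.

    Assumption: `per_param_state` looks like `optimizer.state`, i.e. a mapping
    from parameter to a dict such as {"step": ..., "exp_avg": ...}.
    """
    moved = 0
    for param_state in per_param_state.values():
        step = param_state.get("step")
        # Plain ints/floats have no `is_cuda`; treat them as already on CPU.
        if getattr(step, "is_cuda", False):
            param_state["step"] = step.cpu()
            moved += 1
    return moved


# Hypothetical usage, after the training engine has resumed the optimizer:
# moved = move_cuda_steps_to_cpu(optimizer.state)
```

A non-zero return value would at least confirm that the resumed state is what trips the assertion in `_single_tensor_adamw`.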
I am 100% sure that cuDNN is enabled and all GPUs are available, and nothing went wrong when I first trained this model.
Here's another problem: do you have any idea why the training process might be slow?
Thanks sooooo much!