When training the model, the loss is nan. #44

Open
CrossEntropy opened this issue Mar 5, 2020 · 3 comments

@CrossEntropy

Hi @shamangary!
I got the following output while training the FSA_net_Var_Capsules model; the loss becomes nan after epoch 3:

6120/6120 [==============================] - 459s 75ms/step - loss: 10.4547 - val_loss: 7.6488

Epoch 00001: val_loss improved from inf to 7.64882, saving model to 300W_LP_checkpoints/weights.01-7.65.hdf5
Epoch 2/90
6120/6120 [==============================] - 425s 69ms/step - loss: 7.2023 - val_loss: 5.7376

Epoch 00002: val_loss improved from 7.64882 to 5.73757, saving model to 300W_LP_checkpoints/weights.02-5.74.hdf5
Epoch 3/90
6120/6120 [==============================] - 442s 72ms/step - loss: 6.0585 - val_loss: 5.1815

Epoch 00003: val_loss improved from 5.73757 to 5.18146, saving model to 300W_LP_checkpoints/weights.03-5.18.hdf5
Epoch 4/90
6120/6120 [==============================] - 431s 70ms/step - loss: nan - val_loss: nan

Epoch 00004: val_loss did not improve from 5.18146
Epoch 5/90
6120/6120 [==============================] - 425s 69ms/step - loss: nan - val_loss: nan

Epoch 00005: val_loss did not improve from 5.18146
Epoch 6/90
6120/6120 [==============================] - 424s 69ms/step - loss: nan - val_loss: nan

Epoch 00006: val_loss did not improve from 5.18146
Epoch 7/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

Epoch 00007: val_loss did not improve from 5.18146
Epoch 8/90
6120/6120 [==============================] - 421s 69ms/step - loss: nan - val_loss: nan

Epoch 00008: val_loss did not improve from 5.18146
Epoch 9/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

The same phenomenon also appears in a model I built myself; it only replaces the ssr_G_model_build part.
Thanks for your help!

@CrossEntropy (Author) commented Mar 5, 2020

When I use TensorFlow 2.0 with the batch size set to 128, the nan still appears, but the model still works on faces. This is really amazing. ToT
I suspect it may be a problem with the score function's processing method. As you described in your paper, there are three methods:

(1) variance
(2) 1x1 convolution
(3) uniform

I think the variance method reduces the number of parameters, so I chose it. Looking forward to your reply!
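For reference, this is roughly how I picture the three options (a hypothetical sketch with made-up function names, not the actual ssr_G_model_build code), which is why the variance option should add no trainable parameters:

```python
# Hypothetical sketch of the three scoring options (TF 2.x assumed, made-up names).
# Only the 1x1 convolution variant has trainable weights.
import tensorflow as tf

def score_variance(feat):
    # feat: (batch, h, w, c) -> (batch, h, w, 1); no trainable parameters
    _, var = tf.nn.moments(feat, axes=[-1], keepdims=True)
    return var

def score_conv1x1(feat):
    # learned score: c weights + 1 bias from the 1x1 convolution
    return tf.keras.layers.Conv2D(1, kernel_size=1)(feat)

def score_uniform(feat):
    # constant score; no trainable parameters
    return tf.ones_like(feat[..., :1])

x = tf.random.uniform((2, 8, 8, 16))
print(score_variance(x).shape)  # (2, 8, 8, 1)
```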

@shamangary (Owner)

Hello @CrossEntropy,

It's been a long time since I ran this repo. My suggestion is to use a smaller batch size, such as 32 or 16, and a lower version of TensorFlow and Keras, since they have been updated recently.
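For example, a minimal stand-alone sketch (dummy model and data, TF 2.x style, not the repo's actual training script) of what passing a smaller batch size looks like:

```python
# Minimal stand-alone sketch with a dummy model and data, only to show the
# batch_size argument; the real training script passes it to its own fit call.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3),  # e.g. a yaw/pitch/roll regression head
])
model.compile(optimizer="adam", loss="mae")

x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 3).astype("float32")

model.fit(x, y, batch_size=32, epochs=1)  # 32 or 16 instead of 128
```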

@shamangary (Owner)

tensorflow/tensorflow#3290
tensorflow/tensorflow#8101
It seems like tf.nn.moments could possibly return NaN. You may pick the NaNs out of the variance and put zeros back in. I assume this would solve the issue.
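A minimal sketch of that workaround (TF 2.x assumed, not code from this repo): compute the variance with tf.nn.moments and zero out any NaN entries before using it.

```python
# Minimal sketch: replace NaNs in the variance returned by tf.nn.moments with zeros.
import tensorflow as tf

def safe_variance(x, axes=(-1,)):
    _, var = tf.nn.moments(x, axes=list(axes), keepdims=True)
    return tf.where(tf.math.is_nan(var), tf.zeros_like(var), var)

# Quick check: a constant row has zero variance; any NaN would also come back as 0.
x = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]])
print(safe_variance(x))  # [[0.6667], [0.]]
```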
