When training the model, the loss is nan. #44

Open
CrossEntropy opened this issue Mar 5, 2020 · 3 comments

@CrossEntropy

Hi @shamangary!
I got the following output while training the FSA_net_Var_Capsules model; the loss becomes nan after epoch 3:

6120/6120 [==============================] - 459s 75ms/step - loss: 10.4547 - val_loss: 7.6488

Epoch 00001: val_loss improved from inf to 7.64882, saving model to 300W_LP_checkpoints/weights.01-7.65.hdf5
Epoch 2/90
6120/6120 [==============================] - 425s 69ms/step - loss: 7.2023 - val_loss: 5.7376

Epoch 00002: val_loss improved from 7.64882 to 5.73757, saving model to 300W_LP_checkpoints/weights.02-5.74.hdf5
Epoch 3/90
6120/6120 [==============================] - 442s 72ms/step - loss: 6.0585 - val_loss: 5.1815

Epoch 00003: val_loss improved from 5.73757 to 5.18146, saving model to 300W_LP_checkpoints/weights.03-5.18.hdf5
Epoch 4/90
6120/6120 [==============================] - 431s 70ms/step - loss: nan - val_loss: nan

Epoch 00004: val_loss did not improve from 5.18146
Epoch 5/90
6120/6120 [==============================] - 425s 69ms/step - loss: nan - val_loss: nan

Epoch 00005: val_loss did not improve from 5.18146
Epoch 6/90
6120/6120 [==============================] - 424s 69ms/step - loss: nan - val_loss: nan

Epoch 00006: val_loss did not improve from 5.18146
Epoch 7/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

Epoch 00007: val_loss did not improve from 5.18146
Epoch 8/90
6120/6120 [==============================] - 421s 69ms/step - loss: nan - val_loss: nan

Epoch 00008: val_loss did not improve from 5.18146
Epoch 9/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

The same phenomenon also appears in a model I built myself; it only replaces the ssr_G_model_build part.
Thanks for your help!

@CrossEntropy (Author) commented Mar 5, 2020

When I use TensorFlow 2.0 with the batch size set to 128, the nan still appears, but the model still works on faces. This is really amazing. ToT
I suspect it may be a problem with the score function's processing method. As you described in your paper, there are three methods:

(1) variance
(2) 1x1 convolution
(3) uniform

I think the variance method reduces the number of parameters, so I chose it. Looking forward to your reply!
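For reference, this is roughly how I picture the three options (a hypothetical sketch with made-up function names, not the actual ssr_G_model_build code), which is why the variance option should add no trainable parameters:

```python
# Hypothetical sketch of the three scoring options (TF 2.x assumed, made-up names).
# Only the 1x1 convolution variant has trainable weights.
import tensorflow as tf

def score_variance(feat):
    # feat: (batch, h, w, c) -> (batch, h, w, 1); no trainable parameters
    _, var = tf.nn.moments(feat, axes=[-1], keepdims=True)
    return var

def score_conv1x1(feat):
    # learned score: c weights + 1 bias from the 1x1 convolution
    return tf.keras.layers.Conv2D(1, kernel_size=1)(feat)

def score_uniform(feat):
    # constant score; no trainable parameters
    return tf.ones_like(feat[..., :1])

x = tf.random.uniform((2, 8, 8, 16))
print(score_variance(x).shape)  # (2, 8, 8, 1)
```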

@shamangary (Owner)

Hello @CrossEntropy,

It's been a long time since I ran this repo. My suggestion is to use a smaller batch size, such as 32 or 16, and a lower version of TensorFlow and Keras, since they have been updated recently.
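For example, a minimal stand-alone sketch (dummy model and data, TF 2.x style, not the repo's actual training script) of what passing a smaller batch size looks like:

```python
# Minimal stand-alone sketch with a dummy model and data, only to show the
# batch_size argument; the real training script passes it to its own fit call.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3),  # e.g. a yaw/pitch/roll regression head
])
model.compile(optimizer="adam", loss="mae")

x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 3).astype("float32")

model.fit(x, y, batch_size=32, epochs=1)  # 32 or 16 instead of 128
```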

@shamangary (Owner)

tensorflow/tensorflow#3290
tensorflow/tensorflow#8101
It seems like tf.nn.moments could possibly return NaN. You may pick the NaNs out of the variance and put zeros back in. I assume this would solve the issue.
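A minimal sketch of that workaround (TF 2.x assumed, not code from this repo): compute the variance with tf.nn.moments and zero out any NaN entries before using it.

```python
# Minimal sketch: replace NaNs in the variance returned by tf.nn.moments with zeros.
import tensorflow as tf

def safe_variance(x, axes=(-1,)):
    _, var = tf.nn.moments(x, axes=list(axes), keepdims=True)
    return tf.where(tf.math.is_nan(var), tf.zeros_like(var), var)

# Quick check: a constant row has zero variance; any NaN would also come back as 0.
x = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]])
print(safe_variance(x))  # [[0.6667], [0.]]
```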
