Loss increasing with binary_cross_entropy above version 0.4.1 #567

Closed
preciz opened this issue Apr 15, 2024 · 6 comments

preciz (Contributor) commented Apr 15, 2024

I wanted to retrain a model with new data, but the loss was increasing even though we hadn't changed the code.

Previously discussed on Elixir Forum:
https://elixirforum.com/t/model-no-longer-learning-any-ideas/62883/6

Notebooks reproducing the issue with data:
https://github.com/preciz/not_learning

Surprisingly, the solution was either to revert Axon to version 0.4.1, or to switch the loss function from binary_cross_entropy to categorical_cross_entropy and change the output activation to softmax.
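
For reference, the two setups look roughly like this (layer sizes and input shape below are placeholders, not the values from the linked notebooks):

```elixir
# Setup that stops learning on newer Axon versions:
# single sigmoid output trained with binary cross-entropy.
model_bce =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)

loop_bce = Axon.Loop.trainer(model_bce, :binary_cross_entropy, :adam)

# Workaround: two-class softmax output trained with categorical cross-entropy
# (labels must then be one-hot encoded, e.g. [0, 1] instead of 1).
model_cce =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(2, activation: :softmax)

loop_cce = Axon.Loop.trainer(model_cce, :categorical_cross_entropy, :adam)
```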

Since the exact same model learns fine on Axon 0.4.1, I wonder if this is a bug that deserves an issue.

seanmor5 (Contributor)

Ah yeah, I have had issues with BCE in the past. I will look into what's going on and see if I can add an equivalence test for BCE and CCE with n = 2 classes. I'm not sure if it's the actual output or an issue with the gradient.
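
For illustration, a rough sketch of the kind of check I mean (tensor values are made up; this is not the actual test):

```elixir
y_true_bin = Nx.tensor([[1.0], [0.0], [1.0]])
y_pred_bin = Nx.tensor([[0.9], [0.2], [0.7]])

# Same targets/predictions expressed as two-class one-hot distributions.
y_true_cat = Nx.tensor([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_pred_cat = Nx.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])

bce = Axon.Losses.binary_cross_entropy(y_true_bin, y_pred_bin, reduction: :mean)
cce = Axon.Losses.categorical_cross_entropy(y_true_cat, y_pred_cat, reduction: :mean)

# With n = 2 classes the two losses should agree to within numerical tolerance.
IO.inspect({bce, cce})
```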

dmaze commented May 8, 2024

If it helps narrow down the set of changes at all, I'm having a similar problem with :binary_cross_entropy, and as a workaround it's enough to pin Axon to 0.6.0; it seems to be specifically a difference between 0.6.0 and 0.6.1.

seanmor5 (Contributor)

@dmaze Interesting find! I'm looking into this. There shouldn't have been any changes between 0.6.0 and 0.6.1 that impact binary cross entropy. One possibility worth checking is whether something changed in Nx. Can you check whether pinning Nx to 0.6 instead of Nx 0.7 fixes the issue?
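
Something like this in your mix.exs deps would do it (the exact version constraints here are just an example):

```elixir
defp deps do
  [
    {:axon, "~> 0.6.1"},
    # Pin Nx (and EXLA, if used) to the 0.6 series instead of 0.7.
    {:nx, "~> 0.6.0"},
    {:exla, "~> 0.6.0"}
  ]
end
```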

dmaze commented May 11, 2024

Staying on the same version of every other library and changing only Axon 0.6.0 to 0.6.1, I still see the issue. My application reports the binary cross entropy loss and the precision and recall metrics during training; on Axon 0.6.0 the loss generally decreases and precision and recall fairly quickly increase, but on Axon 0.6.1 the loss increases and precision and recall quickly go to zero.

Starting from https://github.com/dmaze/filer/blob/5c2df14a330681a65bcb2d36ab7b00894801098e/mix.lock, if I only change the Axon version:

diff --git a/mix.lock b/mix.lock
index 15bda15..eba65b9 100644
--- a/mix.lock
+++ b/mix.lock
@@ -1,5 +1,5 @@
 %{
-  "axon": {:hex, :axon, "0.6.0", "fd7560079581e4cedebaf0cd5f741d6ac3516d06f204ebaf1283b1093bf66ff6", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "204e7aeb50d231a30b25456adf17bfbaae33fe7c085e03793357ac3bf62fd853"},
+  "axon": {:hex, :axon, "0.6.1", "1d042fdba1c1b4413a3d65800524feebd1bc8ed218f8cdefe7a97510c3f427f3", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0 or ~> 0.7.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "d6b0ae2f0dd284f6bf702edcab71e790d6c01ca502dd06c4070836554f5a48e1"},
   "briefly": {:hex, :briefly, "0.5.1", "ee10d48da7f79ed2aebdc3e536d5f9a0c3e36ff76c0ad0d4254653a152b13a8a", [:mix], [], "hexpm", "bd684aa92ad8b7b4e0d92c31200993c4bc1469fc68cd6d5f15144041bd15cb57"},
   "bunt": {:hex, :bunt, "1.0.0", "081c2c665f086849e6d57900292b3a161727ab40431219529f13c4ddcf3e7a44", [:mix], [], "hexpm", "dc5f86aa08a5f6fa6b8096f0735c4e76d54ae5c9fa2c143e5a1fc7c1cd9bb6b5"},
   "castore": {:hex, :castore, "1.0.7", "b651241514e5f6956028147fe6637f7ac13802537e895a724f90bf3e36ddd1dd", [:mix], [], "hexpm", "da7785a4b0d2a021cd1292a60875a784b6caef71e76bf4917bdee1f390455cf5"},

So: Axon 0.6.0 -> 0.6.1, with both setups using Nx 0.6.4, EXLA 0.6.4, and XLA 0.5.1.

(The application doesn't have any test data, so it might be tricky for you to run independently, but if you have hundreds of scanned PDF files lying around and don't mind manually labeling them, there's a UI control to start the training loop.)

seanmor5 (Contributor)

This has been addressed. I had made a change to add op metadata to layers, which was causing BCE to take an invalid path. FWIW, another fix would be to use the from_logits: true version of BCE and remove the final :sigmoid activation in your model. That will always take the stable path.
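
A minimal sketch of that workaround (layer sizes are placeholders): drop the trailing :sigmoid and pass a loss that treats the model output as logits:

```elixir
model =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  # no :sigmoid here; the model outputs raw logits
  |> Axon.dense(1)

loop =
  Axon.Loop.trainer(
    model,
    &Axon.Losses.binary_cross_entropy(&1, &2, from_logits: true, reduction: :mean),
    :adam
  )
```

At inference time you then apply a sigmoid yourself (e.g. Nx.sigmoid/1) to turn the logits back into probabilities.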

preciz (Contributor, issue author) commented May 14, 2024

Thank you!
