Loss increasing with binary_cross_entropy above version 0.4.1 #567

Closed
preciz opened this issue Apr 15, 2024 · 6 comments

preciz (Contributor) commented Apr 15, 2024

I wanted to retrain a model with new data, but the loss was increasing even though we hadn't changed the code.

Previously discussed on Elixir Forum:
https://elixirforum.com/t/model-no-longer-learning-any-ideas/62883/6

Notebooks reproducing the issue with data:
https://github.com/preciz/not_learning

Surprisingly, the solution was either to revert Axon to version 0.4.1, or to switch the loss function from binary_cross_entropy to categorical_cross_entropy and change the output activation to softmax.
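
For reference, the two setups look roughly like this (layer sizes and input shape below are placeholders, not the values from the linked notebooks):

```elixir
# Setup that stops learning on newer Axon versions:
# single sigmoid output trained with binary cross-entropy.
model_bce =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)

loop_bce = Axon.Loop.trainer(model_bce, :binary_cross_entropy, :adam)

# Workaround: two-class softmax output trained with categorical cross-entropy
# (labels must then be one-hot encoded, e.g. [0, 1] instead of 1).
model_cce =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(2, activation: :softmax)

loop_cce = Axon.Loop.trainer(model_cce, :categorical_cross_entropy, :adam)
```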

Since the exact same model learns fine on Axon 0.4.1, I wonder if this is a bug that deserves an issue.

seanmor5 (Contributor)

Ah yeah, I have had issues with BCE in the past. I will look into what's going on and see if I can add an equivalence test for BCE and CCE with n = 2 classes. I'm not sure if it's the actual output or an issue with the gradient.
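
For illustration, a rough sketch of the kind of check I mean (tensor values are made up; this is not the actual test):

```elixir
y_true_bin = Nx.tensor([[1.0], [0.0], [1.0]])
y_pred_bin = Nx.tensor([[0.9], [0.2], [0.7]])

# Same targets/predictions expressed as two-class one-hot distributions.
y_true_cat = Nx.tensor([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_pred_cat = Nx.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])

bce = Axon.Losses.binary_cross_entropy(y_true_bin, y_pred_bin, reduction: :mean)
cce = Axon.Losses.categorical_cross_entropy(y_true_cat, y_pred_cat, reduction: :mean)

# With n = 2 classes the two losses should agree to within numerical tolerance.
IO.inspect({bce, cce})
```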

dmaze commented May 8, 2024

If it helps narrow down the set of changes at all, I'm having a similar problem with :binary_cross_entropy, and as a workaround it's enough to pin Axon to 0.6.0; it seems to be specifically a difference between 0.6.0 and 0.6.1.

seanmor5 (Contributor)

@dmaze Interesting find! I'm looking into this. There shouldn't have been any changes between 0.6.0 and 0.6.1 that impact binary cross entropy. One possibility worth checking is whether something changed in Nx. Can you check whether pinning Nx to 0.6 instead of Nx 0.7 fixes the issue?
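
Something like this in your mix.exs deps would do it (the exact version constraints here are just an example):

```elixir
defp deps do
  [
    {:axon, "~> 0.6.1"},
    # Pin Nx (and EXLA, if used) to the 0.6 series instead of 0.7.
    {:nx, "~> 0.6.0"},
    {:exla, "~> 0.6.0"}
  ]
end
```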

dmaze commented May 11, 2024

Staying on the same version of every other library and changing only Axon 0.6.0 to 0.6.1, I still see the issue. My application reports the binary cross entropy loss and the precision and recall metrics during training; on Axon 0.6.0 the loss generally decreases and precision and recall fairly quickly increase, but on Axon 0.6.1 the loss increases and precision and recall quickly go to zero.

Starting from https://github.com/dmaze/filer/blob/5c2df14a330681a65bcb2d36ab7b00894801098e/mix.lock, if I only change the Axon version:

diff --git a/mix.lock b/mix.lock
index 15bda15..eba65b9 100644
--- a/mix.lock
+++ b/mix.lock
@@ -1,5 +1,5 @@
 %{
-  "axon": {:hex, :axon, "0.6.0", "fd7560079581e4cedebaf0cd5f741d6ac3516d06f204ebaf1283b1093bf66ff6", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "204e7aeb50d231a30b25456adf17bfbaae33fe7c085e03793357ac3bf62fd853"},
+  "axon": {:hex, :axon, "0.6.1", "1d042fdba1c1b4413a3d65800524feebd1bc8ed218f8cdefe7a97510c3f427f3", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0 or ~> 0.7.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "d6b0ae2f0dd284f6bf702edcab71e790d6c01ca502dd06c4070836554f5a48e1"},
   "briefly": {:hex, :briefly, "0.5.1", "ee10d48da7f79ed2aebdc3e536d5f9a0c3e36ff76c0ad0d4254653a152b13a8a", [:mix], [], "hexpm", "bd684aa92ad8b7b4e0d92c31200993c4bc1469fc68cd6d5f15144041bd15cb57"},
   "bunt": {:hex, :bunt, "1.0.0", "081c2c665f086849e6d57900292b3a161727ab40431219529f13c4ddcf3e7a44", [:mix], [], "hexpm", "dc5f86aa08a5f6fa6b8096f0735c4e76d54ae5c9fa2c143e5a1fc7c1cd9bb6b5"},
   "castore": {:hex, :castore, "1.0.7", "b651241514e5f6956028147fe6637f7ac13802537e895a724f90bf3e36ddd1dd", [:mix], [], "hexpm", "da7785a4b0d2a021cd1292a60875a784b6caef71e76bf4917bdee1f390455cf5"},

So: Axon 0.6.0 -> 0.6.1, with both setups using Nx 0.6.4, EXLA 0.6.4, and XLA 0.5.1.

(The application doesn't have any test data, so it might be tricky for you to run independently, but if you have hundreds of scanned PDF files lying around and don't mind manually labeling them, there's a UI control to start the training loop.)

seanmor5 (Contributor)

This has been addressed. I had made a change to add op metadata to layers, which was causing BCE to take an invalid path. FWIW, another fix would be to use the from_logits: true version of BCE and remove the final :sigmoid activation in your model. That will always take the stable path.
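
A minimal sketch of that workaround (layer sizes are placeholders): drop the trailing :sigmoid and pass a loss that treats the model output as logits:

```elixir
model =
  Axon.input("features", shape: {nil, 8})
  |> Axon.dense(16, activation: :relu)
  # no :sigmoid here; the model outputs raw logits
  |> Axon.dense(1)

loop =
  Axon.Loop.trainer(
    model,
    &Axon.Losses.binary_cross_entropy(&1, &2, from_logits: true, reduction: :mean),
    :adam
  )
```

At inference time you then apply a sigmoid yourself (e.g. Nx.sigmoid/1) to turn the logits back into probabilities.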

preciz (Contributor, issue author) commented May 14, 2024

Thank you!
