Loss increasing with binary_cross_entropy above version 0.4.1 #567
Comments
Ah yeah, I have had issues with BCE in the past. I will look into what's going on and see if I can add an equivalence test for BCE and CCE with n=2 classes. I'm not sure if it's the actual output or an issue with the gradient.
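The equivalence test mentioned above can be sketched numerically. This is a Python/NumPy illustration only, not Axon's Elixir/Nx API; the two loss functions here are hypothetical reference implementations. The idea is that BCE on sigmoid outputs should match CCE on the equivalent 2-class softmax formulation:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean BCE over a batch of sigmoid outputs in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-7):
    """Mean CCE over a batch of probability rows that sum to 1."""
    y_pred_probs = np.clip(y_pred_probs, eps, 1 - eps)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred_probs), axis=-1))

# Binary labels and sigmoid-style predictions
y_true = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.7, 0.6])
bce = binary_cross_entropy(y_true, p)

# Same problem rephrased as 2-class categorical:
# class 0 gets probability (1 - p), class 1 gets p
y_true_2 = np.stack([1 - y_true, y_true], axis=-1)
p_2 = np.stack([1 - p, p], axis=-1)
cce = categorical_cross_entropy(y_true_2, p_2)

print(np.isclose(bce, cce))
```

If a library's BCE diverges from CCE on this construction, the bug is in the forward output; if the values match but training still differs, the gradient path is the suspect.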
If it helps narrow down the set of changes at all, I'm having a similar problem with …
@dmaze Interesting find! I'm looking into this. There shouldn't have been any changes between 0.6 and 0.6.1 that impact binary cross entropy. One possible thing to look into is if maybe it was something in Nx. Can you check to see if pinning to Nx …
Staying on the same version of every library other than Axon, and changing Axon 0.6.0 to 0.6.1, I still seem to have the issue. My application reports the binary cross entropy loss and precision and recall metrics during training; on Axon 0.6.0 the loss generally decreases and the precision and recall fairly quickly increase, but on Axon 0.6.1 the loss increases and precision and recall quickly go to zero. Starting from https://github.com/dmaze/filer/blob/5c2df14a330681a65bcb2d36ab7b00894801098e/mix.lock if I only change the Axon version:

```diff
diff --git a/mix.lock b/mix.lock
index 15bda15..eba65b9 100644
--- a/mix.lock
+++ b/mix.lock
@@ -1,5 +1,5 @@
 %{
-  "axon": {:hex, :axon, "0.6.0", "fd7560079581e4cedebaf0cd5f741d6ac3516d06f204ebaf1283b1093bf66ff6", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "204e7aeb50d231a30b25456adf17bfbaae33fe7c085e03793357ac3bf62fd853"},
+  "axon": {:hex, :axon, "0.6.1", "1d042fdba1c1b4413a3d65800524feebd1bc8ed218f8cdefe7a97510c3f427f3", [:mix], [{:kino, "~> 0.7", [hex: :kino, repo: "hexpm", optional: true]}, {:kino_vega_lite, "~> 0.1.7", [hex: :kino_vega_lite, repo: "hexpm", optional: true]}, {:nx, "~> 0.6.0 or ~> 0.7.0", [hex: :nx, repo: "hexpm", optional: false]}, {:polaris, "~> 0.1", [hex: :polaris, repo: "hexpm", optional: false]}, {:table_rex, "~> 3.1.1", [hex: :table_rex, repo: "hexpm", optional: true]}], "hexpm", "d6b0ae2f0dd284f6bf702edcab71e790d6c01ca502dd06c4070836554f5a48e1"},
   "briefly": {:hex, :briefly, "0.5.1", "ee10d48da7f79ed2aebdc3e536d5f9a0c3e36ff76c0ad0d4254653a152b13a8a", [:mix], [], "hexpm", "bd684aa92ad8b7b4e0d92c31200993c4bc1469fc68cd6d5f15144041bd15cb57"},
   "bunt": {:hex, :bunt, "1.0.0", "081c2c665f086849e6d57900292b3a161727ab40431219529f13c4ddcf3e7a44", [:mix], [], "hexpm", "dc5f86aa08a5f6fa6b8096f0735c4e76d54ae5c9fa2c143e5a1fc7c1cd9bb6b5"},
   "castore": {:hex, :castore, "1.0.7", "b651241514e5f6956028147fe6637f7ac13802537e895a724f90bf3e36ddd1dd", [:mix], [], "hexpm", "da7785a4b0d2a021cd1292a60875a784b6caef71e76bf4917bdee1f390455cf5"},
```

So, Axon 0.6.0 -> 0.6.1, both setups with Nx 0.6.4, EXLA 0.6.4, XLA 0.5.1. (The application doesn't have any test data so it might be tricky for you to run independently, but if you have hundreds of scanned PDF files lying around and don't mind manually labeling them, there's a UI control to start the training loop.)
This has been addressed. I had made a change to add op metadata to layers which was causing BCE to take an invalid path. FWIW another fix would be to use the …
Thank you!
I had an issue where I wanted to retrain a model with new data, but the loss was increasing even though the code hadn't changed.
Previously discussed on Elixir Forum:
https://elixirforum.com/t/model-no-longer-learning-any-ideas/62883/6
Notebooks reproducing the issue with data:
https://github.com/preciz/not_learning
Surprisingly, the solution was either to revert Axon to version 0.4.1, or to change the loss function from `binary_cross_entropy` to `categorical_cross_entropy` and change the activation to `softmax`.

Since the exact same model learns fine on Axon version 0.4.1, I wonder if this is a bug and whether there should be an issue about it.