Early stopping criteria is only checked for the ReduceLROnPlateau scheduler #215
Comments
@mcognetta Actually, we initially introduced the option at joeynmt/configs/transformer_small.yaml, line 98 (in 0968187).
Although JoeyNMT doesn't stop the iteration, it does skip saving the weights if the validation performance has not increased (lines 654 to 658 in 0968187).
I'm not sure how many users wish to stop the iteration instead of just skipping the checkpoint. I'm happy to discuss further!
Hi, thanks for the background info. IMO, the main reason one would want a validation-based early-stopping mechanism that isn't tied directly to the learning rate is when you are training several models on a corpus where you don't know a priori how long convergence should take. For example, when I start on a new corpus, I would like to be able to just set max_epochs to 100 and let it run until convergence (maybe it converges in 30 epochs, and I don't want to waste 70 epochs' worth of compute). Tying it to the learning rate works for the plateau scheduler (and, in theory, the inverse-sqrt one, though it would take a huge number of steps to decay the LR enough to stop training), but not for the other schedulers. This is particularly helpful when you are running many jobs in parallel and don't want to waste compute on runs that are diverging, or to have to manually inspect the validation scores for all of them throughout training.

I know this library has different goals than fairseq (in fact, that is why I am so interested in this library), but it should be noted that this functionality is the default in fairseq and most other NMT packages I have used, and this meaning of patience is pretty common in the literature. I have a fork where I implemented it in just a few lines (I kept patience the same for the plateau scheduler and introduced a new config variable that implements this idea). I will open a PR if this is not a hard no (ofc, happy to discuss further).
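For readers landing here later, the mechanism under discussion can be sketched in a few lines. This is not JoeyNMT's or the fork's actual code — a minimal, scheduler-independent illustration, assuming a hypothetical `validate` callback that returns the metric to maximize (or minimize) and made-up parameter names `max_epochs` and `patience`:

```python
# Hypothetical sketch of scheduler-independent, patience-based early stopping.
# `validate`, `max_epochs`, and `patience` are illustrative names, not JoeyNMT's API.

def train_with_early_stopping(validate, max_epochs=100, patience=5, minimize=False):
    """Run up to `max_epochs` validation rounds; stop once the score has
    not improved for `patience` consecutive rounds. Returns the last epoch run."""
    best = None
    rounds_without_improvement = 0
    for epoch in range(max_epochs):
        score = validate(epoch)
        improved = best is None or (score < best if minimize else score > best)
        if improved:
            best = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return epoch  # stopped early
    return max_epochs - 1
```

The key point of the discussion is that this counter depends only on the validation scores, not on which LR scheduler is configured.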
I understand there are such situations and that it would be nice. My concern is how essential the functionality is, though. You know, I also implement many small "nice-to-have"s, but keep them in my personal fork and don't merge them into the main branch. Thank you again for your effort and deep consideration! I really appreciate it. I'm so glad to see our Joey is still growing 🐨
Understood, ofc I will defer to your judgement. However, do you mind if I open (and then immediately close) a PR here, linked to this issue, so that others can use it in the future if they need it? (Adding the relevant code snippets here is a little cumbersome, and it would be good for posterity.) It may also serve as a place to discuss whether many users want this feature. Thanks!
Yes, please open a PR! I'm more than happy with a democratic decision.
Describe the bug
Early stopping and patience are only considered for the ReduceLROnPlateau scheduler. For example, using `bleu` as an early-stopping criterion with the `WarmupInverseSquareRootScheduler` will not stop training early, even if validation performance does not improve for more than `patience` iterations. The issue is in these lines:
joeynmt/joeynmt/training.py
Lines 532 to 540 in 0968187
There should be an additional `stats` variable checking whether validation performance has not increased in `N` iterations.

To Reproduce
Steps to reproduce the behavior:
Set the `scheduling` parameter to `scheduling: "warmupinversesquareroot"`
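The extra bookkeeping suggested above (a stat tracking consecutive non-improving validations, checked regardless of scheduler) could look roughly like this. This is a hypothetical sketch, not JoeyNMT's actual `training.py`; the class name `EarlyStopper` and its fields are invented for illustration:

```python
# Hypothetical sketch of the missing check: count consecutive validations
# without improvement, independent of which LR scheduler is in use.
# `EarlyStopper` and its attribute names are illustrative, not JoeyNMT's code.

class EarlyStopper:
    def __init__(self, patience, minimize=False):
        self.patience = patience
        self.minimize = minimize
        self.best = None
        self.num_bad_validations = 0

    def step(self, score):
        """Record a validation score; return True if training should stop."""
        improved = self.best is None or (
            score < self.best if self.minimize else score > self.best
        )
        if improved:
            self.best = score
            self.num_bad_validations = 0
        else:
            self.num_bad_validations += 1
        return self.num_bad_validations >= self.patience
```

In the training loop, `stopper.step(bleu)` would be called after each validation, and the loop would break when it returns True — whereas currently only the ReduceLROnPlateau branch inspects patience.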
Logged output
If possible, please add the training log and validation reports or excerpts of them.
Expected behavior
Training stops early when validation stops improving.
System (please complete the following information):
Side note: Sorry for the influx of issues (there will be a few more in coming days). I am getting ramped up on this repo for a project of mine, so I have been doing lots of exploring. I will help to contribute fixes to anything that I report here.