Feature request: resume training without overwriting model dir #162
Comments
Yes, a "resume" option in addition to overwriting would be helpful. It would require the following:
To avoid config incompatibilities, it would probably be best not to add a "resume" option to the config, but rather to pass it as a command line argument to the training call. One thing to note: since we are not storing the state of the iterator, the data will be reshuffled and trained on in a new random order. This means there is still a slight difference between running the model continuously and interrupting and resuming it. Happy to review any attempts to implement this!
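For illustration, here is a minimal sketch of what the command line side could look like. The `--resume` flag, the `config_path` argument, and `build_parser` are hypothetical names chosen for this example; they are not JoeyNMT's actual CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser for the training call (sketch only)."""
    parser = argparse.ArgumentParser(description="training entry point (sketch)")
    # Assumed positional argument: path to the YAML config file.
    parser.add_argument("config_path", type=str, help="path to the YAML config")
    # Assumed flag: kept on the command line instead of in the config,
    # so existing config files stay compatible.
    parser.add_argument(
        "--resume",
        action="store_true",
        help="continue training in an existing model directory",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"config={args.config_path} resume={args.resume}")
```

Since the flag uses `store_true`, it defaults to `False` when omitted, so existing training calls would behave as before.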
Trying to think through desired behavior if we create a continue flag, and it now interacts with the overwrite flag.
That looks good. In the 3x TRUE case, we could throw a warning and set overwrite=False, as I assume the case you described would be the best reason for having this configuration.
OK, working from outside in:
Also, based on a Slack discussion with Julia, the default should be continue=False, overwrite=False, so as to err on the side of not overwriting: someone may have trained a previous model with a config that sets overwrite and then want to continue that model.
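To make the interaction of the two flags concrete, here is a rough sketch under some assumptions. The function name `prepare_model_dir` and the parameter name `resume` are hypothetical (`continue` is a reserved word in Python), and raising `FileExistsError` when both flags are False and the directory exists is my assumption about the non-overwrite behaviour, not a quote from the existing helper.

```python
import shutil
import warnings
from pathlib import Path


def prepare_model_dir(model_dir: str, overwrite: bool = False, resume: bool = False) -> Path:
    """Hypothetical helper: decide what to do with the model directory."""
    path = Path(model_dir)

    # No existing directory: create it, whatever the flags say.
    if not path.is_dir():
        path.mkdir(parents=True)
        return path

    if resume:
        if overwrite:
            # "3x TRUE" case: directory exists, resume and overwrite are both set.
            # Warn and behave as if overwrite=False, so nothing is deleted.
            warnings.warn("resume and overwrite are both set; ignoring overwrite")
        return path

    if overwrite:
        # Wipe the existing directory and start from scratch.
        shutil.rmtree(path)
        path.mkdir(parents=True)
        return path

    # Both flags False (the proposed default): refuse to touch existing models.
    # Assumption: signalled here with FileExistsError.
    raise FileExistsError(f"Model directory {path} exists and overwriting is disabled.")
```

With these defaults, calling `prepare_model_dir("models/my_model")` on an existing directory fails loudly instead of deleting anything, which matches the "err on the non-overwrite side" preference.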
In related news: https://www.philschmid.de/sagemaker-spot-instance someday having S3 checkpointing would be cool!
Currently it does not seem to be possible to resume a previous training run where it left off if there is already a model directory.
If overwrite is set to False: joeynmt/joeynmt/helpers.py, line 34 in cd6974f.
If overwrite is set to True, it will wipe out the existing model directory.