
Feature request: resume training without overwriting model dir #162

Open
cdleong opened this issue Dec 14, 2021 · 6 comments
Labels: enhancement (New feature or request)

cdleong commented Dec 14, 2021

Currently it does not seem possible to resume a previous training run where it left off if a model directory already exists.

If `overwrite` is set to `False`, `make_model_dir(model_dir: str, overwrite=False) -> str` will simply error out.

If `overwrite` is set to `True`, it will wipe out the existing model directory.
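
For context, the two code paths amount to roughly the following (a minimal sketch, not the exact JoeyNMT source):

```python
import os
import shutil


def make_model_dir(model_dir: str, overwrite: bool = False) -> str:
    """Create a fresh model directory; refuse or wipe an existing one depending on `overwrite`."""
    if os.path.isdir(model_dir):
        if not overwrite:
            # overwrite=False: no way to reuse the directory and resume
            raise FileExistsError(f"Model directory {model_dir} exists and overwriting is disabled.")
        # overwrite=True: wipes the previous run, including all checkpoints
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
    return model_dir
```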

@juliakreutzer (Collaborator)

Yes, a "resume" option in addition to overwriting would be helpful. It would require the following:

  1. Continue logging to the same log file, or rename the existing one.
  2. Load the most recent checkpoint.
  3. Load the config file from the same model dir.
  4. Continue the step count from where it left off.

In order to avoid config incompatibilities, it would probably be best not to add a "resume" option to the config, but rather to pass it as a command-line argument to the training call.

One thing to note: Since we are not storing the state of the iterator, the data will be reshuffled and will be trained on in a new random order. This means that there is still a slight difference between running the model continuously versus interrupting and resuming it.
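
A minimal sketch of what steps 2 and 4 could look like, assuming checkpoints are plain PyTorch files that store the step count next to the model and optimizer state; the file pattern and dictionary keys here are illustrative, not JoeyNMT's actual checkpoint format:

```python
from pathlib import Path

import torch


def find_latest_checkpoint(model_dir: str) -> Path:
    """Return the newest *.ckpt file in the model directory."""
    ckpts = sorted(Path(model_dir).glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    if not ckpts:
        raise FileNotFoundError(f"No checkpoint found in {model_dir}")
    return ckpts[-1]


def resume_state(model, optimizer, model_dir: str, device: str = "cpu") -> int:
    """Load model/optimizer state from the latest checkpoint and return its step count."""
    ckpt = torch.load(find_latest_checkpoint(model_dir), map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    # step 4: training continues counting from this value
    return ckpt.get("steps", 0)
```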

Happy to review any attempts to implement this!


cdleong commented Feb 8, 2022

Trying to think through the desired behavior if we create a continue flag and it interacts with the existing overwrite flag.

| directory exists already | continue flag | overwrite flag | what do? |
| --- | --- | --- | --- |
| FALSE | FALSE | FALSE | make the dir and train |
| FALSE | FALSE | TRUE | make the dir and train |
| FALSE | TRUE | FALSE | make the dir and train |
| FALSE | TRUE | TRUE | make the dir and train |
| TRUE | FALSE | FALSE | error out and quit. The user didn't ask to continue and didn't ask to overwrite, but let them know those are options. |
| TRUE | FALSE | TRUE | The user wanted overwrite, so delete the dir and start fresh. If continue defaults to False, this could happen by accident if someone reruns with the same config. |
| TRUE | TRUE | FALSE | Continue the previous training; don't overwrite. |
| TRUE | TRUE | TRUE | The user requested both continuing and overwriting, which don't seem compatible. Why would you want both? Maybe because they just reran the same command? Ask the user which behavior they want, I guess? |

@juliakreutzer (Collaborator)

That looks good. In the 3x TRUE case, we could throw a warning and set overwrite=False, since I assume the case you described (rerunning the same command) is the most likely reason for having that configuration.
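
A sketch of how the truth table, including the warning for the triple-TRUE case, could be folded into make_model_dir; continue_training is a stand-in parameter name, since continue itself is a reserved word in Python:

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)


def make_model_dir(model_dir: str, overwrite: bool = False,
                   continue_training: bool = False) -> str:
    """Create (or reuse) the model directory according to the truth table above."""
    if not os.path.isdir(model_dir):
        # directory doesn't exist: every flag combination just creates it and trains
        os.makedirs(model_dir)
        return model_dir
    if continue_training:
        if overwrite:
            # both requested: warn and prefer the non-destructive option
            logger.warning("Both continue and overwrite were requested; "
                           "ignoring overwrite and resuming instead.")
        # reuse the existing directory and resume from its checkpoints
        return model_dir
    if overwrite:
        # start fresh: wipe the previous run
        shutil.rmtree(model_dir)
        os.makedirs(model_dir)
        return model_dir
    raise FileExistsError(
        f"{model_dir} exists. Pass overwrite=True to start over "
        "or continue_training=True to resume.")
```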


cdleong commented Feb 9, 2022

OK, working from the outside in:

  • training.py is apparently what runs when we do `joeynmt train`.
  • It currently takes just one arg, the path to a config.yml. Presumably we'd add an arg to the ArgumentParser in the main function if we're thinking of taking in a `--continue` flag (see the sketch after this list).
  • That would then need to be passed on to train(), defined around line 796, probably as an optional arg with a default value, something like continue_previous_training=False.
  • make_model_dir should probably be where we implement most of the logic from the truth table.
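
A minimal sketch of the CLI wiring under those assumptions; the parser below is illustrative rather than JoeyNMT's actual argument parser, and train() is stubbed out:

```python
import argparse


def train(cfg_file: str, continue_previous_training: bool = False) -> None:
    """Placeholder for the real train() in training.py."""
    print(f"training with {cfg_file}, resume={continue_previous_training}")


def main():
    parser = argparse.ArgumentParser("joeynmt")
    parser.add_argument("config_path", help="path to the config.yml")
    # 'continue' is a Python keyword, so map the flag onto a different dest name;
    # store_true keeps the default at False, so existing configs behave as before
    parser.add_argument("--continue", dest="continue_previous_training",
                        action="store_true",
                        help="resume training in the existing model dir")
    args = parser.parse_args()
    train(args.config_path,
          continue_previous_training=args.continue_previous_training)


if __name__ == "__main__":
    main()
```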


cdleong commented Feb 9, 2022

Also, based on a Slack discussion with Julia, the defaults should be continue=False and overwrite=False, so as to err on the non-destructive side: someone may have used a config that overwrote a previous model and then want to continue training that model.


cdleong commented Mar 22, 2022

In related news: https://www.philschmid.de/sagemaker-spot-instance

Someday, having S3 checkpointing would be cool!
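
For reference, a minimal sketch of what uploading each saved checkpoint to S3 could look like with boto3; the bucket name and key prefix are made up:

```python
import boto3

s3 = boto3.client("s3")


def upload_checkpoint(local_path: str, bucket: str = "my-training-bucket",
                      prefix: str = "joeynmt/checkpoints") -> None:
    """Copy a saved checkpoint to S3 so it survives spot-instance interruption."""
    key = f"{prefix}/{local_path.rsplit('/', 1)[-1]}"
    s3.upload_file(local_path, bucket, key)


# e.g. call right after each local checkpoint save:
# upload_checkpoint("models/my_model/12000.ckpt")
```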

@may- added the enhancement (New feature or request) label on Jun 14, 2022