OnExceptionCheckpoint: training resumes if ckpt found, even if no ckpt_path provided #19827

Open
brijow opened this issue Apr 29, 2024 · 0 comments
Labels: bug, needs triage

brijow commented Apr 29, 2024

Bug description

If an OnExceptionCheckpoint is included in the trainer's list of callbacks, then even when Trainer.fit is called without a ckpt_path argument, the CheckpointConnector searches for a checkpoint in OnExceptionCheckpoint.dirpath. If one is found, it is used to resume training.

Further, the warnings shown at the beginning of Trainer.fit when the OnExceptionCheckpoint callback is enabled are, in my opinion, incorrect (see the logs below).
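To check whether a leftover checkpoint is present before calling Trainer.fit, something like the following stdlib-only sketch may help. `find_stale_checkpoints` is a hypothetical helper, not part of Lightning, and it assumes checkpoints are written with the `.ckpt` extension.

```python
from pathlib import Path


def find_stale_checkpoints(dirpath: str) -> list[Path]:
    """Return the `.ckpt` files directly inside `dirpath`, sorted by path.

    Hypothetical helper (not a Lightning API). Assumes checkpoints use
    the `.ckpt` file extension. Subdirectories are not searched.
    """
    directory = Path(dirpath)
    if not directory.is_dir():
        # Nothing to discover if the directory does not exist.
        return []
    return sorted(
        p for p in directory.iterdir() if p.is_file() and p.suffix == ".ckpt"
    )
```

A non-empty result from `find_stale_checkpoints(".")` before the second `fit` call would suggest that the resume behaviour described above could be triggered.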

What version are you seeing the problem on?

v2.2

How to reproduce the bug

from lightning.pytorch.callbacks import OnExceptionCheckpoint, ModelCheckpoint
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel

trainer = Trainer(
    # NOTE: either of these callback lists reproduces the issue (and triggers the same warnings shown below)
    # callbacks=[OnExceptionCheckpoint("."), ModelCheckpoint(".", save_last=True)],
    callbacks=[OnExceptionCheckpoint("."), ModelCheckpoint(".", save_last=False)],
    max_epochs=3,
)
trainer.fit(model=BoringModel())

# Calling `fit` again resumes training from the checkpoint discovered in the cwd:
trainer.fit(model=BoringModel())

Error messages and logs

The following warnings are confusing and arguably wrong:

.../checkpoint_connector.py:126: `.fit(ckpt_path=None)` was called without a model. The last model of the previous `fit` call will be used. You can pass `fit(ckpt_path='best')` to use the best model or `fit(ckpt_path='last')` to use the last model. If you pass a value, this warning will be silenced.
.../checkpoint_connector.py:186: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded. HINT: Set `ModelCheckpoint(..., save_last=True)`.

Environment

PyTorch Lightning Version: 2.2.1

More info

No response
brijow added the bug and needs triage labels on Apr 29, 2024