PPY-66: MetadataInconsistency throws a fatal error in sync mode | Neptune client crashes when uploading checkpoints if they rotate faster than uploads #1888
Comments
Hey @ocharles 👋 While we investigate, I can suggest some workarounds:
Would any of these be an acceptable workaround for now?
Oh, mode=sync sounds good, I missed that! I would like to keep checkpoints uploaded, as Neptune makes it so convenient and it's really nice to have all the information. I guess the ideal behaviour would be that only checkpoints are uploaded synchronously, but I can probably live with a minor hit at the end of validation or whenever I log training steps. Thanks, I'll try this when I'm back at my computer next week. I really appreciate the (incredibly) speedy response!
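For context, switching the connectivity mode to sync might look something like the sketch below. This assumes the Lightning NeptuneLogger forwards extra keyword arguments to neptune.init_run; the project name is a placeholder.

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

# Sketch only: extra kwargs such as mode are assumed to be passed through to neptune.init_run.
neptune_logger = NeptuneLogger(
    project="my-workspace/my-project",  # placeholder
    mode="sync",                        # wait for each logging call to reach the server
    log_model_checkpoints=True,
)
trainer = Trainer(logger=neptune_logger)

# Equivalent alternative: create the run explicitly with neptune.init_run(mode="sync")
# and hand it over via NeptuneLogger(run=...).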
Most of the sync lag will be introduced by uploading checkpoints, so there shouldn't be a noticeable difference between syncing the entire state of the run and syncing just the checkpoints. Please let me know if using sync mode works for you.

I'll be at a company offsite next week and then at NeurIPS, but will look into this once back 📝
I think this is now turning what was previously a warning into a fatal exception:
I know this is known (#733), but I've never seen this cause a crash. Here's the stack trace:
Thanks for bringing this issue to our notice. It should be a warning and not a fatal error. I was able to reproduce it and have created a ticket for the team to look into it. Meanwhile, would you be open to slightly changing the Lightning source code to handle this error temporarily?

from neptune.exceptions import MetadataInconsistency

for key, val in metrics.items():
    try:
        self.run[key].append(val, step=step)
    except MetadataInconsistency as e:
        log.warning(f"Failed to log metric {key} with value {val} at step {step} due to: {e}")

We are just wrapping the metric-logging call in a try/except so that a MetadataInconsistency is logged as a warning instead of raised. Would this work as a band-aid?
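As an alternative to editing the installed Lightning source, the same band-aid could live in a small subclass. This is only a sketch, assuming log_metrics is the method that ends up calling run[key].append; TolerantNeptuneLogger is a made-up name.

import logging

from neptune.exceptions import MetadataInconsistency
from pytorch_lightning.loggers import NeptuneLogger

log = logging.getLogger(__name__)

class TolerantNeptuneLogger(NeptuneLogger):
    # Hypothetical wrapper: downgrade MetadataInconsistency to a warning instead of crashing.
    def log_metrics(self, metrics, step=None):
        try:
            super().log_metrics(metrics, step=step)
        except MetadataInconsistency as e:
            log.warning(f"Skipping metrics {sorted(metrics)} at step {step}: {e}")

The rest of the training setup stays the same; only the logger class passed to the Trainer changes.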
Describe the bug
I'm fairly reliably seeing the following type of exception:
My training loop takes about 15s per epoch, and is set to check validation every 10 epochs, and validation also performs model checkpointing if the validation metric improves. At the start of training every validation step improves, so I churn through a lot of checkpoints. Each checkpoint is about 500MB, and I have it set to save the top 3.
I've also noticed that checkpoints are initially written as .bin files, which then seem to get renamed to .ckpt files; perhaps this renaming is happening during upload and causing issues?
I'm using PyTorch Lightning with the built-in ModelCheckpoint callback, and I'm using the built-in NeptuneLogger integration.
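For reference, the relevant configuration looks roughly like the sketch below; the monitored metric name and project are placeholders, and everything else follows the setup described above.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import NeptuneLogger

# Rough sketch of the setup described above; metric name and project are placeholders.
checkpoint_cb = ModelCheckpoint(
    monitor="val_metric",  # checkpoint whenever the validation metric improves
    mode="min",
    save_top_k=3,          # keep the three best ~500 MB checkpoints, rotating older ones out
)
trainer = Trainer(
    logger=NeptuneLogger(project="my-workspace/my-project"),
    callbacks=[checkpoint_cb],
    check_val_every_n_epoch=10,  # validation (and hence checkpointing) every 10 epochs
)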
Reproduction
Unfortunately I don't have a shareable reproduction. I can fairly reliably reproduce this in production, though it seems network-dependent: yesterday the same code didn't have this problem, but I'm constantly hitting it this morning.
It's fairly severe, as it means I completely lose any real-time metrics.
Expected behavior
All checkpoints would upload and rotate as expected.
Traceback
See above.
Environment
I can't get this right now, please shout if you think you need to know this!