Restarting from checkpoint and appending StateReporter to a file results in multiple entries for the same time step. #10

Open
WillGPoole opened this issue Jul 13, 2023 · 2 comments

Comments

@WillGPoole

WillGPoole commented Jul 13, 2023

Posting this here because it came up yesterday and I forgot to mention it, but it might be something for the main OpenMM repo.

One of the keen-eyed attendees at the workshop yesterday worked out that, because the checkpoint file is written only every 1000 steps, a simulation can finish on a step that is not a multiple of 1000, e.g. 1500.

Restarting the simulation from the checkpoint written at step 1000 then results in duplicate entries in the log file, because the steps after 1000 that were already logged are reported again.

When the data are plotted, this gives two data points at the same time step, although admittedly the y values are nearly identical.

I imagine the problem becomes more obvious as the checkpoint interval grows relative to the reporting interval (beyond the 10-fold ratio used here) and as the number of steps the simulation has to go back increases.

In the example below, the log could have reached 14900 and the simulation would still have restarted from 14000. Imagine if we had written a checkpoint only every 10000 steps, for example.
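To make the arithmetic concrete, here is a small sketch that works out the restart step and which already-logged steps get reported a second time. It assumes reports are written at multiples of the report interval and that a checkpoint was written at every multiple of the checkpoint interval that was reached; `duplicated_steps` is a hypothetical helper, not part of OpenMM.

```python
def duplicated_steps(last_logged_step, checkpoint_interval, report_interval):
    """Return (restart_step, steps that will be logged twice after a restart).

    Hypothetical helper: assumes reports land on multiples of report_interval
    and checkpoints on multiples of checkpoint_interval.
    """
    # Restart point: the last checkpoint at or before the last logged step.
    restart_step = (last_logged_step // checkpoint_interval) * checkpoint_interval
    # First report after the restart: the next multiple of the report interval.
    first_report = (restart_step // report_interval + 1) * report_interval
    # Every step already in the log from there on will be written again.
    repeated = list(range(first_report, last_logged_step + 1, report_interval))
    return restart_step, repeated


for last, ckpt in ((1500, 1000), (14100, 1000), (14900, 1000), (19900, 10000)):
    restart, repeated = duplicated_steps(last, ckpt, report_interval=100)
    print(f"last logged {last}, checkpoint every {ckpt}: "
          f"restart from {restart}, {len(repeated)} duplicated rows")
# last logged 1500, checkpoint every 1000: restart from 1000, 5 duplicated rows
# last logged 14100, checkpoint every 1000: restart from 14000, 1 duplicated rows
# last logged 14900, checkpoint every 1000: restart from 14000, 9 duplicated rows
# last logged 19900, checkpoint every 10000: restart from 10000, 99 duplicated rows
```

The worst case is one row short of the checkpoint-to-report ratio, so a 10000-step checkpoint interval with 100-step reports can repeat up to 99 rows.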

Very quick example:

 13900,-146856.87880890456,305.87961649181017,91.87334675537606
 14000,-146838.48677000694,302.0436151483807,91.85773612934197
 14100,-146570.08052000694,302.9515737651756,91.85773612934197
----------- Simulation ended, and restarted from 14000 -------------
 14100,-146569.67427000694,302.94798025386604,91.85773612934197
 14200,-146859.15565928526,305.1387575727,91.68905971263145
 14300,-147399.1003796428,305.3627325364048,91.29362283195196
 14400,-147585.3503796428,301.270411426622,91.29362283195196
 14500,-146945.1628796428,298.35001246529237,91.29362283195196

I guess a fix would be for OpenMM, at some point during reporter setup, to compare the current time step with the last reported step and delete any later lines from the log file?
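A minimal sketch of that idea, done from the user's side rather than inside OpenMM: before reattaching a reporter in append mode, drop every row logged after the checkpoint step. It assumes a comma-separated log whose header lines start with `#` and whose step number is in the first column (adjust `step_column` if, say, a progress column comes first); `truncate_log` is a hypothetical helper, not an OpenMM API.

```python
import os
import tempfile


def truncate_log(path, restart_step, step_column=0, sep=","):
    """Remove data rows logged after restart_step from a CSV-style log.

    Hypothetical helper: assumes header/comment lines start with '#' and the
    step number sits in column `step_column`.
    """
    kept = []
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                kept.append(line)  # keep headers and blank lines untouched
                continue
            try:
                step = int(float(stripped.split(sep)[step_column]))
            except (ValueError, IndexError):
                continue  # drop a malformed (e.g. half-written) line
            if step <= restart_step:
                kept.append(line)
    # Write to a temporary file in the same directory, then swap it in
    # atomically so a crash here cannot leave a half-written log.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.writelines(kept)
    os.replace(tmp.name, path)
```

In a restart script this would presumably run after `simulation.loadCheckpoint(...)`, passing `simulation.currentStep` as `restart_step`, and before the `StateDataReporter` is recreated with `append=True`.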

@WillGPoole
Author

[Image: temperature plot]
This is the temperature.

@sef43
Contributor

sef43 commented Jul 13, 2023

Thanks for pointing this out! This is a limitation of the workshop script. The checkpointing and restarting procedure (or perhaps analysis) will need to be more sophisticated. I am not sure what the current best practice for this is and welcome suggestions!
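On the analysis side, one possible workaround (a sketch, not an established best practice) is to keep only the last row written for each step, since the rows written after the restart belong to the trajectory that actually continues. `deduplicate_rows` is a hypothetical helper; it skips any line whose fields do not all parse as numbers, such as a header or the annotation line in the example above.

```python
def deduplicate_rows(lines, step_column=0, sep=","):
    """Return numeric rows sorted by step, keeping the last row seen per step."""
    latest = {}
    for line in lines:
        fields = line.strip().split(sep)
        try:
            step = int(float(fields[step_column]))
            values = [float(x) for x in fields]
        except (ValueError, IndexError):
            continue  # header, comment, or annotation line
        latest[step] = values  # a later row overwrites an earlier duplicate
    return [latest[s] for s in sorted(latest)]


log = """\
 14000,-146838.48677000694,302.0436151483807,91.85773612934197
 14100,-146570.08052000694,302.9515737651756,91.85773612934197
----------- Simulation ended, and restarted from 14000 -------------
 14100,-146569.67427000694,302.94798025386604,91.85773612934197
 14200,-146859.15565928526,305.1387575727,91.68905971263145
""".splitlines()

rows = deduplicate_rows(log)
print([int(r[0]) for r in rows])  # [14000, 14100, 14200]
print(rows[1])  # the 14100 row from the restarted run
```

With pandas, `DataFrame.drop_duplicates(subset=..., keep="last")` on the step column would achieve the same thing.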
