Timeseries can be tricked by a couple bad first inputs #283
Thanks! Looking into this now!
If I change this line to use a more robust estimator of cov(A, B), changing

```python
sigma2_AB = (dA_n * dB_n).mean()  # standard estimator to ensure C(0) = 1
```

to

```python
sigma2_AB = np.median(dA_n * dB_n)  # more robust estimator
```

then the number of uncorrelated samples seems to be underestimated by this change, but it gets the maximum in the desired place. I can then recompute the true statistical inefficiency afterwards.
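As a concrete illustration, here is a minimal, self-contained sketch of how that median-based estimator might slot into a pared-down statistical inefficiency calculation. The function name and loop structure here are illustrative, not pymbar's actual implementation:

```python
import numpy as np

def statistical_inefficiency_robust(A_n):
    """Sketch: estimate the statistical inefficiency g of timeseries A_n,
    using a median-based variance estimator instead of the mean-based one."""
    A_n = np.asarray(A_n, dtype=float)
    N = len(A_n)
    dA_n = A_n - A_n.mean()

    # Robust variance estimate: median instead of mean, as proposed above.
    # Note: the median is biased low for Gaussian data, which is consistent
    # with the underestimated number of uncorrelated samples mentioned above.
    sigma2 = np.median(dA_n * dA_n)

    g = 1.0
    for t in range(1, N - 1):
        # Normalized autocorrelation function at lag t.
        C_t = np.mean(dA_n[: N - t] * dA_n[t:]) / sigma2
        if C_t <= 0.0:
            break
        g += 2.0 * C_t * (1.0 - t / float(N))
    return max(g, 1.0)
```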
Another possibility is to compute the following statistics over only the last half of the timeseries:

```python
# Compute mean of each timeseries.
half = int(len(A_n) / 2)
mu_A = np.mean(A_n[half:])
mu_B = np.mean(B_n[half:])
...
# Compute estimator of covariance of (A,B) using estimator that will ensure C(0) = 1.
sigma2_AB = np.mean(dA_n[half:] * dB_n[half:])  # standard estimator to ensure C(0) = 1
```
I'll try these alterations in my original test set and see which choice minimally disrupts the desired optimal performance of the method.
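For readers wanting to reproduce the failure mode under discussion, here is a small synthetic sketch (pymbar 3 naming, where the function is `timeseries.detectEquilibration`; pymbar 4 renames it `detect_equilibration`; the AR(1) series and outlier magnitude are invented for illustration):

```python
import numpy as np
from pymbar import timeseries  # pymbar 3 API

rng = np.random.default_rng(0)

# A correlated (AR(1)) series that is equilibrated from the very start...
N, phi = 5000, 0.9
noise = rng.normal(size=N)
A_t = np.empty(N)
A_t[0] = noise[0]
for i in range(1, N):
    A_t[i] = phi * A_t[i - 1] + noise[i]

# ...except for two wildly off initial points, as described in this issue.
A_t[:2] -= 25.0

t0, g, Neff_max = timeseries.detectEquilibration(A_t)
# With the outliers present, t0 may be reported at or near 0 rather than
# after the bad samples, i.e. the equilibration time is underpredicted.
print(t0, g, Neff_max)
```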
Interested in the resolution of this issue, as it is affecting me too. I can provide another example if that would be helpful!
Please do! The more, the better!
Okay, attached are a test script and data. This timeseries is from a simulation with data recorded every 500 steps, hence the length. Even though the first couple of values of the series are much lower than the rest, the detected equilibration point does not exclude them.
Thanks! This is super helpful @ocmadin!
Seems like this is also a duplicate of #277.
@jchodera just wanted to bring this up again, as I have been getting outliers in heat capacity calculations from fluctuations, and this turned out to be the culprit. I have a temporary caveman solution, but would be interested in a more permanent (and elegant) solution.
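For reference, one plausible shape for such a stopgap (not necessarily the actual workaround used here) is to unconditionally drop a small initial fraction of the series before running detection, so early outliers cannot dominate the variance estimates:

```python
import numpy as np
from pymbar import timeseries  # pymbar 3 API

def detect_equilibration_skip(A_t, skip_frac=0.01):
    """Crude workaround sketch: discard the first skip_frac of the series,
    then run pymbar's detection on the remainder and shift t0 back."""
    skip = max(1, int(skip_frac * len(A_t)))
    t0, g, Neff = timeseries.detectEquilibration(np.asarray(A_t)[skip:])
    return t0 + skip, g, Neff
```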
Can you post a sample data set that is causing problems?
Okay, so here is a comparison of two datasets, one where the equilibration detection behaves as expected and one where it does not. These are both 9 ns simulations of 200 cyclohexane molecules at 293.15 K and 1.01 bar. In the first ("good") series the detected equilibration point looks reasonable, while in the second ("bad") series it does not [plots omitted]. Interestingly, these two timeseries start at roughly equal volumes (0.02302 for good vs. 0.02304 for bad), but within the first 10 timesteps the "bad" timeseries hits a max value of 0.02341, where the "good" timeseries only hits a max value of 0.02322 [plot omitted]. It seems like this little spike is the difference between the two behaviors.
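A sketch of how one might load the two attached series and compare detection results (the file names below are hypothetical placeholders for the actual attachments):

```python
import numpy as np
from pymbar import timeseries  # pymbar 3 API

# Hypothetical file names; the attachments in this thread differ.
for label in ("good", "bad"):
    V_t = np.load(f"cyclohexane_{label}.npy")
    t0, g, Neff = timeseries.detectEquilibration(V_t)
    print(f"{label}: t0={t0}, g={g:.1f}, Neff={Neff:.0f}, "
          f"max of first 10 frames={V_t[:10].max():.5f}")
```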
I've had a look at these with some of the equilibration heuristics we tested here. Specifically:
Naden Issue

As discussed in this issue, selecting the truncation point by minimum marginal standard error, rather than by maximum effective sample size, increases sensitivity to initial bias and means that all methods other than […]
Madin First Issue

Here, both the […]. Here, the […].

To understand why using the minimum marginal standard error (or SSE, squared standard error) in the […] behaves this way: the ESS (with the original PyMBAR autocorrelation treatment) shows a large spike at index 0. ESS is given by

ESS = n_actual * Var_uncor / Var_cor

where n_actual is the actual number of samples, and "uncor" and "cor" denote the uncorrelated and correlated variance estimates. Plotting these two variance estimates (normalised so that the first term is 1) shows that the first few very biased samples dramatically increase Var_uncor, but decrease Var_cor (by shifting the mean closer to a more central value for the later data) [plot omitted]. Therefore, using 1/SSE = n_actual / Var_cor dramatically reduces the initial spike [plot omitted]; a sketch tracing both criteria follows the Summary below.

Madin Second Issue (Bad Cyclohexane Series)

Here, I find that […].

Summary

[…]
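To make the ESS-versus-SSE comparison concrete, here is a sketch tracing both criteria over candidate truncation points. `timeseries.statisticalInefficiency` is pymbar's real helper (pymbar 3 naming); the scan loop and normalisation are illustrative:

```python
import numpy as np
from pymbar import timeseries  # pymbar 3 API

def ess_and_inverse_sse(A_t, stride=10):
    """Scan truncation points t0 and return the ESS and 1/SSE curves."""
    A_t = np.asarray(A_t, dtype=float)
    N = len(A_t)
    t0s, ess, inv_sse = [], [], []
    for t0 in range(0, N - 10, stride):
        tail = A_t[t0:]
        n = len(tail)
        g = timeseries.statisticalInefficiency(tail)
        var_uncor = tail.var(ddof=1) / n  # uncorrelated variance of the mean
        var_cor = g * var_uncor           # correlated variance of the mean (SSE)
        t0s.append(t0)
        ess.append(n * var_uncor / var_cor)  # = n / g, as in the formula above
        inv_sse.append(1.0 / var_cor)        # minimising SSE == maximising 1/SSE
    return np.array(t0s), np.array(ess), np.array(inv_sse)
```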
Code

All code and output: pymbar_equil_issues.tar.gz (you'll need to install red with […]).

Thanks!
If you have one or two wildly off datum points at the start of the timeseries, then `timeseries.detectEquilibration` woefully underpredicts the true equilibration time. @jchodera requested I post sample data and a script to process and regenerate the plot and output below.

The output below is from […]. The top plot is the full timeseries with bars showing where "equilibration" stops; the bottom is the same set from the 2nd sample onwards (`series[1:]`). Below the image is the code used to create it [plot and code omitted]. The dataset is in the attached zip file. @jchodera, if you only want the series to look at, you only need the `import`s and the first three lines of the code below.

explicit0_timeseries.npy.zip
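The original script is not reproduced in this transcription; below is a minimal stand-in matching the description above (the file name comes from the attachment; the plotting details are guesses):

```python
import numpy as np
import matplotlib.pyplot as plt
from pymbar import timeseries  # pymbar 3 API

series = np.load("explicit0_timeseries.npy")

# Compare detection on the full series vs. the series with the first point dropped.
fig, axes = plt.subplots(2, 1, sharex=True)
for ax, data, label in ((axes[0], series, "full series"),
                        (axes[1], series[1:], "series[1:]")):
    t0, g, Neff = timeseries.detectEquilibration(data)
    ax.plot(data)
    ax.axvline(t0, color="r", label=f"{label}: t0={t0}, Neff={Neff:.0f}")
    ax.legend()
plt.show()
```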