Options to do bootstrapping #297
Starting to work on this in #302
This sounds like a reasonable approach, except instead of generating the bootstrap samples within `__init__`, you could create some private methods to retrieve the bootstrapped quantities on demand.
Sounds good. I'll get this working on expectations first, and then try this. Note I don't think we want to store the bootstrap indices, since they are N x n_bootstraps, which is likely usually bigger than u_kn. But if we can store the random seed state, then we can regenerate them very quickly.
Good point. The main idea was just to use lazy instantiation so bootstrapping isn't done if it isn't needed. One other concern might be serialization: if the random number sequence isn't reproducible on different hardware or OS versions, we want to regenerate the bootstrap samples if the object is unpickled, or just not cache the bootstrap samples when pickled. I think that means you may need to remove some items from the object dictionary when `__getstate__` is called.
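A minimal sketch of what the lazy-instantiation idea could look like; the class name, the `_bootstrap_cache` attribute, and `_compute_bootstrap_free_energies` are illustrative placeholders, not pymbar API:

```python
import numpy as np


class MBARWithBootstrap:
    """Illustrative sketch only -- not the actual pymbar.MBAR API."""

    def __init__(self, u_kn, N_k, n_bootstraps=200):
        self.u_kn = np.asarray(u_kn)
        self.N_k = np.asarray(N_k)
        self.n_bootstraps = n_bootstraps
        self._bootstrap_cache = None  # nothing computed until requested

    @property
    def bootstrap_free_energies(self):
        # Lazy instantiation: the expensive bootstrap loop runs only the
        # first time bootstrapped uncertainties are actually asked for.
        if self._bootstrap_cache is None:
            self._bootstrap_cache = self._compute_bootstrap_free_energies()
        return self._bootstrap_cache

    def _compute_bootstrap_free_energies(self):
        # Placeholder for the real resample-and-resolve loop, which would
        # return an (n_bootstraps, K) array of free energies.
        raise NotImplementedError
```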
Correct, this could be a problem. MBAR doesn't currently pickle anything that I'm aware of, so we could just document that to warn people in the future.
For (3), you'd just need something like this:

    def __getstate__(self):
        # Copy the object's state from self.__dict__, which contains
        # all our instance attributes. Always use the dict.copy()
        # method to avoid modifying the original state.
        state = self.__dict__.copy()
        # Remove the bootstrap cache (or whatever it ends up being called);
        # pop() with a default is safe even if the cache was never built.
        state.pop('_bootstrap_cache', None)
        return state
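If the cache is dropped at pickle time, a matching `__setstate__` could simply leave it unset so it gets rebuilt lazily after unpickling; a sketch, again assuming the hypothetical `_bootstrap_cache` attribute:

```python
def __setstate__(self, state):
    # Restore everything that was pickled ...
    self.__dict__.update(state)
    # ... and leave the bootstrap cache unset, so it gets regenerated
    # (with this machine's random number stream) the next time it is
    # requested, sidestepping cross-platform reproducibility issues.
    self._bootstrap_cache = None
```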
OK, looks like I was wrong. At certain points, we have to have the bootstrapped log weights for each sample, which involves either 1) storing copies of the log weights (nstates x nsamples) or 2) storing copies of the random integers generated (also nstates x nsamples), since we need the random integers to select out the correct copies of the u_kn matrix.

If that's the case (and I don't discover any other issues), I will probably go with storing the log weights themselves, since it's a bit faster and simpler to store that extra data than to store the integers and recompute. If we use the standard recommendation of 200 bootstraps, that means storing 200 x nstates x nsamples numbers, which in some unusual but not crazy situations could be 200 x 200 states x 10,000 samples/state = 400,000,000 doubles, or 3.2 GB of memory. That could cause problems for large calculations. Storing the random ints instead would save 50% of the space (32 bit vs 64 bit), but would still fail at problems of the same order of size.

What we could do is make caching the intermediate calculations the default, so it only needs to be done once, with an option to switch to regenerating them for each calculation, which would be much less memory intensive (though more computationally intensive), and have recommendations / documentation about this.
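A quick back-of-the-envelope check of the sizes quoted above, using the same 200-bootstrap, 200-state, 10,000-samples-per-state scenario:

```python
n_bootstraps = 200
n_states = 200
samples_per_state = 10_000
n_total = n_states * samples_per_state                     # 2,000,000 samples

n_values = n_bootstraps * n_total                          # 400,000,000 values
print(f"cached log weights: {n_values * 8 / 1e9:.1f} GB")  # float64 -> ~3.2 GB
print(f"cached indices:     {n_values * 4 / 1e9:.1f} GB")  # int32   -> ~1.6 GB
```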
One other thought about bootstrapping: there are some calculations for which this type of bootstrapping won't work. For example, if someone is doing BAR consecutively among states (1->2, 2->3, 3->4), then one would need to bootstrap the ENTIRE process, not the individual steps (because the calculations are correlated). There is no way (that I can think of) to enforce this inside a single calculation, as people are free to chain together arbitrarily complicated processes. That can only really be addressed with good documentation and examples, or through other tools, like alchemlyb, that use MBAR.
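To make the "bootstrap the ENTIRE process" point concrete, here is one possible sketch; `u_by_state` and `run_bar` are hypothetical stand-ins (not pymbar functions). Each replicate resamples every state's configurations once, so all BAR steps that share a state see the same resample and the correlations between consecutive steps are preserved:

```python
import numpy as np


def bootstrap_consecutive_bar(u_by_state, n_bootstraps=200, seed=0):
    """Bootstrap a chain of BAR estimates (1->2, 2->3, ...) as one process.

    u_by_state[k][j] is assumed to hold state k's samples re-evaluated in
    state j; run_bar() is a stand-in for a single BAR calculation.
    """
    rng = np.random.default_rng(seed)
    K = len(u_by_state)
    totals = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        # Resample each state's configurations ONCE per replicate, rather
        # than independently for each pairwise step.
        idx = [rng.integers(len(u_by_state[k][k]), size=len(u_by_state[k][k]))
               for k in range(K)]
        total = 0.0
        for k in range(K - 1):
            w_forward = u_by_state[k][k + 1][idx[k]] - u_by_state[k][k][idx[k]]
            w_reverse = (u_by_state[k + 1][k][idx[k + 1]]
                         - u_by_state[k + 1][k + 1][idx[k + 1]])
            total += run_bar(w_forward, w_reverse)
        totals[b] = total
    # Spread of the end-to-end (1 -> K) free energy across replicates.
    return totals.std(ddof=1)
```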
Laying out the issues for including bootstrapping in MBAR. There are essentially three:
One relatively easy thing to do is collect bootstrap samples within init. This can generate a set of free energies that can then be processed by the individual routines to generate various statistics (standard deviations, confidence intervals, etc). Storing Nboots x K (where K is the number of nonzero states) should not generally be a big issue.
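A sketch of what collecting bootstrap free energies at construction time could look like; `solve_mbar` is a hypothetical stand-in for the existing self-consistent solver, and resampling is done within each state so every N_k stays fixed:

```python
import numpy as np


def bootstrap_f_k(u_kn, N_k, n_bootstraps=200, seed=0):
    """Collect an (n_bootstraps, K) array of resampled free energies.

    solve_mbar() is a stand-in for the existing MBAR solver.
    """
    rng = np.random.default_rng(seed)
    K = len(N_k)
    starts = np.concatenate(([0], np.cumsum(N_k)))   # column range per state
    f_boot = np.empty((n_bootstraps, K))
    for b in range(n_bootstraps):
        # Resample with replacement within each state, keeping each N_k fixed.
        cols = np.concatenate([
            rng.integers(starts[k], starts[k + 1], size=N_k[k])
            for k in range(K)
        ])
        f_boot[b] = solve_mbar(u_kn[:, cols], N_k)
    return f_boot   # only Nboots x K numbers need to be kept
```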
However, this just generates the distribution of the free energies of the sampled states. We also want to be able to bootstrap expectations and potentials of mean force, as well as perturbed free energies.
We could just run the bootstrapping every single time we are interested in new expectations, but that seems wasteful and expensive if we generate a lot of them.
If they are observables with NEW samples, we should just bootstrap everything over again. But I don't think we support that currently, so we can assume that the samples are fixed between calls to init (aside: do we actually enforce that?)
If they are observables with no new sampled states, we should not need to regenerate the bootstrapped free energies, as they will not be affected. But we would need to save the random numbers, so that we know which energies went into each bootstrap sample.
However, that is memory intensive, as we would need to store Nboots x Ntotal random indices, which could be quite large, frequently larger than the u_kn matrix itself (whenever Nboots > K).
One option for making it repeatable is to save the state of the random number generator right before generating the bootstrap samples; then one can regenerate the same indices later and redo the same bootstrapping for the perturbed free energies and expectations.
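A small sketch of the save-the-generator-state idea using numpy's `Generator` (the sizes are tiny, just for illustration):

```python
import numpy as np

n_bootstraps, n_total = 5, 1_000     # small, illustrative only

rng = np.random.default_rng(12345)

# Save the generator's state right before drawing the bootstrap indices;
# this is a tiny dict, independent of how many indices get drawn.
saved_state = rng.bit_generator.state

first_pass = rng.integers(n_total, size=(n_bootstraps, n_total))

# Later (e.g. for perturbed free energies or expectations), restore the
# state and regenerate exactly the same indices instead of caching them.
rng.bit_generator.state = saved_state
second_pass = rng.integers(n_total, size=(n_bootstraps, n_total))

assert np.array_equal(first_pass, second_pass)
```

The serialization caveat from the discussion above still applies: this only reproduces the same stream for the same numpy version and bit generator, so regenerating after unpickling on different hardware or software may yield different (though equally valid) bootstrap samples.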
This approach seems like it might work, then.
Does this sound reasonable?