-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bootstrap weights #485
base: main
Are you sure you want to change the base?
Bootstrap weights #485
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
|
Hey Alan! Thanks for opening another pull request. Since this time, the PR is not related to a minor fix but an extension, could you extend the PR description? In particular, could you state the problem (i.e. the missing feature), how this is called in the bootstrap literature, and how you plan to implement it. If you know other libraries that implement this feature, this would also be great to know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far! Thank you!
As mentioned above, I think we need to go through a few things first. If implementing these changes will take too much of your time, you can also re-open this pull request as an issue (feature request). Then we will think about how to implement the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You only check that the get_bootstrap_indices
function works when called with the new weights
arguments. But I think you can do more:
- I would rename
"wts"
to"weights"
, just to be explicit - If the weights are uniform (as in your example), the bootstrap indices with and without the weights argument should be the same. Could you test that? You will need to create two random number generators with the same state.
- Additionally, it would be very helpful to see if the weights work. An extreme case to check this would be a weighting vector where one entry is 1, and the other is 0. Then, you could (probabilistically) check that the index corresponding to the 0 weight is never chosen.
bootstrap_indices = list( | ||
rng.choice( | ||
n_obs, | ||
size=(n_draws, n_obs), | ||
replace=True, | ||
p=data[weights] / data[weights].sum(), | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more readable if you compute the vector of probabilities outside of these if conditions. If weights
is None
, the vector of probabilities could be None
. In this case, you could also write the same code for both cases:
probs = _calculate_bootstrap_indices_probs(data, weight_by, cluster_by)
bootstrap_indices = list(
rng.choice(
n_obs,
size=(n_draws, n_obs),
replace=True,
p=probs,
)
)
This will be slower than np.integers(n_obs, ...)
but I think this is a trade-off we would want to make here.
clusters = data.groupby(cluster_by)[weights].sum().reset_index() | ||
|
||
drawn_clusters = rng.choice( | ||
clusters[cluster_by], | ||
size=(n_draws, len(clusters)), | ||
replace=True, | ||
p=clusters[weights] / clusters[weights].sum(), | ||
) | ||
|
||
bootstrap_indices = _convert_cluster_ids_to_indices( | ||
data[cluster_by], drawn_clusters | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to be updated. Especially the line clusters[clusters_by]
is very confusing. I would think that clusters
is equal to something like data[cluster_by]
.
Similarly as above, it would be nice to pre-compute the probabilities.
Will try to address all of this in the coming days! Thanks! |
Hey @alanlujan91, thanks again for working on this feature. We are planning a major refactoring and want to close as many PRs as possible beforehand. Are you planning to work on this PR in the near future? |
hi @timmens I know this feature as proportional sampling, which is something we have done in Econ-ARK but I don't have a good link to the literature. https://en.wikipedia.org/wiki/Probability-proportional-to-size_sampling I can work on this more if you think this is a desired feature, I've just been pretty busy so this hasn't been a priority. |
Hi Alan, thanks for the PR. I think this would be nice to have in estimagic. As Tim said, we are planning major refactorings and would like to have few PRs open when we start. Do you think you can finish within the next three weeks? Otherwise it might be a good idea to convert this to an issue and postpone the actual implementation until you have more free time. |
@janosg I can definitely finish this within 3 weeks |
Modified code according to above comments from @timmens, except for
This one is complicated because of the following note from numpy It can be done, but requires generating dummy p's as p = np.ones(n_obs)/n_obs when weights_by is None. Let me know if this is desirable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look really good, thank you! I only have a few minor remarks and two small request, and then we can merge 🎉
Request 1
The function _get_probs_for_bootstrap_indices
is not tested yet and the docstring is not that informative. Can you
- Extend the docstring by explaining what to expect in the different cases (i.e.,
weight_by
is not None,cluster_by
is not None, etc.) - Write unit tests for these cases. Looking at the function, it is not directly clear to me that the case where both weight_by and cluster_by are not None is correctly handled. With an example in a unit test, this would be much easier to see.
Request 2
It can be done, but requires generating dummy p's as p = np.ones(n_obs)/n_obs when weights_by is None. Let me know if this is desirable
Ah, I see; very good point! In this case, can you add a structural test like this:
def test_get_bootstrap_indices_heterogeneous_weights():
data = pd.DataFrame({"id": [0, 1], "w_homogenous": [0.5, 0.5], "w_heterogenous": [0.1, 0.9]})
res_homogenous = get_bootstrap_indices(data, weight_by="w_homogenous", n_draws=1_000, rng=get_rng(seed=0))
res_heterogenous = get_bootstrap_indices(data, weight_by="w_heterogenous", n_draws=1_000, rng=get_rng(seed=0))
# Given the weights, the first sample mean should be close to 0.5, while the second one should be close to 0.9
assert np.mean(res_homogenous) < 0.75 < np.mean(res_heterogenous)
@alanlujan91 do you think it makes sense to keep this PR open or should we revisit the topic some other time? |
Sorry I took so long to get back to this. I finally addressed all of the comments, but now it seems there's some unrelated issue to this PR making the checks fail |
Thanks @alanlujan91! Yes, this is unrelated and comes from a new release of DFO-LS. I'll fix it soon. |
Add
weights
kwarg to handle survey weights