Fix duplicate context point sampling in "int" strategy for pandas #153
base: main
Conversation
Hi @jonas-scholz123, thanks for the PR! Happy to merge, but some initial thoughts:
I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).
That said, we should be more explicit about the replacement behaviour. I can imagine the `replace=True` case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).
My main concern is breaking backwards compatibility, with the TaskLoader now triggering a numpy ValueError if the user requests more than the number of dataset points. Please could you either:
- make `replace` an attribute of the TaskLoader with an easy-to-understand variable name, which defaults to False but users can set to True if needed, or
- explicitly catch `N > df.index.size` and raise a `SamplingTooManyPointsError` with an intuitive error message, which you `assertRaises` in your unit test (see the sketch below).
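For illustration, the second option might look roughly like this; the helper name and error message are made up here, and only `SamplingTooManyPointsError` follows the suggestion above:

```python
import numpy as np
import pandas as pd


class SamplingTooManyPointsError(Exception):
    """Raised when more context points are requested than the DataFrame contains."""


def sample_context_rows(df: pd.DataFrame, n: int, rng: np.random.Generator) -> pd.DataFrame:
    # Catch the case that numpy would otherwise surface as a bare ValueError.
    if n > df.index.size:
        raise SamplingTooManyPointsError(
            f"Requested {n} context points but only {df.index.size} are available."
        )
    return df.iloc[rng.choice(df.index.size, size=n, replace=False)]


# In the unit test (inside a unittest.TestCase), the error can then be asserted:
#     with self.assertRaises(SamplingTooManyPointsError):
#         sample_context_rows(df, n=df.index.size + 1, rng=np.random.default_rng(0))
```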
tests/test_task_loader.py (Outdated)
    x_coords = task["X_c"][0][0]
    y_coords = task["X_c"][0][1]
Call these x1 and x2 to follow the notation in the rest of the codebase.
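i.e. roughly:

```python
x1 = task["X_c"][0][0]  # first spatial coordinate, following codebase notation
x2 = task["X_c"][0][1]  # second spatial coordinate
```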
tests/test_task_loader.py (Outdated)
    num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)
    task = tl("2020-01-01", num_unique_coords, 10)
Can you use explicit keyword arguments here? Took me a minute to remember what the `10` refers to from its position, haha. Could also do something like `not_relevant_for_this_test=10`.
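For example, with explicit keywords the call might read something like the following; the parameter names here are assumed, so double-check them against the actual TaskLoader call signature:

```python
# Hypothetical keyword names; check TaskLoader.__call__ for the real signature.
task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,
    target_sampling=10,  # the positional "10" from before, now self-documenting
)
```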
Sure, will do
tests/test_task_loader.py (Outdated)
    num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)
    task = tl("2020-01-01", num_unique_coords, 10)
Just noting two small issues with this test:
- even with `replace=True` as we had before, we could by chance not sample any duplicates and pass this test. p(no_duplicates) goes down as N approaches the number of possible values though, which you've done here, so maybe add a comment to that effect.
- you aren't setting the random seed of the TaskLoader with `seed_override` at the call site here, so the test will be stochastic; is that deliberate?
Yeah, I noticed those, but the chance that you don't hit a single duplicate point when sampling 1000 times from a dataset with 1000 locations is basically 0, so I didn't bother to make it more robust!
Happy to set a random seed.
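As a rough sanity check of that probability (a standalone sketch, not part of the PR):

```python
import math


def p_no_duplicates(n: int) -> float:
    """Probability that n draws with replacement from n locations are all distinct: n! / n**n."""
    return math.exp(math.lgamma(n + 1) - n * math.log(n))


print(p_no_duplicates(1000))  # underflows to 0.0; the true value is on the order of 1e-432
```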
Thanks for your CR! Some thoughts:
I'm not sure I follow: we want the same data frame row to appear multiple times in the same task/context set/encoding? Why? I would understand if we had different measurements at the same location and wanted to be able to sample those, but in what scenario would users want the same observation twice?
I think this is quite unintuitive, e.g. when I specify a "split" of 80% context / 20% target, I would expect
I'm not sure which is the case now, but I think 3 and either 1 or 2 are currently not the case.
Does deepsensor guarantee backward stability at this point? I would argue that
Force-pushed from 554cd32 to 4a3f9de (Compare)
I've made the changes that I think lead to the most intuitive API. Let me know if breaking backward compatibility is a hard no; happy to change it to your approach (1) in that case.
Force-pushed from 4a3f9de to 0a09d63 (Compare)
📝 Description
Bugfix: previously, when sampling an integer number of context points from a pandas DataFrame, duplicate points could be sampled. This led to unexpected behaviour such as duplicate context points in a task, and not all points being sampled when passing N equal to the total number of available context points.
This fixes the issue by passing `replace=False` in the appropriate function call, and adds a unit test to cover the new behaviour.
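As a minimal illustration of the difference (a standalone numpy sketch, not the actual TaskLoader code):

```python
import numpy as np

rng = np.random.default_rng(42)
locations = np.arange(5)

# Old behaviour (with replacement): duplicates are possible and some points may be missed.
print(rng.choice(locations, size=5, replace=True))   # e.g. [0 4 4 2 1]

# New behaviour (replace=False): every location is sampled exactly once.
print(rng.choice(locations, size=5, replace=False))  # a permutation of 0..4
```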
✅ Checklist before requesting a review
- `pip install .[dev]` and running `pre-commit install` (or alternatively, manually running `ruff format` before committing).
If changing or adding source code:
- tests pass (`pytest`).
If changing or adding documentation:
- the documentation builds (`jupyter-book build docs --all`) and the changes look good from a manual inspection of the HTML in `docs/_build/html/`.