Fix duplicate context point sampling in "int" strategy for pandas #153

Open
wants to merge 1 commit into main

Conversation

jonas-scholz123
Collaborator

📝 Description

Bugfix: previously, when sampling an integer number of context points from a pandas DataFrame, points could be sampled with replacement. This led to unexpected behaviour such as duplicate context points, and to not all points being sampled when passing N equal to the total number of available context points.

This fixes the issue by passing replace=False in the appropriate function call, and adds a unit test to cover the new behaviour.
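
For illustration, a minimal sketch of the behaviour change, assuming the "int" strategy draws row indices with NumPy (the exact call site inside the TaskLoader may differ):

import numpy as np

rng = np.random.default_rng(42)
n_available = 1000  # number of rows available for this date
N = 1000            # requested number of context points

# Before the fix: sampling with replacement can return the same row more
# than once, so the context set can contain duplicates and miss some rows.
idx_before = rng.choice(n_available, size=N, replace=True)

# After the fix: replace=False guarantees N distinct rows (numpy raises a
# ValueError if N > n_available).
idx_after = rng.choice(n_available, size=N, replace=False)

assert len(set(idx_after)) == N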

✅ Checklist before requesting a review

  • I have installed developer dependencies with pip install .[dev] and run pre-commit install (or, alternatively, manually run ruff format before committing)

If changing or adding source code:

  • tests are included and are passing (run pytest).
  • documentation is included or updated as relevant, including docstrings.

If changing or adding documentation:

  • docs build successfully (jupyter-book build docs --all) and the changes look good from a manual inspection of the HTML in docs/_build/html/.

@jonas-scholz123 jonas-scholz123 added the bug Something isn't working label Feb 6, 2025
@jonas-scholz123 jonas-scholz123 self-assigned this Feb 6, 2025
Collaborator

@tom-andersson tom-andersson left a comment

Hi @jonas-scholz123, thanks for the PR! Happy to merge, but some initial thoughts:

I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).

That said, we should be more explicit about the replacement behaviour. I can imagine the replace=True case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).

My main concern is breaking backwards compatibility, with the TaskLoader now triggering a numpy ValueError if the user requests more points than the dataset contains. Please could you either

  1. make replace an attribute of the TaskLoader with an easy-to-understand variable name, which defaults to False but which users can set to True if needed, or
  2. explicitly catch N > df.index.size and raise a SamplingTooManyPointsError with an intuitive error message, which you assertRaises in your unit test (sketched below).
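
A rough sketch of option 2, where SamplingTooManyPointsError and sample_int are hypothetical names used purely for illustration (the real check would live wherever the "int" strategy samples from the DataFrame):

import numpy as np


class SamplingTooManyPointsError(ValueError):
    """Raised when more context points are requested than the data contains."""


def sample_int(df, N, rng):
    """Sample exactly N distinct rows from df, without replacement."""
    if N > df.index.size:
        raise SamplingTooManyPointsError(
            f"Requested {N} context points but only {df.index.size} are available."
        )
    idx = rng.choice(df.index.size, size=N, replace=False)
    return df.iloc[idx]


# The unit test could then assert the error is raised, e.g. with pytest:
#     with pytest.raises(SamplingTooManyPointsError):
#         sample_int(df, df.index.size + 1, rng)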

Comment on lines 170 to 171
x_coords = task["X_c"][0][0]
y_coords = task["X_c"][0][1]
Collaborator

Call these x1 and x2 to follow the notation used in the rest of the codebase.


num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)

task = tl("2020-01-01", num_unique_coords, 10)
Collaborator

Can you use explicit keyword arguments here? It took me a minute to remember what the 10 refers to from its position, haha. You could also do something like not_relevant_for_this_test=10.
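
For example (context_sampling and target_sampling are assumed keyword names here, for illustration only):

task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,  # request every available context point
    target_sampling=10,                  # the exact value is irrelevant for this test
)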

Collaborator Author

Sure, will do


num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)

task = tl("2020-01-01", num_unique_coords, 10)
Collaborator

Just noting two small issues with this test:

  1. even with replace=True as we had before, we could by chance not sample any duplicates and still pass this test. p(no_duplicates) goes down as N approaches the number of possible values though, which is what you've done here, so maybe add a comment to that effect.
  2. you aren't setting the random seed of the TaskLoader with seed_override at the call site here, so the test will be stochastic; is that deliberate?

Collaborator Author

Yeah, I noticed those, but the chance of not hitting a single duplicate point when sampling 1000 times from a dataset with 1000 locations is basically 0, so I didn't bother to make it more robust!

Happy to set a random seed
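
For example, fixing the seed at the call site (keyword names other than seed_override are assumptions):

task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,
    target_sampling=10,
    seed_override=42,  # make the sampled context points deterministic
)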

@jonas-scholz123
Collaborator Author

Thanks for your CR! Some thoughts:

I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).

I'm not sure I follow: we want the same DataFrame row to appear multiple times in the same task/context set/encoding? Why?

I would understand if we had different measurements at the same location and would want to be able to sample those, but in what scenario would users want the same observation twice?

That said, we should be more explicit about the replacement behaviour. I can imagine the replace=True case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).

I think this is quite unintuitive. E.g. when I specify a "split" of 80% context / 20% target, I would expect

  1. All the stations to be part of either the context or the target set
  2. The context set to contain 80% of the stations
  3. No stations to be featured twice

I'm not sure which is the case now, but I think 3, and either 1 or 2, are currently not the case.
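
For concreteness, the expectation described above would look roughly like this sketch (not the actual TaskLoader implementation):

import numpy as np

rng = np.random.default_rng(0)
station_ids = np.arange(1000)

# An 80/20 split drawn without replacement: every station lands in exactly
# one of the two sets, and no station appears twice.
n_context = int(0.8 * len(station_ids))
shuffled = rng.permutation(station_ids)
context_ids, target_ids = shuffled[:n_context], shuffled[n_context:]

assert len(context_ids) + len(target_ids) == len(station_ids)
assert np.intersect1d(context_ids, target_ids).size == 0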

My main concern is breaking backwards compatibility, with the TaskLoader now triggering a numpy ValueError if the user requests more points than the dataset contains. Please could you either

  1. make replace an attribute of the TaskLoader with an easy-to-understand variable name, which defaults to False but which users can set to True if needed, or
  2. explicitly catch N > df.index.size and raise a SamplingTooManyPointsError with an intuitive error message, which you assertRaises in your unit test.

Does deepsensor guarantee backwards compatibility at this point? I would argue that replace=False is the more intuitive default (and I'm not sure True should be supported at all!). If you want to ensure backwards compatibility, I'm happy to do 1 and add a deprecation warning asking people to explicitly set it to True. Otherwise I'd suggest going with 2?

@jonas-scholz123
Collaborator Author

I've now made the changes that I think lead to the most intuitive API. Let me know if breaking backwards compatibility is a hard no; I'm happy to change it to your approach (1) in that case.
