Fix duplicate context point sampling in "int" strategy for pandas #153

Open
wants to merge 1 commit into main

Conversation

jonas-scholz123
Collaborator

📝 Description

Bugfix: previously, when sampling an integer number of context points from a pandas DataFrame, points could be sampled with replacement. This led to unexpected behaviour such as duplicate context points, and to not all points being sampled when passing N equal to the total number of available context points.

This fixes the issue by passing replace=False in the appropriate function call, and adds a unit test to cover the new behaviour.
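
For illustration, a minimal sketch of the behaviour change, assuming the "int" strategy draws row indices with NumPy (the exact call site inside the TaskLoader may differ):

import numpy as np

rng = np.random.default_rng(42)
n_available = 1000  # number of rows available for this date
N = 1000            # requested number of context points

# Before the fix: sampling with replacement can return the same row more
# than once, so the context set can contain duplicates and miss some rows.
idx_before = rng.choice(n_available, size=N, replace=True)

# After the fix: replace=False guarantees N distinct rows (numpy raises a
# ValueError if N > n_available).
idx_after = rng.choice(n_available, size=N, replace=False)

assert len(set(idx_after)) == N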

✅ Checklist before requesting a review

  • I have installed developer dependencies with pip install .[dev] and run pre-commit install (or, alternatively, manually run ruff format before committing)

If changing or adding source code:

  • tests are included and are passing (run pytest).
  • documentation is included or updated as relevant, including docstrings.

If changing or adding documentation:

  • docs build successfully (jupyter-book build docs --all) and the changes look good from a manual inspection of the HTML in docs/_build/html/.

@jonas-scholz123 jonas-scholz123 added the bug Something isn't working label Feb 6, 2025
@jonas-scholz123 jonas-scholz123 self-assigned this Feb 6, 2025
Collaborator

@tom-andersson tom-andersson left a comment

Hi @jonas-scholz123, thanks for the PR! Happy to merge, but some initial thoughts:

I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).

That said, we should be more explicit about the replacement behaviour. I can imagine the replace=True case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).

My main concern is breaking backwards compatibility, with the TaskLoader now triggering a numpy ValueError if the user requests more points than the dataset contains. Please could you either

  1. make replace an attribute of the TaskLoader with an easy-to-understand variable name, which defaults to False but which users can set to True if needed, or
  2. explicitly catch N > df.index.size and raise a SamplingTooManyPointsError with an intuitive error message, which you assertRaises in your unit test (sketched below).
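
A rough sketch of option 2, where SamplingTooManyPointsError and sample_int are hypothetical names used purely for illustration (the real check would live wherever the "int" strategy samples from the DataFrame):

import numpy as np


class SamplingTooManyPointsError(ValueError):
    """Raised when more context points are requested than the data contains."""


def sample_int(df, N, rng):
    """Sample exactly N distinct rows from df, without replacement."""
    if N > df.index.size:
        raise SamplingTooManyPointsError(
            f"Requested {N} context points but only {df.index.size} are available."
        )
    idx = rng.choice(df.index.size, size=N, replace=False)
    return df.iloc[idx]


# The unit test could then assert the error is raised, e.g. with pytest:
#     with pytest.raises(SamplingTooManyPointsError):
#         sample_int(df, df.index.size + 1, rng)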

Comment on lines 170 to 171
x_coords = task["X_c"][0][0]
y_coords = task["X_c"][0][1]
Collaborator

Call these x1 and x2 to follow the notation used in the rest of the codebase.


num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)

task = tl("2020-01-01", num_unique_coords, 10)
Collaborator

Can you use explicit keyword arguments here? It took me a minute to remember what the 10 refers to from its position, haha. You could also do something like not_relevant_for_this_test=10.
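
For example (context_sampling and target_sampling are assumed keyword names here, for illustration only):

task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,  # request every available context point
    target_sampling=10,                  # the exact value is irrelevant for this test
)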

Collaborator Author

Sure, will do


num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)

task = tl("2020-01-01", num_unique_coords, 10)
Collaborator

Just noting two small issues with this test:

  1. even with replace=True as we had before, we could by chance not sample any duplicates and still pass this test. p(no_duplicates) goes down as N approaches the number of possible values though, which is what you've done here, so maybe add a comment to that effect.
  2. you aren't setting the random seed of the TaskLoader with seed_override at the call site here, so the test will be stochastic; is that deliberate?

Collaborator Author

Yeah, I noticed those, but the chance of not hitting a single duplicate point when sampling 1000 times from a dataset with 1000 locations is basically 0, so I didn't bother to make it more robust!

Happy to set a random seed
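
For example, fixing the seed at the call site (keyword names other than seed_override are assumptions):

task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,
    target_sampling=10,
    seed_override=42,  # make the sampled context points deterministic
)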

@jonas-scholz123
Collaborator Author

Thanks for your CR! Some thoughts:

I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).

I'm not sure I follow: we want the same DataFrame row to appear multiple times in the same task/context set/encoding? Why?

I would understand if we had different measurements at the same location and would want to be able to sample those, but in what scenario would users want the same observation twice?

That said, we should be more explicit about the replacement behaviour. I can imagine the replace=True case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).

I think this is quite unintuitive. E.g. when I specify a "split" of 80% context / 20% target, I would expect

  1. All the stations to be part of either the context or the target set
  2. The context set to contain 80% of the stations
  3. No stations to be featured twice

I'm not sure which is the case now, but I think 3, and either 1 or 2, are currently not the case.
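
For concreteness, the expectation described above would look roughly like this sketch (not the actual TaskLoader implementation):

import numpy as np

rng = np.random.default_rng(0)
station_ids = np.arange(1000)

# An 80/20 split drawn without replacement: every station lands in exactly
# one of the two sets, and no station appears twice.
n_context = int(0.8 * len(station_ids))
shuffled = rng.permutation(station_ids)
context_ids, target_ids = shuffled[:n_context], shuffled[n_context:]

assert len(context_ids) + len(target_ids) == len(station_ids)
assert np.intersect1d(context_ids, target_ids).size == 0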

My main concern is breaking backwards compatibility, with the TaskLoader now triggering a numpy ValueError if the user requests more points than the dataset contains. Please could you either

  1. make replace an attribute of the TaskLoader with an easy-to-understand variable name, which defaults to False but which users can set to True if needed, or
  2. explicitly catch N > df.index.size and raise a SamplingTooManyPointsError with an intuitive error message, which you assertRaises in your unit test.

Does deepsensor guarantee backwards compatibility at this point? I would argue that replace=False is the more intuitive default (and I'm not sure True should be supported at all!). If you want to ensure backwards compatibility, I'm happy to do 1 and add a deprecation warning asking people to explicitly set it to True. Otherwise I'd suggest going with 2?

@jonas-scholz123
Collaborator Author

I've now made the changes that I think lead to the most intuitive API. Let me know if breaking backwards compatibility is a hard no; I'm happy to change it to your approach (1) in that case.
