Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign of ZipDataset #86

Closed
adamjstewart opened this issue Aug 17, 2021 · 10 comments · Fixed by #144
Closed

Redesign of ZipDataset #86

adamjstewart opened this issue Aug 17, 2021 · 10 comments · Fixed by #144
Labels
datasets Geospatial or benchmark datasets
Milestone

Comments

@adamjstewart
Copy link
Collaborator

In my mind, there are several different reasons that someone might want to combine two GeoDatasets:

  1. Combine image and target labels to sample both simultaneously (Landsat8 + CDL)
  2. Combine datasets for multiple sensors (Landsat8 + Landsat7 or Landsat8 + Sentinel)
  3. Combine datasets for disparate geospatial locations (ChesapeakeDE + ChesapeakeMD)

Right now, ZipDataset is designed to exclusively handle case 1. Case 2 doesn't work because the "image" key gets replaced instead of merged or concatenated. Case 3 doesn't work because we explicitly check for overlap between datasets.

We need to think about whether it is possible to support all possible use cases, whether there are any additional use cases, and how to implement this support. Ideally, these could all be wrapped into ZipDataset so that addition handles everything. Hopefully we don't need to add an additional ABC for MergeDataset or something like that.

@calebrob6
Copy link
Member

Currently the ZipDataset does not work at all when you want to concatenate two spatial datasets that are not overlapping:

  • The ZipDataset computes the intersection of the sub-dataset's bounds as the new bounds. This should be a union if we are concatenating datasets.
  • The __getitem__ method will throw an error if all the sub-datasets do not contain the query. When concatenating datasets there will obviously be sub-datasets that do not contain a query.

One way around this in my mind is to rename ZipDataset to IntersectionDataset and create an UnionDataset.
Another way is to change the default behavior of ZipDataset to act as a "UnionDataset. For case 1 that you mention above it might should return nodata.

@calebrob6
Copy link
Member

Also, from conversations with interested users, use case 2 above is in demand. E.g. this use case is especially important in data fusion research lines.

@adamjstewart
Copy link
Collaborator Author

Of the 3 possible use cases, I think 1 is definitely the most common and should be the default when you add two GeoDatasets.

If we want to support both 2 and 3, I think you're right that we'll need separate base classes for these. The bounds of the dataset (union or intersection) aren't extremely important as the user can specify whatever bounds or index they want to use, the bigger issue is the sample keys.

In case 2, we want to return an image that has been stacked (twice as many channels). In case 3, we want to return an image that has been merged (same number of channels). As far as I can tell, there's no other way to predict the expected behavior. To implement this, we could either have separate wrappers (ZipDataset and MergeDataset or whatever) or add a merge=False arg to ZipDataset. I'm leaning towards the latter as it allows for additional features in the future.

The way that our samplers are defined, we need to pass a single index R-tree containing all regions that we would like to sample from. In cases 1 and 2, this usually means whichever dataset has less coverage. In case 3, we want to combine both indices into a single index.

So far it sounds like we're looking for a design that supports something like:

Case # Bounds Index Sample
1 intersection either single
2 intersection? either? stacked
3 union joined merged

Does this look right? Are there any other possible use cases?

@adamjstewart
Copy link
Collaborator Author

In terms of the API, we have 4 options:

  1. Two separate classes, called IntersectionDataset and UnionDataset
  2. Two separate classes, called StackedDataset and MergedDataset
  3. A single ZipDataset class with a union=False parameter
  4. A single ZipDataset class with a merged=False parameter

Any preferences?

@calebrob6
Copy link
Member

So you might want to create the "new dataset" over the intersection or union of the two sub-datasets, then separately, you might want to handle how to combine the data from the 2 differently right:

  • Stack bands in the "imagery" key
  • Put one in "imagery" and one in "mask"

I think 3 or 4 would be more flexible right?

@adamjstewart
Copy link
Collaborator Author

3 and 4 are definitely more flexible. I can't currently think of a case where you would want to perform a union but stack the images, or perform an intersection but merge the images, but I'm sure someone will have a reason to do so. I'm actually starting to think that it might make more sense to have a collate_fn parameter (much like DataLoader) that allows you to pass a custom function for handling this.

@adamjstewart
Copy link
Collaborator Author

Thinking about this more, I think options 1 or 2 are the right way to go. Going back to the above table, use cases 1 and 2 can be handled by IntersectionDataset, while case 3 can be handled by UnionDataset. These two classes are very different.

In IntersectionDataset, the resulting index will only contain bounding boxes with overlap in all datasets. At sampling time, there will never be a case where we don't want to stack images from all datasets (that's the whole point of combining them).

In UnionDataset, the resulting index will simply be the combination of all dataset indices. We'll store the pointer back to the dataset from which each bounding box comes, then sample only from that dataset. In this case, there's nothing to stack since there's only a single result for each query.

If anyone can think of a use case that does not fit one of the 3 above, let me know. Otherwise, I'll continue with options 1 or 2.

@calebrob6
Copy link
Member

calebrob6 commented Nov 19, 2021

Then option 1 makes the most sense to me for naming. I can't think of a case where you'd want a UnionDataset where the sub-datasets have spatial overlap and you'd want to extract joint samples from that area.

Can IntersectionDataset know where to put the data from each component dataset based on some property of that Dataset? E.g. I want to do something like:

imagery = RasterDataset("path/to/imagery/", sample_key="image")
masks = RasterDataset("path/to/masks/", sample_key="mask")
dataset = imagery & masks

then get appropriate samples out of it. If multiple sub-datasets ask to be put in "image" then they will be stacked.

@adamjstewart
Copy link
Collaborator Author

adamjstewart commented Nov 19, 2021

Oooh, that's a really interesting idea, instead of supporting ds1 + ds2 we could instead support ds1 & ds2 and ds1 | ds2 to differentiate between intersection and union. I'll have to think about whether or not to drop ds1 + ds2 but I'm leaning towards yes since it's ambiguous.

For the key question, would a transform that renames the key handle your use case?

@calebrob6
Copy link
Member

calebrob6 commented Nov 20, 2021

For the key question, would a transform that renames the key handle your use case?

Yes, but that seems unnecessarily complicated. RasterDataset already has is_image which controls whether a layer goes in "image" or "mask" (I know I could just modify that property after creating the dataset, but I'm trying to think through the use cases)

@adamjstewart adamjstewart added this to the 0.2.0 milestone Nov 20, 2021
@adamjstewart adamjstewart added utilities Utilities for working with geospatial data and removed utilities Utilities for working with geospatial data labels Jan 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants