Redesign of ZipDataset #86

adamjstewart · 2021-08-17T14:31:04Z

In my mind, there are several different reasons that someone might want to combine two GeoDatasets:

Combine image and target labels to sample both simultaneously (Landsat8 + CDL)
Combine datasets for multiple sensors (Landsat8 + Landsat7 or Landsat8 + Sentinel)
Combine datasets for disparate geospatial locations (ChesapeakeDE + ChesapeakeMD)

Right now, ZipDataset is designed to exclusively handle case 1. Case 2 doesn't work because the "image" key gets replaced instead of merged or concatenated. Case 3 doesn't work because we explicitly check for overlap between datasets.

We need to think about whether it is possible to support all possible use cases, whether there are any additional use cases, and how to implement this support. Ideally, these could all be wrapped into ZipDataset so that addition handles everything. Hopefully we don't need to add an additional ABC for MergeDataset or something like that.

The text was updated successfully, but these errors were encountered:

calebrob6 · 2021-09-14T06:09:12Z

Currently the ZipDataset does not work at all when you want to concatenate two spatial datasets that are not overlapping:

The ZipDataset computes the intersection of the sub-dataset's bounds as the new bounds. This should be a union if we are concatenating datasets.
The __getitem__ method will throw an error if all the sub-datasets do not contain the query. When concatenating datasets there will obviously be sub-datasets that do not contain a query.

One way around this in my mind is to rename ZipDataset to IntersectionDataset and create an UnionDataset.
Another way is to change the default behavior of ZipDataset to act as a "UnionDataset. For case 1 that you mention above it might should return nodata.

calebrob6 · 2021-09-14T15:52:10Z

Also, from conversations with interested users, use case 2 above is in demand. E.g. this use case is especially important in data fusion research lines.

adamjstewart · 2021-09-14T16:19:52Z

Of the 3 possible use cases, I think 1 is definitely the most common and should be the default when you add two GeoDatasets.

If we want to support both 2 and 3, I think you're right that we'll need separate base classes for these. The bounds of the dataset (union or intersection) aren't extremely important as the user can specify whatever bounds or index they want to use, the bigger issue is the sample keys.

In case 2, we want to return an image that has been stacked (twice as many channels). In case 3, we want to return an image that has been merged (same number of channels). As far as I can tell, there's no other way to predict the expected behavior. To implement this, we could either have separate wrappers (ZipDataset and MergeDataset or whatever) or add a merge=False arg to ZipDataset. I'm leaning towards the latter as it allows for additional features in the future.

The way that our samplers are defined, we need to pass a single index R-tree containing all regions that we would like to sample from. In cases 1 and 2, this usually means whichever dataset has less coverage. In case 3, we want to combine both indices into a single index.

So far it sounds like we're looking for a design that supports something like:

Case #	Bounds	Index	Sample
1	intersection	either	single
2	intersection?	either?	stacked
3	union	joined	merged

Does this look right? Are there any other possible use cases?

adamjstewart · 2021-11-19T04:30:49Z

In terms of the API, we have 4 options:

Two separate classes, called IntersectionDataset and UnionDataset
Two separate classes, called StackedDataset and MergedDataset
A single ZipDataset class with a union=False parameter
A single ZipDataset class with a merged=False parameter

Any preferences?

calebrob6 · 2021-11-19T05:17:26Z

So you might want to create the "new dataset" over the intersection or union of the two sub-datasets, then separately, you might want to handle how to combine the data from the 2 differently right:

Stack bands in the "imagery" key
Put one in "imagery" and one in "mask"

I think 3 or 4 would be more flexible right?

adamjstewart · 2021-11-19T16:42:02Z

3 and 4 are definitely more flexible. I can't currently think of a case where you would want to perform a union but stack the images, or perform an intersection but merge the images, but I'm sure someone will have a reason to do so. I'm actually starting to think that it might make more sense to have a collate_fn parameter (much like DataLoader) that allows you to pass a custom function for handling this.

adamjstewart · 2021-11-19T22:33:45Z

Thinking about this more, I think options 1 or 2 are the right way to go. Going back to the above table, use cases 1 and 2 can be handled by IntersectionDataset, while case 3 can be handled by UnionDataset. These two classes are very different.

In IntersectionDataset, the resulting index will only contain bounding boxes with overlap in all datasets. At sampling time, there will never be a case where we don't want to stack images from all datasets (that's the whole point of combining them).

In UnionDataset, the resulting index will simply be the combination of all dataset indices. We'll store the pointer back to the dataset from which each bounding box comes, then sample only from that dataset. In this case, there's nothing to stack since there's only a single result for each query.

If anyone can think of a use case that does not fit one of the 3 above, let me know. Otherwise, I'll continue with options 1 or 2.

calebrob6 · 2021-11-19T23:04:19Z

Then option 1 makes the most sense to me for naming. I can't think of a case where you'd want a UnionDataset where the sub-datasets have spatial overlap and you'd want to extract joint samples from that area.

Can IntersectionDataset know where to put the data from each component dataset based on some property of that Dataset? E.g. I want to do something like:

imagery = RasterDataset("path/to/imagery/", sample_key="image")
masks = RasterDataset("path/to/masks/", sample_key="mask")
dataset = imagery & masks

then get appropriate samples out of it. If multiple sub-datasets ask to be put in "image" then they will be stacked.

adamjstewart · 2021-11-19T23:29:13Z

Oooh, that's a really interesting idea, instead of supporting ds1 + ds2 we could instead support ds1 & ds2 and ds1 | ds2 to differentiate between intersection and union. I'll have to think about whether or not to drop ds1 + ds2 but I'm leaning towards yes since it's ambiguous.

For the key question, would a transform that renames the key handle your use case?

calebrob6 · 2021-11-20T01:20:14Z

For the key question, would a transform that renames the key handle your use case?

Yes, but that seems unnecessarily complicated. RasterDataset already has is_image which controls whether a layer goes in "image" or "mask" (I know I could just modify that property after creating the dataset, but I'm trying to think through the use cases)

adamjstewart added the datasets Geospatial or benchmark datasets label Aug 17, 2021

calebrob6 mentioned this issue Sep 14, 2021

ZipDatasets don't have an index therefore you can't sample properly #135

Closed

calebrob6 mentioned this issue Sep 18, 2021

Overhaul BoundingBox and ZipDataset classes #144

Merged

11 tasks

adamjstewart added this to the 0.2.0 milestone Nov 20, 2021

adamjstewart closed this as completed in #144 Dec 3, 2021

adamjstewart added utilities Utilities for working with geospatial data and removed utilities Utilities for working with geospatial data labels Jan 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesign of ZipDataset #86

Redesign of ZipDataset #86

adamjstewart commented Aug 17, 2021

calebrob6 commented Sep 14, 2021

calebrob6 commented Sep 14, 2021

adamjstewart commented Sep 14, 2021

adamjstewart commented Nov 19, 2021

calebrob6 commented Nov 19, 2021

adamjstewart commented Nov 19, 2021

adamjstewart commented Nov 19, 2021

calebrob6 commented Nov 19, 2021 •

edited

Loading

adamjstewart commented Nov 19, 2021 •

edited

Loading

calebrob6 commented Nov 20, 2021 •

edited

Loading

Redesign of ZipDataset #86

Redesign of ZipDataset #86

Comments

adamjstewart commented Aug 17, 2021

calebrob6 commented Sep 14, 2021

calebrob6 commented Sep 14, 2021

adamjstewart commented Sep 14, 2021

adamjstewart commented Nov 19, 2021

calebrob6 commented Nov 19, 2021

adamjstewart commented Nov 19, 2021

adamjstewart commented Nov 19, 2021

calebrob6 commented Nov 19, 2021 • edited Loading

adamjstewart commented Nov 19, 2021 • edited Loading

calebrob6 commented Nov 20, 2021 • edited Loading

calebrob6 commented Nov 19, 2021 •

edited

Loading

adamjstewart commented Nov 19, 2021 •

edited

Loading

calebrob6 commented Nov 20, 2021 •

edited

Loading