-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesign of ZipDataset #86
Comments
Currently the ZipDataset does not work at all when you want to concatenate two spatial datasets that are not overlapping:
One way around this in my mind is to rename ZipDataset to IntersectionDataset and create an UnionDataset. |
Also, from conversations with interested users, use case 2 above is in demand. E.g. this use case is especially important in data fusion research lines. |
Of the 3 possible use cases, I think 1 is definitely the most common and should be the default when you add two GeoDatasets. If we want to support both 2 and 3, I think you're right that we'll need separate base classes for these. The bounds of the dataset (union or intersection) aren't extremely important as the user can specify whatever bounds or index they want to use, the bigger issue is the sample keys. In case 2, we want to return an image that has been stacked (twice as many channels). In case 3, we want to return an image that has been merged (same number of channels). As far as I can tell, there's no other way to predict the expected behavior. To implement this, we could either have separate wrappers ( The way that our samplers are defined, we need to pass a single index R-tree containing all regions that we would like to sample from. In cases 1 and 2, this usually means whichever dataset has less coverage. In case 3, we want to combine both indices into a single index. So far it sounds like we're looking for a design that supports something like:
Does this look right? Are there any other possible use cases? |
In terms of the API, we have 4 options:
Any preferences? |
So you might want to create the "new dataset" over the intersection or union of the two sub-datasets, then separately, you might want to handle how to combine the data from the 2 differently right:
I think 3 or 4 would be more flexible right? |
3 and 4 are definitely more flexible. I can't currently think of a case where you would want to perform a union but stack the images, or perform an intersection but merge the images, but I'm sure someone will have a reason to do so. I'm actually starting to think that it might make more sense to have a |
Thinking about this more, I think options 1 or 2 are the right way to go. Going back to the above table, use cases 1 and 2 can be handled by IntersectionDataset, while case 3 can be handled by UnionDataset. These two classes are very different. In IntersectionDataset, the resulting index will only contain bounding boxes with overlap in all datasets. At sampling time, there will never be a case where we don't want to stack images from all datasets (that's the whole point of combining them). In UnionDataset, the resulting index will simply be the combination of all dataset indices. We'll store the pointer back to the dataset from which each bounding box comes, then sample only from that dataset. In this case, there's nothing to stack since there's only a single result for each query. If anyone can think of a use case that does not fit one of the 3 above, let me know. Otherwise, I'll continue with options 1 or 2. |
Then option 1 makes the most sense to me for naming. I can't think of a case where you'd want a UnionDataset where the sub-datasets have spatial overlap and you'd want to extract joint samples from that area. Can IntersectionDataset know where to put the data from each component dataset based on some property of that Dataset? E.g. I want to do something like:
then get appropriate samples out of it. If multiple sub-datasets ask to be put in "image" then they will be stacked. |
Oooh, that's a really interesting idea, instead of supporting For the key question, would a transform that renames the key handle your use case? |
Yes, but that seems unnecessarily complicated. RasterDataset already has |
In my mind, there are several different reasons that someone might want to combine two GeoDatasets:
Right now, ZipDataset is designed to exclusively handle case 1. Case 2 doesn't work because the "image" key gets replaced instead of merged or concatenated. Case 3 doesn't work because we explicitly check for overlap between datasets.
We need to think about whether it is possible to support all possible use cases, whether there are any additional use cases, and how to implement this support. Ideally, these could all be wrapped into
ZipDataset
so that addition handles everything. Hopefully we don't need to add an additional ABC forMergeDataset
or something like that.The text was updated successfully, but these errors were encountered: