In general, converting STAC to GeoParquet runs into schema inference issues, because GeoParquet needs a strict schema while STAC can have a much looser schema, or a schema that changes per row.
The current Arrow-based conversion offers two alternatives: either the user supplies a full Arrow schema up front, or the library makes a full pass over the input data to infer a unified schema before converting anything.
Instead, in chatting with @bitner, we realized that we could improve on these two approaches by leveraging the knowledge that we're working with STAC spec objects. As long as the user knows which extensions are included in a collection, stac-geoparquet can pre-define the maximal Arrow schema defined by the STAC Item specification. This allows for minimal work by the end user while enabling streaming conversions of JSON data into GeoParquet.
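As a rough illustration of what a pre-defined schema buys us, here's a small pyarrow sketch. The field names and types below are placeholders, not the actual stac-geoparquet schema; the point is just that with every column declared up front, no inference pass over the data is needed:

```python
from datetime import datetime, timezone

import pyarrow as pa

# Sketch only: a hand-written Arrow schema covering a few core STAC Item
# fields. The real pre-defined schema would cover the full STAC Item spec
# plus whichever extensions the user declares.
CORE_ITEM_SCHEMA = pa.schema(
    [
        ("id", pa.string()),
        ("collection", pa.string()),
        ("datetime", pa.timestamp("us", tz="UTC")),
        ("eo:cloud_cover", pa.float64()),
    ]
)

# Items that omit a declared field simply get nulls, so batches can be
# converted (and written out) as they stream in, without a schema-discovery pass.
items = [
    {
        "id": "item-1",
        "collection": "c1",
        "datetime": datetime(2021, 1, 1, tzinfo=timezone.utc),
    },
    {"id": "item-2", "collection": "c1", "eo:cloud_cover": 12.5},
]
table = pa.Table.from_pylist(items, schema=CORE_ITEM_SCHEMA)
```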
To avoid the user needing to know the full set of asset names, we define assets under a `Map` type, which has pros and cons as noted in #48. In particular, it's not possible to statically infer the asset key names from the Parquet schema when using a `Map` type, and it's also not possible to access data for only a single asset without downloading data for every asset. For example, if you wanted to know the `red` asset's href, you'd have to download the hrefs for all assets, whereas a struct type would let you access only the `red` href column.

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it lets us quickly find the full set of asset names.
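For a concrete picture of the trade-off, here's a small pyarrow sketch of the two layouts. The asset value fields (`href`, `type`) are illustrative, not the full STAC asset object:

```python
import pyarrow as pa

asset_value = pa.struct([("href", pa.string()), ("type", pa.string())])

# Map layout: asset names are stored as data, so one schema fits any
# collection, but the key names can't be read from the Parquet schema and a
# reader must fetch every asset entry even if it only wants "red".
map_assets = pa.field("assets", pa.map_(pa.string(), asset_value))

# Struct layout: each asset name is its own nested column, so a single
# asset's href can be projected on its own, but the full set of asset names
# must be known before the schema can be written.
struct_assets = pa.field(
    "assets", pa.struct([("red", asset_value), ("nir", asset_value)])
)
```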
So this scalable STAC ingestion would become a two-step process (see the sketch below):

1. Stream the input JSON items into a Map-based GeoParquet file using the pre-defined schema.
2. Rewrite that intermediate file into the final struct-based asset layout once the full set of asset names is known.

The second step becomes much, much easier when it happens after the first, instead of trying to start directly from JSON files.
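Here's a rough sketch of what the second step could look like, assuming the first step already wrote a Map-based file. The `stage1.parquet` path and the asset value fields are hypothetical, used only for illustration:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read only the Map-typed assets column from the stage-1 file.
assets = pq.read_table("stage1.parquet", columns=["assets"])["assets"]

# Each chunk is a MapArray; its .keys property exposes all map keys, so the
# distinct keys across the file give the full set of asset names.
all_keys = pa.concat_arrays([chunk.keys for chunk in assets.chunks])
asset_names = pc.unique(all_keys).to_pylist()

# Those names are exactly what's needed to build the final struct-typed
# assets column for the rewrite pass.
asset_value = pa.struct([("href", pa.string()), ("type", pa.string())])
struct_assets = pa.struct(
    [(name, asset_value) for name in sorted(asset_names)]
)
```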
Change list
- Defines partial Arrow schemas (see `PartialSchema`). Note that this requires a certain amount of complexity, because the schema for how we want data to reside in memory is not necessarily the same as the schema used for parsing input dicts.
- This heavily uses `pyarrow.unify_schemas` to be able to work with partial schemas (for the core spec and for each extension).
- This continues the discussion started in #48.
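As a toy illustration of how `pyarrow.unify_schemas` combines partial schemas (the fields shown are placeholders, not the actual partial schemas in this PR):

```python
import pyarrow as pa

# Illustrative only: tiny partial schemas for the core spec and one extension.
core_partial = pa.schema([("id", pa.string()), ("collection", pa.string())])
eo_partial = pa.schema([("eo:cloud_cover", pa.float64())])

# unify_schemas merges fields by name, so each extension only needs to
# declare the fields it adds on top of the core Item schema.
full_schema = pa.unify_schemas([core_partial, eo_partial])
```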