Scalable schema-naive ingestion #49

Draft · wants to merge 1 commit into base: main
Conversation

kylebarron
Collaborator

In general, converting STAC to GeoParquet runs into schema inference issues: GeoParquet needs a strict schema, while STAC allows a much looser one, or even a schema that changes per row.

The current Arrow-based conversion offers two alternative methods:

  • a fully in-memory approach, where schema inference happens in memory. This works, but is limited to datasets that fit in memory at once.
  • requiring the user to provide a full schema for their data. This is a lot of work, and the average user will not know how to construct an Arrow schema that describes their input data.

Instead, in chatting with @bitner, we realized that we could improve on both approaches by leveraging the fact that we're working with STAC spec objects. As long as the user knows which extensions a collection uses, stac-geoparquet can pre-define the maximal Arrow schema implied by the STAC Item specification and those extensions. This keeps the work required of the end user minimal while enabling streaming conversion of JSON data into GeoParquet.
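As a rough illustration (not the exact schemas defined in this PR), a pre-declared partial schema for a few core Item fields could be written directly with pyarrow:

```python
import pyarrow as pa

# Illustrative only: a partial Arrow schema covering a handful of core STAC
# Item fields. The schemas in this PR are defined per spec/extension and
# merged together by the PartialSchema handlers described below.
CORE_ITEM_PARTIAL_SCHEMA = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("stac_version", pa.string()),
        pa.field("collection", pa.string()),
        pa.field("datetime", pa.timestamp("us", tz="UTC")),
        pa.field("bbox", pa.list_(pa.float64())),
    ]
)
```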

To avoid the user needing to know the full set of asset names, we define assets under a Map type, which has pros and cons as noted in #48. In particular, with a Map type it's not possible to statically infer the asset key names from the Parquet schema, nor to read data for a single asset without downloading data for every asset. For example, if you wanted the red asset's href, you'd have to download the hrefs for all assets, whereas a struct type would let you read only the red href column.
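For concreteness, here is a sketch of the two asset encodings in pyarrow (the asset fields and names are illustrative, not the exact types used here):

```python
import pyarrow as pa

# Common per-asset fields, shared by both encodings (illustrative subset).
asset_type = pa.struct([
    pa.field("href", pa.string()),
    pa.field("type", pa.string()),
    pa.field("roles", pa.list_(pa.string())),
])

# Map encoding (used in this PR): asset names are data, so the schema does
# not need to know them up front, but readers cannot project a single asset.
assets_as_map = pa.field("assets", pa.map_(pa.string(), asset_type))

# Struct encoding (spec-compliant output): asset names are schema fields, so
# a column like assets.red.href can be read on its own, but every asset name
# must be known when the schema is built.
assets_as_struct = pa.field("assets", pa.struct([
    pa.field("red", asset_type),
    pa.field("nir", asset_type),
]))
```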

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it would allow us to quickly find the full set of asset names.

So this scalable STAC ingestion would become a two-step process:

  1. Convert STAC to a "flexible schema GeoParquet"
  2. Convert this intermediate Parquet format into STAC-GeoParquet spec-compliant files. This step could also drop any columns that are defined by the spec but not present in any input JSON. (It's easy to tell from the Parquet metadata whether a column is fully null; see the sketch below.)

The second step becomes much, much easier when it starts from the intermediate Parquet file rather than directly from JSON files.
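A minimal sketch of the fully-null-column check, using only the Parquet footer metadata (the function name is hypothetical, and it assumes column statistics were written, which pyarrow does by default):

```python
import pyarrow.parquet as pq


def fully_null_columns(path: str) -> set[str]:
    """Find columns that are null in every row, reading only Parquet metadata."""
    metadata = pq.ParquetFile(path).metadata
    candidates: set[str] | None = None
    for rg_idx in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_idx)
        null_in_group = set()
        for col_idx in range(row_group.num_columns):
            column = row_group.column(col_idx)
            stats = column.statistics
            if (
                stats is not None
                and stats.null_count is not None
                and stats.null_count == row_group.num_rows
            ):
                null_in_group.add(column.path_in_schema)
        # A column is fully null only if it is fully null in every row group.
        candidates = null_in_group if candidates is None else candidates & null_in_group
    return candidates or set()
```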

Change list

  • Implement class-based schema handlers (PartialSchema). Note that this requires a certain amount of complexity, because the schema we want the data to have in memory is not necessarily the same as the schema used for parsing input dicts.
    • Implement "partial" schemas for the core item spec and for several popular extensions
    • This also allows users who have data with a custom extension to implement only the custom fields defined by their extension, instead of creating a full STAC schema from scratch.
  • Test (successfully) with NAIP STAC input from the Planetary Computer

This relies heavily on pyarrow.unify_schemas to work with partial schemas (for the core spec and for each extension).
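Roughly, the merging step looks like this (a sketch with made-up partial schemas, not the exact code in this PR):

```python
import pyarrow as pa

# Each partial schema contributes only the fields it knows about.
core = pa.schema([
    pa.field("id", pa.string()),
    pa.field("datetime", pa.timestamp("us", tz="UTC")),
])
eo_extension = pa.schema([
    pa.field("eo:cloud_cover", pa.float64()),
])
proj_extension = pa.schema([
    pa.field("proj:epsg", pa.int64()),
])

# unify_schemas merges the partial schemas into one schema for the collection.
full_schema = pa.unify_schemas([core, eo_extension, proj_extension])
```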

This continues the discussion started in #48.
