Scalable schema-naive ingestion #49

Draft · wants to merge 1 commit into base: main
Conversation

kylebarron
Collaborator

In general, converting STAC to GeoParquet runs into schema inference issues: GeoParquet needs a strict schema, while STAC allows a much looser one, or even a schema that changes per row.

The current Arrow-based conversion offers two alternative methods:

  • a fully in-memory approach, where schema inference happens in memory. This works, but is limited to datasets that fit in memory at once.
  • requiring the user to provide a full schema for their data. This is a lot of work, and the average user will not know how to construct an Arrow schema that describes their input data.

Instead, in chatting with @bitner, we realized that we could improve on both approaches by leveraging the fact that we're working with STAC spec objects. As long as the user knows which extensions a collection uses, stac-geoparquet can pre-define the maximal Arrow schema implied by the STAC Item specification and those extensions. This keeps the work required of the end user minimal while enabling streaming conversion of JSON data into GeoParquet.
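As a rough illustration (not the exact schemas defined in this PR), a pre-declared partial schema for a few core Item fields could be written directly with pyarrow:

```python
import pyarrow as pa

# Illustrative only: a partial Arrow schema covering a handful of core STAC
# Item fields. The schemas in this PR are defined per spec/extension and
# merged together by the PartialSchema handlers described below.
CORE_ITEM_PARTIAL_SCHEMA = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("stac_version", pa.string()),
        pa.field("collection", pa.string()),
        pa.field("datetime", pa.timestamp("us", tz="UTC")),
        pa.field("bbox", pa.list_(pa.float64())),
    ]
)
```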

To avoid the user needing to know the full set of asset names, we define assets under a Map type, which has pros and cons as noted in #48. In particular, with a Map type it's not possible to statically infer the asset key names from the Parquet schema, nor to read data for a single asset without downloading data for every asset. For example, if you wanted the red asset's href, you'd have to download the hrefs for all assets, whereas a struct type would let you read only the red href column.
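For concreteness, here is a sketch of the two asset encodings in pyarrow (the asset fields and names are illustrative, not the exact types used here):

```python
import pyarrow as pa

# Common per-asset fields, shared by both encodings (illustrative subset).
asset_type = pa.struct([
    pa.field("href", pa.string()),
    pa.field("type", pa.string()),
    pa.field("roles", pa.list_(pa.string())),
])

# Map encoding (used in this PR): asset names are data, so the schema does
# not need to know them up front, but readers cannot project a single asset.
assets_as_map = pa.field("assets", pa.map_(pa.string(), asset_type))

# Struct encoding (spec-compliant output): asset names are schema fields, so
# a column like assets.red.href can be read on its own, but every asset name
# must be known when the schema is built.
assets_as_struct = pa.field("assets", pa.struct([
    pa.field("red", asset_type),
    pa.field("nir", asset_type),
]))
```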

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it would allow us to quickly find the full set of asset names.

So this scalable STAC ingestion would become a two-step process:

  1. Convert STAC to a "flexible schema GeoParquet"
  2. Convert this intermediate Parquet format into STAC-GeoParquet spec-compliant files. This step could also drop any columns that are defined by the spec but not present in any input JSON. (It's easy to tell from the Parquet metadata whether a column is fully null; see the sketch below.)

The second step becomes much, much easier when it starts from the intermediate Parquet file rather than directly from JSON files.
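A minimal sketch of the fully-null-column check, using only the Parquet footer metadata (the function name is hypothetical, and it assumes column statistics were written, which pyarrow does by default):

```python
import pyarrow.parquet as pq


def fully_null_columns(path: str) -> set[str]:
    """Find columns that are null in every row, reading only Parquet metadata."""
    metadata = pq.ParquetFile(path).metadata
    candidates: set[str] | None = None
    for rg_idx in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_idx)
        null_in_group = set()
        for col_idx in range(row_group.num_columns):
            column = row_group.column(col_idx)
            stats = column.statistics
            if (
                stats is not None
                and stats.null_count is not None
                and stats.null_count == row_group.num_rows
            ):
                null_in_group.add(column.path_in_schema)
        # A column is fully null only if it is fully null in every row group.
        candidates = null_in_group if candidates is None else candidates & null_in_group
    return candidates or set()
```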

Change list

  • Implement class-based schema handlers (PartialSchema). Note that this requires a certain amount of complexity, because the schema we want the data to have in memory is not necessarily the same as the schema used for parsing input dicts.
    • Implement "partial" schemas for the core item spec and for several popular extensions
    • This also allows users who have data with a custom extension to implement only the custom fields defined by their extension, instead of creating a full STAC schema from scratch.
  • Test (successfully) with NAIP STAC input from the Planetary Computer

This relies heavily on pyarrow.unify_schemas to work with partial schemas (for the core spec and for each extension).
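Roughly, the merging step looks like this (a sketch with made-up partial schemas, not the exact code in this PR):

```python
import pyarrow as pa

# Each partial schema contributes only the fields it knows about.
core = pa.schema([
    pa.field("id", pa.string()),
    pa.field("datetime", pa.timestamp("us", tz="UTC")),
])
eo_extension = pa.schema([
    pa.field("eo:cloud_cover", pa.float64()),
])
proj_extension = pa.schema([
    pa.field("proj:epsg", pa.int64()),
])

# unify_schemas merges the partial schemas into one schema for the collection.
full_schema = pa.unify_schemas([core, eo_extension, proj_extension])
```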

This continues the discussion started in #48.
