Strategy for storing and querying dataset H3 indices

Each pre-indexed dataset has a corresponding h3.parquet and metadata.json file.

The metadata.json contains the fields to be inserted into the datasets table and into an Elasticsearch index of the same name.

The h3.parquet contains the uncompacted* h3 indices at resolution 8 representing the geographic coverage of the geospatial dataset.

* See https://h3geo.org/docs/highlights/indexing/ for an explanation of what uncompacted/compacted mean.
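
For illustration, here is a minimal sketch of how such resolution-8 coverage could be produced with h3-js; the footprint polygon and variable names are placeholders, not the actual ingestion code.

```ts
import { polygonToCells } from "h3-js";

// Placeholder GeoJSON ring ([lng, lat]) standing in for a dataset's footprint.
const datasetFootprint = [
  [120.9, 14.5],
  [121.1, 14.5],
  [121.1, 14.7],
  [120.9, 14.7],
  [120.9, 14.5],
];

// Uncompacted resolution-8 cells covering the footprint - roughly what a
// dataset's h3.parquet holds.
const res8Cells: string[] = polygonToCells([datasetFootprint], 8, true);
```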

Storage

Ingesting the metadata from file to db/index is straightforward, a more or less 1-1 mapping of fields. The h3 indices, on the other hand, cannot easily be stored as-is without the table/dataset size inflating quickly as the number of indexed datasets grows.

Instead, we store a compacted version to greatly reduce the number of h3 indices stored, especially for datasets covering large "uninterrupted" swathes (think national boundaries).
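
A minimal sketch of that compaction step, assuming the uncompacted resolution-8 cells are already in memory (e.g. read from h3.parquet):

```ts
import { compactCells } from "h3-js";

// res8Cells: a dataset's uncompacted resolution-8 coverage.
export function compactCoverage(res8Cells: string[]): string[] {
  // compactCells repeatedly replaces any complete set of 7 sibling cells with
  // their parent, so large contiguous areas collapse into a few coarse cells.
  return compactCells(res8Cells);
}
```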

We also implement a form of caching: for each of a dataset's compacted tiles, we store all of its parents up to the lowest/coarsest resolution, deduplicating parent tiles shared by multiple children. These act as "lookup" tiles for determining whether a dataset has at least one child under a given tile, i.e. whether it has coverage there. We explain later why this is important.
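
A sketch of how those lookup tiles could be derived with h3-js (assuming resolution 0 is the coarsest level stored; the function name is ours):

```ts
import { cellToParent, getResolution } from "h3-js";

// Derive the deduplicated lookup tiles for one dataset from its compacted coverage.
export function lookupTiles(compactedCells: string[]): Set<string> {
  const lookups = new Set<string>(); // the Set dedupes parents shared by siblings
  for (const cell of compactedCells) {
    for (let res = getResolution(cell) - 1; res >= 0; res--) {
      lookups.add(cellToParent(cell, res));
    }
  }
  return lookups;
}
```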

There is a yet-to-be-deployed optimization that also deduplicates datasets that have exactly the same h3 tile coverage. There are a lot of these among national datasets - duplicates at the country level (geography dimension) and even at the yearly level (time dimension) - which quickly add up to multiple gigabytes of storage.
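
Since this optimization is not deployed yet, the following is only a hypothetical sketch of one way to do it: fingerprint each dataset's compacted coverage and store the cell set once per fingerprint.

```ts
import { createHash } from "node:crypto";

// Datasets with identical compacted coverage produce identical fingerprints,
// so their cell sets could be stored once and referenced by hash.
export function coverageFingerprint(compactedCells: string[]): string {
  const canonical = [...compactedCells].sort().join(",");
  return createHash("sha256").update(canonical).digest("hex");
}
```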

Querying

For lower zoom levels (zoomed "out"), we aggregate dataset counts per lower-resolution cell.

And vice versa for higher zoom levels (zoomed "in") and higher-resolution cells. The mapping between zoom levels and h3 resolutions can be found in constants/h3.ts. It is somewhat arbitrary and was tuned during development based roughly on visual appearance and performance.

We batch query dataset counts in tiles

One batch corresponds to an OSM/slippy map tile in z/x/y notation. Note that the h3 resolution we aggregate on is mapped from the OSM z-index, not from the actual zoom level of the web map; it is easy to confuse the two.

This is an important distinction because the map zoom level changes continuously (and can even take decimal values), whereas OSM z-indices are strictly integers.

Another thing to note is that we map z-indices to h3 resolutions in a stepped manner: we do not need to load a different resolution for every change in z-index, so several consecutive z-indices can share the same resolution.

https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Zoom_levels
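
To make the stepped mapping concrete, a sketch of what it might look like is shown below; the numbers here are placeholders, and the real mapping lives in constants/h3.ts.

```ts
// Hypothetical z-index -> h3 resolution mapping. Several consecutive integer
// z-indices share the same resolution, which is what "stepped" means above.
const Z_TO_H3_RESOLUTION: Record<number, number> = {
  0: 1, 1: 1, 2: 1,
  3: 2, 4: 2,
  5: 3, 6: 3,
  7: 4, 8: 5,
  9: 6, 10: 6,
  11: 7, 12: 8, // resolution 8 is the finest we store
};

export function h3ResolutionForZ(z: number): number {
  // Clamp to the configured range; anything deeper than z=12 still uses
  // the finest stored resolution.
  return Z_TO_H3_RESOLUTION[Math.min(Math.max(z, 0), 12)];
}
```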

So what is the actual querying strategy?

Per OSM tile, we use an h3 geospatial function to fill that rectangle with h3 indices of a given resolution. We refer to these as "fill" tiles from here on.
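
One way to do this fill step, sketched here with h3-js's polygonToCells (the actual implementation may use a database-side h3 function instead); the z/x/y-to-bounding-box math follows the slippy map page linked above, and the function names are ours:

```ts
import { polygonToCells } from "h3-js";

// Bounding box of an OSM z/x/y tile, in degrees.
function tileToBBox(z: number, x: number, y: number) {
  const n = 2 ** z;
  const lonW = (x / n) * 360 - 180;
  const lonE = ((x + 1) / n) * 360 - 180;
  const latN = (Math.atan(Math.sinh(Math.PI * (1 - (2 * y) / n))) * 180) / Math.PI;
  const latS = (Math.atan(Math.sinh(Math.PI * (1 - (2 * (y + 1)) / n))) * 180) / Math.PI;
  return { lonW, lonE, latN, latS };
}

// "Fill" tiles: the h3 cells of the requested resolution covering the OSM tile.
export function fillTiles(z: number, x: number, y: number, res: number): string[] {
  const { lonW, lonE, latN, latS } = tileToBBox(z, x, y);
  return polygonToCells(
    [
      [latN, lonW],
      [latN, lonE],
      [latS, lonE],
      [latS, lonW],
    ],
    res
  );
}
```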

Per fill tile, we create an intermediate table (via a CTE) containing the fill tile itself and all of its parent tiles (up to the lowest resolution). We then JOIN it with the datasets' h3 tiles on equality with any of those parents or with the fill tile itself.

This captures all dataset tiles that are of the same or lower resolution as the fill tiles.
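
A sketch of what such a query could look like, assuming a Postgres-style database and a hypothetical dataset_cells(dataset_id, h3_index) table holding the compacted cells as hex strings:

```ts
import { cellToParent, getResolution } from "h3-js";

// Build a per-fill-tile query: the CTE holds the fill tile and all of its
// parents, and the JOIN is a plain string-equality match.
export function fillTileMatchQuery(fillCell: string): { sql: string; params: string[] } {
  const selfAndParents = [fillCell];
  for (let res = getResolution(fillCell) - 1; res >= 0; res--) {
    selfAndParents.push(cellToParent(fillCell, res));
  }
  const valuesList = selfAndParents.map((_, i) => `($${i + 1})`).join(", ");
  const sql = `
    WITH fill_parents (h3_index) AS (VALUES ${valuesList})
    SELECT DISTINCT dc.dataset_id
    FROM fill_parents fp
    JOIN dataset_cells dc ON dc.h3_index = fp.h3_index
  `;
  return { sql, params: selfAndParents };
}
```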

But what about tiles that are of higher resolutions?

We use a precomputed "lookup" table whose "lookup" tiles tell us whether a particular dataset has any compacted tiles that are children of the lookup tile. Instead of running expensive WHERE EXISTS queries, we can simply join the fill tiles with the lookup tiles to determine whether the former contain any dataset tiles of higher resolution.
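
Continuing the sketch above, with a hypothetical dataset_lookup(dataset_id, h3_index) table holding the precomputed lookup tiles, the higher-resolution case is just another equality join:

```ts
// $1..$3 stand in for the fill tiles of one OSM tile, at the resolution chosen
// for its z-index; any match means the dataset has compacted cells finer than
// that fill tile.
const higherResolutionMatchSql = `
  WITH fill_cells (h3_index) AS (VALUES ($1), ($2), ($3))
  SELECT fc.h3_index, COUNT(DISTINCT dl.dataset_id) AS dataset_count
  FROM fill_cells fc
  JOIN dataset_lookup dl ON dl.h3_index = fc.h3_index
  GROUP BY fc.h3_index
`;
```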

With h3, aggregating the geospatial datasets contained by these tiles does not need to be performed with geospatial functions. They are simply joins based on number equality (h3 tiles are just 64-bit integers, commonly represented as a 15-16 character hexadecimal string, e.g. 8928308280fffff).