Each pre-indexed datasets have a corresponding h3.parquet
and metadata.json
file.
The metadata.json
contains the fields to be inserted into the datasets
table and an elasticsearch index of the same name.
The h3.parquet
contains the uncompacted* h3 indices at resolution 8 representing the geographic coverage of the geospatial dataset.
* See https://h3geo.org/docs/highlights/indexing/ for what uncompacted/compacted means.
Ingesting the metadata from file to db/index is a straightforward, more or less 1-1 mapping of fields. The h3 indices, on the other hand, cannot easily be stored as is without the table/dataset size inflating quickly as the number of datasets indexed grows.
Instead, we store a compacted version to greatly lessen the number of h3 indices stored especially for datasets covering large "uninterrupted" swathes (think national boundaries).
We also implement some sort of caching - for a given dataset's compacted tile, we store all of its parents up to the lowest/coarsest resolution. For tiles that have the same parent, these parent tiles are deduplicated. These act as "lookup" tiles to determine whether a dataset has at least one child under that tile indicating coverage. Later on why this is important.
There's a yet to be deployed optimization that also deduplicates datasets that have exactly the same h3 tile coverage . There is a lot of these in national datasets - duplicates on a country level (geography dimension) and even on yearly level (time dimension) which quickly add up to multiple gigabytes of storage.
And vice versa for higher zoom levels (zoomed "in") and higher resolution cells. The mapping between zoom levels and h3 resolutions can be found in constants/h3.ts. It is somewhat arbitrary, and was tuned during development based roughly on visual appearance and performance.
One batch corresponds to an OSM/slippy map tile represented in a z/x/y notation. That said, we actually map h3 resolutions to aggregate on with an OSM z-index, and not the actual zoom level of the web map. It can be easy to confuse the two.
This is an important distinction because the map zoom level changes in a continuous manner (and indeed can even have decimal values) whereas the OSM z-index are strictly integers.
Another thing to note is that we map z-indices to h3 resolutions in a stepped manner. We do not need to load a different resolution per change in z-index and thus we can skip some of them.
https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Zoom_levels
Per OSM tile, we use an h3 geospatial function to fill that rectangle with h3 indices of a given resolution. We refer to them as "fill" tiles from hereon.
Per fill tile, we create an intermediate table (via cte) with all of its parent tiles (up to the lowest resolution). We then JOIN
it with the datasets' h3 tiles if there's an equality between them and any of the fill tiles' parents or itself.
This captures all dataset tiles that are of the same or lower resolutions of the fill tiles.
We use a precomputed "lookup" table with "lookup" tiles telling us if a particular dataset has any compacted tile/s that is a child/ren of the lookup tile. Instead of doing expensive WHERE EXISTS
queries, we can simply join the fill tiles with the lookup tiles to determine if the former contains any dataset tiles of higher resolution.
With h3, aggregating geospatial datasets contained by these tiles do not need to be performed with geospatial functions. They are simply joins based on number equality (h3 tiles are just 64-bit integers represented with a 15-16 character hexadecimal string, e.g. 8928308280fffff
)