[WIP] Proposed vector data format and application to LandIQ data #3423

dlebauer · 2025-01-26T00:16:38Z

Description

This PR introduces two new functions to the PEcAn.data.land package:

landiq2std: Processes LandIQ crop map Shapefiles into a standardized format with a GeoPackage + CSV, including mapping crops to PFTS.
shp2gpkg: Converts a Shapefile to a GeoPackage while ensuring spatial data integrity and optional geometry repairs. This is mostly a helper function

Motivation and Context

The goal here is to propose a new format for handling geospatial data with large tables.

This is motivated by the CCMMF project, and the use of the LandIQ crop datasets in particular, but should be generalizable to other workflows that use vector geospatial data.

It is also motivated by the desire to decouple workflows from BETYdb and its associated dependence on Posgres+PostGIS that has often been more of a barrier than originally envisioned.

Other options:

store tables in a relational geospatial database like SQLite or duckdb. A geopackage is the OGC standard alternative to shapefiles and is built on SQLite, to this is a viable alternative. The disadvantages here include
- 1. could get large with lots of data (though that could be solved by using multiple files and
- 1. CSVs are easier to use

Linking CSVs and GPKG

Spatial joins can be slow and we don't want to store geometries in the CSV (they are large as text and that would be redundant).

As proposed the tables are linked by an id generated as a hash by the digest::digest() function based on the geometry. This adds a new dependency (though it is already used in the api).

An alternative to using the hash as an id would be to would be to store lat+lon of the centroid in both the GPKG and all associated CSVs. Then joins could be on lat+lon.

This would have the advantage of allowing some (many) uses of the CSVs independent of spatial files and libraries.

There is a nonzero chance that distinct geometries could have the same centroid (eg if two cells from raster's with different resolution and perhaps other edge cases.

TODO

rename geom --> geometry
incorporate code used to produce 2018-2023 dataset (see /projectnb/dietzelab/malmborg/CARB/CARB_R/*R)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

…dard'

…dded additional assertions.

dlebauer · 2025-01-26T00:48:21Z

modules/data.land/R/shp2gpkg.R

+  # Load the Shapefile
+  shapefile <- sf::st_read(input_shp, quiet = TRUE)
+
+  # Check validity of geometries


@camalmborg this is the logic for reparing geometries. The general approach within pecan is to bring in external data and immediately converted to a common standard. shp2gpkg can be part of that workflow, as it is here.