Forest Carbon: Global model vector clipping and reprojection #1875
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Fixes #1689
Background:
Enabling
gdal.UseExceptions
led to the discovery that invalid geometries were being created during reprojection of the Forest Carbon "Global Regression Models" vector (forest_carbon_edge_regression_model_parameters.shp
). This was happening even with partial reprojection enabled. Digging into it, it seems partial reprojection doesn't necessarily remove entire geometries, just points that can't be reprojected; as such, you can end up with geometries that have insufficient points to be valid polygons.Based on recommendations from @emlys and @phargogh, this PR does away with partial reprojection and instead clips the global vector prior to reprojection, so that only those polygons relevant to the model run need be reprojected. Clipping is based on the LULC raster's bounding box, plus a buffer (explained below).
While this model does include an AIO input, it's optional and is only used to calculate aggregate statistics at the end of the model run. For consistency, I'm ignoring it for the purposes of clipping.
Bounding Box Buffering:
The Global Regression Model vector data is used to construct a kd-tree. When querying the kd-tree, we set a
distance_upper_bound
. To ensure the clipped vector will contain all of the relevant data to the querying process, I'm buffering the LULC bounding box by the sameDISTANCE_UPPER_BOUND
value (currently set to 500km).I'm also keeping all polygons that intersect with the clipping polygon, rather than using
gdal.Layer.Clip()
, in order to preserve the shape of the polygons (this is important because we use the centroid when constructing the kd-tree).To summarize my clipping methodology:
DISTANCE_UPPER_BOUND
_build_spatial_index
, to be reprojected and used to create the kd-treeAdditional notes:
The User's Guide says, "Note that all spatial inputs must be in the same projected coordinate system and in linear meter units." However, we weren't validating the units within the model. I've updated the model spec to include
"projection_units": u.meter
for validation purposes.Digging into this model, I also noticed a contradiction. The user's guide says, related to the "Number Of Points To Average," (in "How it works"):
"The user can designate the number of local models used in the interpolation scheme which is defaulted to 10 but can range anywhere from 1 (only closest point) to 2635 (every regression model on the planet)."
However, within the code itself, we specify the aforementioned
DISTANCE_UPPER_BOUND
of 500km that we use when querying the kd-tree (there's a note that reads, "grid cells are 100km. Becky says 500km is a good upper bound to search"). We may want to update the User's Guide to reflect this, especially now that we're using thatDISTANCE_UPPER_BOUND
as a buffer when clipping the global vector to the LULC raster's bounding box.Checklist