-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Proposed vector data format and application to LandIQ data #3423
base: develop
Are you sure you want to change the base?
Conversation
# Load the Shapefile | ||
shapefile <- sf::st_read(input_shp, quiet = TRUE) | ||
|
||
# Check validity of geometries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@camalmborg this is the logic for reparing geometries. The general approach within pecan is to bring in external data and immediately converted to a common standard. shp2gpkg can be part of that workflow, as it is here.
#' @param output_gpkg Character. Path to the output GeoPackage | ||
#' @param output_csv Character. Path to the output CSV. | ||
#' | ||
#' @return Invisibly returns a list with paths to the output files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the output paths ever differ from the ones specified as input? Might be more useful to return a success/failure status instead of information the user already has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I also think returning invisibly is more often confusing than useful, but side-effecting functions like this are a place it can work)
#' @importFrom sf st_read st_transform st_centroid st_coordinates st_write | ||
#' @importFrom dplyr mutate select rename filter case_when distinct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove imports and use sf::...
and dplyr::...
instead
shp2gpkg(input_file, output_gpkg, overwrite = TRUE) | ||
input_file <- output_gpkg # now gpkg is input file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems potentially unwanted if something fails later in the landiq2std function and leaves the user wondering why the output file was updated even though the command stopped with an error message. Consider waiting until success?
dplyr::mutate( # generate ids rowwise separately because | ||
# rowwise geospatial operations are very slow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this wording -- is rowwise the slow way or the fast one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like (at least on my machine, with this particular shapefile, right now with the moon in this phase, etc) rowwise is the slow way:
> system.time(dwr |> rowwise() |> mutate(id = digest::digest(geometry, algo = "xxhash64")))
user system elapsed
8.849 0.107 8.951
> system.time(dwr |> mutate(id = digest::digest(geometry, algo = "xxhash64")))
user system elapsed
0.113 0.015 0.128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found the same result as you but borked the comment + implementation
"Cherries", "Almonds", "Plums, Prunes and Apricots", | ||
"Walnuts", "Citrus", "Miscellaneous Deciduous", "Pears", "Olives", | ||
"Apples", "Pistachios", "Bush Berries", "Peaches/Nectarines", | ||
"Miscellaneous Subtropical Fruits", "Pomegranates" | ||
) ~ "woody perennial crop", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another column I don't recognize -- The DWR shapefiles have alphanumeric codes (eg cherry = D3, almond = D12, etc), not cropnames. Am I missing where you merge in the lookup?
(If you have the codes available, I think the woody perennials will likely consist of all of C = "citrus and subtropical", D = "deciduous fruits and nuts", and V = "vineyards", plus T19 = "bush berries", and T28 = "blueberries")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I see your TODO above, but just to make sure: the mapping definitely needs to be made configurable before merging rather than left hardcoded into the function.
dplyr::rename( | ||
source = Source, | ||
notes = Comments | ||
) |> | ||
dplyr::select(id, lat, lon, year, crop, pft, source, notes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is a place I might do the renaming inside the select like ...select(..., source = Source, notes = Comments)
, but this works
#' @details | ||
#' This function reads a Shapefile, converts it to an `sf` object, and writes it to a GeoPackage. | ||
#' The `sf` package handles the conversion and ensures spatial data integrity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I recommend moving details above the parameter listing and removing the @details tag -- that's the layout Roxygen favors and it treats paragraph 1 as the title, paragraph 2 as the description, and paragraphs 3-and-onward as details. Yes, it differs from the layout of an Rd page, but I find it makes it easier to find the parameters when reading in source form.
Description
This PR introduces two new functions to the PEcAn.data.land package:
Motivation and Context
The goal here is to propose a new format for handling geospatial data with large tables.
This is motivated by the CCMMF project, and the use of the LandIQ crop datasets in particular, but should be generalizable to other workflows that use vector geospatial data.
It is also motivated by the desire to decouple workflows from BETYdb and its associated dependence on Posgres+PostGIS that has often been more of a barrier than originally envisioned.
Other options:
Linking CSVs and GPKG
Spatial joins can be slow and we don't want to store geometries in the CSV (they are large as text and that would be redundant).
As proposed the tables are linked by an id generated as a hash by the
digest::digest()
function based on the geometry. This adds a new dependency (though it is already used in the api).An alternative to using the hash as an id would be to would be to store lat+lon of the centroid in both the GPKG and all associated CSVs. Then joins could be on lat+lon.
This would have the advantage of allowing some (many) uses of the CSVs independent of spatial files and libraries.
There is a nonzero chance that distinct geometries could have the same centroid (eg if two cells from raster's with different resolution and perhaps other edge cases.
TODO
Types of changes
Checklist: