Add `PointCloud` subclass of `Vector` #492

rhugonnet · 2024-02-16T17:41:23Z

This PR adds the PointCloud class to facilitate interfacing with this special sub-type of vector data very common in geospatial analysis.
In short, a point cloud is a subclass of Vector containing only 2D point geometries and associated to a main data column (+ optionally other auxiliary data columns). This main data column represents the values .data of the point cloud, that can be compared to the .data of a raster format (which is impossible for a typical vector), or of that of another point cloud if some of their coordinates are the same, or within close tolerance.

Additionally, all functionalities of Raster using .data that just need coordinates can then be ported to PointCloud (array interface with NumPy, subsampling, binning, zonal stats, variography, etc).

Summary

The idea behind the implementation of the PointCloud class is two-fold:

Substantially improve the interfacing of point clouds with vectors or raster, which are currently based on passing volatile arguments of "coordinates" (that could be defined with many shapes/formats) or returning volative 1D arrays of data values associated with no coordinates: https://geoutils.readthedocs.io/en/stable/raster_vector_point.html#rasterpoint-operations.
Natively support large point cloud files of type LAS that are also used widely, are not supported by GeoPandas (didn't find a single issue on this, not planned at all), which can benefit from the same load mechanism as Raster and support chunked reading.

For this, we add the new optional dependency laspy for reading and writing LAS-type format. And we override some of the behaviour of Vector in PointCloud to allow for a load() functionality.

Details

This PR adds the PointCloud class, including:

A new data_column attribute containing the data column name of the point cloud,
A new data attribute corresponding to the data of the point cloud,
A new all_columns attribute containing all data columns of the point cloud (Vector.columns also contains geometry),
A new is_loaded attribute and load() method for when data is read from a LAS file,
A new nb_points attribute returning the number of points in the point cloud,
A new set of from_array(), from_tuples() and from_xyz() methods to easily create a point cloud from different inputs, and their equivalent to_array(), to_tuples() and to_xyz(),
A new pointcloud_equal() method simply checking vector_equal() and that the data_column is equal as well,
Wraps the existing grid() functionality as a class method.

The new PointCloud class overrides the attributes ds, crs, bounds and columns to allow loading only metadata, and implicitly loading the point cloud.

This PR updates existing input/output of some functions:

Raster.to_pointcloud() now returns a PointCloud instead of a Vector.

Left to-do for this PR

Small changes:

Add LAS file and Point-only vector file in geoutils-data to run tests,
Update Raster.from_regular_pointcloud() to accept a PointCloud input,
Update Raster.interp_points() and Raster.reduce_points() to accept a PointCloud as input for coordinates,
Update Vector.create_mask() to accept a PointCloud the manipulation of masking and booleans for point clouds,
Mirror every function relying only on adimensional .data (same whether 1D or 2D) to point cloud: subsample(), get_stats() (and function that work on 1D sets of coordinates like variogram(), once moved to GeoUtils).
Add __array_interface__ support for point clouds with the same coordinates, simply pointing to that of self.ds[data_column]?
Populate documentation page "The georeferenced point cloud", update the "Raster-Point interface" sections, update Quick start and Feature overview, and add a gallery example?

Potential bigger change?

This last change could be to:

Allow for arithmetic operations between Raster and PointCloud? (e.g., defaulting to using interp_points at the PointCloud coordinates, the comparison method could be tweaked in geoutils.config),

It could work as:

pc = PointCloud("pc_example.shp")
raster = Raster("rast_example.tif")

# An arithmetic operation between a raster and point cloud would always return a PointCloud
pc_diff = pc - raster

# Users could change the behaviour in the config
geoutils.config["raster_point.comparison_method"] = "reduce_points"
geoutils.config["raster_point.resampling"] = "cubic"

The only problem I see is that interpolation is not ideal for a point cloud much denser than a raster (e.g., a lidar point cloud). In that case, it's better to grid the point cloud into a raster, and compare the two rasters. But that is computationally-intensive, so happening under-the-hood is not ideal.
We could have a criteria based on point cloud density relative to the raster size:

Above a certain density, interpolate raster at the points; below that, grid the points,
Or always follow the same behaviour, but raise a warning if the point density is not ideal?

Or we do not add such a functionality and always leave this to the user, which remains fairly short as long as we support Raster/Raster arithmetic and PointCloud/PointCloud arithmetic:

rast_from_pc = pc.grid(rast)
my_op = (rast + rast_from_pc) / 4

pc_from_rast = rast.interp_points(pc)
my_op = (pc_from_rast + pc) / 4

And add a public function to calculate relative point density to help users decide when need be?

Future functionalities

Add pointcloud to pointcloud comparison with non-equal coordinates by grouping closest points within a certain tolerance,
Add other common gridding/interpolation methods for points, or for raster-point comparison: IDW, kriging.
Mirror the PointCloud class in a GeoPandas accessor (technically Pandas accessor, see here: ENH: Add register_geo(series|dataframe)_accessor API (#1947) geopandas/geopandas#1952 (comment)).

Resolves #463

rhugonnet · 2024-12-06T23:48:35Z

@adehecq @atedstone @erikmannerfelt This PR is now advanced enough to get your feedback! 😊
I think it's good we discuss it now before moving further forward, documenting and finalizing, to be sure we agree on all the concepts and behaviour introduced! I updated the description above to a good level of detail, which should allow you to grasp the idea behind this new object 😉.

All tests passing locally with a LAS file I have. Will need to add one to geoutils-data for CI to pass.

atedstone · 2024-12-10T11:24:40Z

@rhugonnet , this looks exceptionally well thought through. I read your descriptions and the changed files,but have not tried out the functionality myself as I don't immediately have files lying around that would be a good test case. But from my side, this looks good to progress to finishing the outstanding tasks you listed.

Concerning the bigger possible change about arithmetic ops between Raster and PointCloud: my instinct would be to require the user to decide, so that the comparison is explicit and clear.

adehecq · 2025-02-06T08:16:17Z

geoutils/__init__.py

@@ -2,10 +2,12 @@
 GeoUtils is a Python package for the analysis of geospatial data.
 """

-from geoutils import examples, pointcloud, projtools, raster, vector  # noqa
+from geoutils import examples, projtools  # noqa


So we don't want to load the raster submodule explicitly anymore? Is there nothing else to get apart from the Raster and Mask classes? (I'm not fully familiar with the new structure yet)
In any case, it is a bit strange that the change occurs with this PR and not in the restructuring one. But I understand that we may have forgotten this earlier.

adehecq · 2025-02-06T08:19:58Z

geoutils/raster/raster.py

@@ -28,6 +28,7 @@
 from rasterio.enums import Resampling
 from rasterio.plot import show as rshow

+import geoutils as gu


Does it not lead to a circular import?? Would from geoutils import Vector be better?

adehecq · 2025-02-06T08:21:27Z