Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PointCloud subclass of Vector #492

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

rhugonnet
Copy link
Member

@rhugonnet rhugonnet commented Feb 16, 2024

This PR adds the PointCloud class to facilitate interfacing with this special sub-type of vector data very common in geospatial analysis.
In short, a point cloud is a subclass of Vector containing only 2D point geometries and associated to a main data column (+ optionally other auxiliary data columns). This main data column represents the values .data of the point cloud, that can be compared to the .data of a raster format (which is impossible for a typical vector), or of that of another point cloud if some of their coordinates are the same, or within close tolerance.

Additionally, all functionalities of Raster using .data that just need coordinates can then be ported to PointCloud (array interface with NumPy, subsampling, binning, zonal stats, variography, etc).

Summary

The idea behind the implementation of the PointCloud class is two-fold:

  1. Substantially improve the interfacing of point clouds with vectors or raster, which are currently based on passing volatile arguments of "coordinates" (that could be defined with many shapes/formats) or returning volative 1D arrays of data values associated with no coordinates: https://geoutils.readthedocs.io/en/stable/raster_vector_point.html#rasterpoint-operations.
  2. Natively support large point cloud files of type LAS that are also used widely, are not supported by GeoPandas (didn't find a single issue on this, not planned at all), which can benefit from the same load mechanism as Raster and support chunked reading.

For this, we add the new optional dependency laspy for reading and writing LAS-type format. And we override some of the behaviour of Vector in PointCloud to allow for a load() functionality.

Details

This PR adds the PointCloud class, including:

  • A new data_column attribute containing the data column name of the point cloud,
  • A new data attribute corresponding to the data of the point cloud,
  • A new all_columns attribute containing all data columns of the point cloud (Vector.columns also contains geometry),
  • A new is_loaded attribute and load() method for when data is read from a LAS file,
  • A new nb_points attribute returning the number of points in the point cloud,
  • A new set of from_array(), from_tuples() and from_xyz() methods to easily create a point cloud from different inputs, and their equivalent to_array(), to_tuples() and to_xyz(),
  • A new pointcloud_equal() method simply checking vector_equal() and that the data_column is equal as well,
  • Wraps the existing grid() functionality as a class method.

The new PointCloud class overrides the attributes ds, crs, bounds and columns to allow loading only metadata, and implicitly loading the point cloud.

This PR updates existing input/output of some functions:

  • Raster.to_pointcloud() now returns a PointCloud instead of a Vector.

Left to-do for this PR

Small changes:

  • Add LAS file and Point-only vector file in geoutils-data to run tests,
  • Update Raster.from_regular_pointcloud() to accept a PointCloud input,
  • Update Raster.interp_points() and Raster.reduce_points() to accept a PointCloud as input for coordinates,
  • Update Vector.create_mask() to accept a PointCloud the manipulation of masking and booleans for point clouds,
  • Mirror every function relying only on adimensional .data (same whether 1D or 2D) to point cloud: subsample(), get_stats() (and function that work on 1D sets of coordinates like variogram(), once moved to GeoUtils).
  • Add __array_interface__ support for point clouds with the same coordinates, simply pointing to that of self.ds[data_column]?
  • Populate documentation page "The georeferenced point cloud", update the "Raster-Point interface" sections, update Quick start and Feature overview, and add a gallery example?

Potential bigger change?

This last change could be to:

  • Allow for arithmetic operations between Raster and PointCloud? (e.g., defaulting to using interp_points at the PointCloud coordinates, the comparison method could be tweaked in geoutils.config),

It could work as:

pc = PointCloud("pc_example.shp")
raster = Raster("rast_example.tif")

# An arithmetic operation between a raster and point cloud would always return a PointCloud
pc_diff = pc - raster

# Users could change the behaviour in the config
geoutils.config["raster_point.comparison_method"] = "reduce_points"
geoutils.config["raster_point.resampling"] = "cubic"

The only problem I see is that interpolation is not ideal for a point cloud much denser than a raster (e.g., a lidar point cloud). In that case, it's better to grid the point cloud into a raster, and compare the two rasters. But that is computationally-intensive, so happening under-the-hood is not ideal.
We could have a criteria based on point cloud density relative to the raster size:

  • Above a certain density, interpolate raster at the points; below that, grid the points,
  • Or always follow the same behaviour, but raise a warning if the point density is not ideal?

Or we do not add such a functionality and always leave this to the user, which remains fairly short as long as we support Raster/Raster arithmetic and PointCloud/PointCloud arithmetic:

rast_from_pc = pc.grid(rast)
my_op = (rast + rast_from_pc) / 4

pc_from_rast = rast.interp_points(pc)
my_op = (pc_from_rast + pc) / 4

And add a public function to calculate relative point density to help users decide when need be?

Future functionalities

Resolves #463

@rhugonnet rhugonnet marked this pull request as ready for review December 6, 2024 23:40
@rhugonnet
Copy link
Member Author

@adehecq @atedstone @erikmannerfelt This PR is now advanced enough to get your feedback! 😊
I think it's good we discuss it now before moving further forward, documenting and finalizing, to be sure we agree on all the concepts and behaviour introduced! I updated the description above to a good level of detail, which should allow you to grasp the idea behind this new object 😉.

All tests passing locally with a LAS file I have. Will need to add one to geoutils-data for CI to pass.

@atedstone
Copy link
Member

@rhugonnet , this looks exceptionally well thought through. I read your descriptions and the changed files,but have not tried out the functionality myself as I don't immediately have files lying around that would be a good test case. But from my side, this looks good to progress to finishing the outstanding tasks you listed.

Concerning the bigger possible change about arithmetic ops between Raster and PointCloud: my instinct would be to require the user to decide, so that the comparison is explicit and clear.

@@ -2,10 +2,12 @@
GeoUtils is a Python package for the analysis of geospatial data.
"""

from geoutils import examples, pointcloud, projtools, raster, vector # noqa
from geoutils import examples, projtools # noqa
Copy link
Member

@adehecq adehecq Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't want to load the raster submodule explicitly anymore? Is there nothing else to get apart from the Raster and Mask classes? (I'm not fully familiar with the new structure yet)
In any case, it is a bit strange that the change occurs with this PR and not in the restructuring one. But I understand that we may have forgotten this earlier.

@@ -28,6 +28,7 @@
from rasterio.enums import Resampling
from rasterio.plot import show as rshow

import geoutils as gu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it not lead to a circular import?? Would from geoutils import Vector be better?

from rasterio.coords import BoundingBox
from shapely.geometry.base import BaseGeometry

import geoutils as gu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

from geoutils._typing import ArrayLike, NDArrayNum, Number
from geoutils.interface.gridding import _grid_pointcloud

# from geoutils.raster.sampling import subsample_array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to delete maybe?

return crs, nb_points, bounds, columns_names


# def _write_laspy(filename: str, pc: gpd.GeoDataFrame):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this commented? Not working?

self._bounds: BoundingBox
self._data_column: str
self._data: NDArrayNum
self._nb_points: int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively I would have called this "count", which I think is often used by other tools. But this may be ambiguous with the Raster.count which is about band count.

self._name: str | None = None
self._crs: CRS | None = None
self._bounds: BoundingBox
self._data_column: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general note. I don't find the names "data_column" and "all_columns" very intuitive. How did you pick the names? Hence I am not sure to understand what is which, even with the PR description. This will most likely be clearer after reading the code, but it should be fairly intuitive to the user without searching through doc or code.

I don't know what is the most used term for this.
In OGR this was named "fields" and similarly in QGIS. So in the old geoutils, we stuck to this term.
Would it also not be easier to save a "fields" attribute as a panda dataframe, so the column names and data can be retrieved from one object instead of two?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looking at the code below it is a bit more clear the meaning of the "columns" attributes. I understand that it originates from Pandas columns.
I also realize that in the Vector class (and so in PointCloud too), the field attributes are not directly accessible from the Vector object, but from Vector.ds only. There is also no equivalent to the "all_columns" attribute in Vector.
Sorry I don't have a perfect solution to give, but maybe we should discuss this more and make sure everything is clear and consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another remark. If we assume that a pointcloud is always x, y, z, why. not call "data_column" the "z_column"?

if self.is_loaded:
return super().crs
else:
return self._crs
Copy link
Member

@adehecq adehecq Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we can't always use self._crs? Same for bounds below.


if new_data_column not in self.all_columns:
raise ValueError(
f"Data column {new_data_column} not found among columns, available columns "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a break. "not found among columns, available columns..." -> "not found among columns. Available columns..."

resampling: Literal["nearest", "linear", "cubic"],
dist_nodata_pixel: float = 1.0,
) -> gu.Raster:
"""Grid point cloud into a raster."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the arguments should be defined, no?


return gu.Raster.from_array(data=array, transform=transform, crs=self.crs, nodata=None)

# def subsample(self, subsample: float | int, random_state: int | np.random.Generator | None = None) -> PointCloud:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not working?

return self.ds.geometry

@property
def columns(self) -> pd.Index:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it not be consistent with the PointCloud class, which uses all_columns?

with pytest.raises(ValueError, match="Data column column_that_does_not_exist not found*"):
pc.set_data_column("column_that_does_not_exist")

def test_pointcloud_equal(self) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is empty 😉

def test_from_tuples(self) -> None:
"""Test building point cloud from list of tuples."""

pc1 = PointCloud(self.gdf1, data_column="b1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact that you have to specify the data column each time makes me think. By default, if there is only one column, should it not pick that one?

Copy link
Member

@adehecq adehecq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very neat and useful addition to the existing set of classes, thanks a lot !!
Overall, this is well thought and I agree with most aspects. I only have some concerns about the attribute names "data_column" and "all_columns" (see my specific comment about it).
I will reply to specific questions separately.

@adehecq
Copy link
Member

adehecq commented Feb 6, 2025

Left to-do for this PR

Small changes:

Is this planned for a future PR I guess? Don't forget to turn this into a new issue in that case.

@adehecq
Copy link
Member

adehecq commented Feb 6, 2025

Or we do not add such a functionality and always leave this to the user, which remains fairly short as long as we support Raster/Raster arithmetic and PointCloud/PointCloud arithmetic:

Regarding the arithmetic operations, I would make sure they work within objects of the same class, but not within objects of different classes. Otherwise, this would force us to make some implicit choices and we don't want "naive" users to do things without understanding it. Users need to understand what it implies to take the difference between a DEM and PC. I think we should only facilitate these operations, and as you show, if it takes only 2 lines to do it, that's perfectly fine. It should just be documented so that the users find the steps easily.
It is the same question as to allow differencing two Rasters that are on a different grid. We could make some implicit assumptions and allow it, but we don't.

@adehecq
Copy link
Member

adehecq commented Feb 6, 2025

  • Add pointcloud to pointcloud comparison with non-equal coordinates by grouping closest points within a certain tolerance,

  • Add other common gridding/interpolation methods for points, or for raster-point comparison: IDW, kriging.

  • Mirror the PointCloud class in a GeoPandas accessor (technically Pandas accessor, see here:

All of this would be neat, especially the IDW and kriging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a PointCloud subclass of Vector to consistently deal with points?
3 participants