Add PointCloud subclass of Vector #492
base: main
Conversation
@adehecq @atedstone @erikmannerfelt This PR is now advanced enough to get your feedback! 😊 All tests passing locally with a LAS file I have. Will need to add one to `geoutils-data`.
@rhugonnet, this looks exceptionally well thought through. I read your descriptions and the changed files, but have not tried out the functionality myself as I don't immediately have files lying around that would be a good test case. But from my side, this looks good to progress to finishing the outstanding tasks you listed. Concerning the bigger possible change about arithmetic ops between Raster and PointCloud: my instinct would be to require the user to decide, so that the comparison is explicit and clear.
@@ -2,10 +2,12 @@
GeoUtils is a Python package for the analysis of geospatial data.
"""

from geoutils import examples, pointcloud, projtools, raster, vector # noqa
from geoutils import examples, projtools # noqa
So we don't want to load the raster submodule explicitly anymore? Is there nothing else to get apart from the Raster and Mask classes? (I'm not fully familiar with the new structure yet)
In any case, it is a bit strange that the change occurs with this PR and not in the restructuring one. But I understand that we may have forgotten this earlier.
@@ -28,6 +28,7 @@
from rasterio.enums import Resampling
from rasterio.plot import show as rshow

import geoutils as gu
Does it not lead to a circular import?? Would `from geoutils import Vector` be better?
from rasterio.coords import BoundingBox
from shapely.geometry.base import BaseGeometry

import geoutils as gu
same comment as above
from geoutils._typing import ArrayLike, NDArrayNum, Number
from geoutils.interface.gridding import _grid_pointcloud

# from geoutils.raster.sampling import subsample_array
to delete maybe?
return crs, nb_points, bounds, columns_names


# def _write_laspy(filename: str, pc: gpd.GeoDataFrame):
why is this commented? Not working?
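For reference, a minimal sketch of what such a LAS-writing helper could look like with laspy 2.x (the function name `_write_laspy` comes from the diff above; the body below is an assumption for illustration, not the PR's implementation):

```python
import geopandas as gpd
import laspy
import numpy as np


def _write_laspy(filename: str, pc: gpd.GeoDataFrame, data_column: str = "Z") -> None:
    """Hypothetical sketch: write a point cloud GeoDataFrame to a LAS file."""
    # Create an empty LAS file structure with a standard point format
    header = laspy.LasHeader(point_format=3, version="1.2")
    las = laspy.LasData(header)
    # Fill coordinates from the point geometries and values from the data column
    las.x = pc.geometry.x.values
    las.y = pc.geometry.y.values
    las.z = np.asarray(pc[data_column])
    las.write(filename)
```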
self._bounds: BoundingBox
self._data_column: str
self._data: NDArrayNum
self._nb_points: int
Intuitively I would have called this "count", which I think is often used by other tools. But this may be ambiguous with `Raster.count`, which is about band count.
self._name: str | None = None
self._crs: CRS | None = None
self._bounds: BoundingBox
self._data_column: str
Just a general note. I don't find the names "data_column" and "all_columns" very intuitive. How did you pick the names? I am not sure I understand which is which, even with the PR description. This will most likely be clearer after reading the code, but it should be fairly intuitive to the user without searching through doc or code.
I don't know what is the most used term for this.
In OGR this was named "fields" and similarly in QGIS. So in the old geoutils, we stuck to this term.
Would it also not be easier to save a "fields" attribute as a pandas DataFrame, so the column names and data can be retrieved from one object instead of two?
Ok, looking at the code below, the meaning of the "columns" attributes is a bit clearer. I understand that it originates from Pandas columns.
I also realize that in the Vector class (and so in PointCloud too), the field attributes are not directly accessible from the Vector object, but from `Vector.ds` only. There is also no equivalent to the "all_columns" attribute in Vector.
Sorry I don't have a perfect solution to give, but maybe we should discuss this more and make sure everything is clear and consistent.
Another remark. If we assume that a point cloud is always x, y, z, why not call "data_column" the "z_column"?
if self.is_loaded:
    return super().crs
else:
    return self._crs
I don't understand why we can't always use `self._crs`? Same for bounds below.
if new_data_column not in self.all_columns:
    raise ValueError(
        f"Data column {new_data_column} not found among columns, available columns "
Maybe add a break. "not found among columns, available columns..." -> "not found among columns. Available columns..."
    resampling: Literal["nearest", "linear", "cubic"],
    dist_nodata_pixel: float = 1.0,
) -> gu.Raster:
    """Grid point cloud into a raster."""
I think the arguments should be defined, no?
return gu.Raster.from_array(data=array, transform=transform, crs=self.crs, nodata=None)

# def subsample(self, subsample: float | int, random_state: int | np.random.Generator | None = None) -> PointCloud:
not working?
return self.ds.geometry

@property
def columns(self) -> pd.Index:
Should it not be consistent with the `PointCloud` class, which uses `all_columns`?
with pytest.raises(ValueError, match="Data column column_that_does_not_exist not found*"):
    pc.set_data_column("column_that_does_not_exist")

def test_pointcloud_equal(self) -> None:
This test is empty 😉
def test_from_tuples(self) -> None:
    """Test building point cloud from list of tuples."""

    pc1 = PointCloud(self.gdf1, data_column="b1")
The fact that you have to specify the data column each time makes me think: by default, if there is only one column, should it not pick that one?
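A possible way to implement that default (a hypothetical sketch, not part of the PR; the helper name is made up):

```python
import geopandas as gpd


def _default_data_column(gdf: gpd.GeoDataFrame, data_column: str | None = None) -> str:
    """Hypothetical helper: pick the single non-geometry column if none is specified."""
    if data_column is not None:
        return data_column
    non_geom = [c for c in gdf.columns if c != "geometry"]
    if len(non_geom) == 1:
        return non_geom[0]
    raise ValueError("Several data columns found; 'data_column' must be specified.")
```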
That is a very neat and useful addition to the existing set of classes, thanks a lot!!
Overall, this is well thought out and I agree with most aspects. I only have some concerns about the attribute names "data_column" and "all_columns" (see my specific comment about it).
I will reply to specific questions separately.
This is planned for a future PR, I guess? Don't forget to turn this into a new issue in that case.
Regarding the arithmetic operations, I would make sure they work between objects of the same class, but not between objects of different classes. Otherwise, this would force us to make some implicit choices, and we don't want "naive" users to do things without understanding them. Users need to understand what it implies to take the difference between a DEM and a point cloud. I think we should only facilitate these operations and, as you show, if it takes only 2 lines to do it, that's perfectly fine. It should just be documented so that users find the steps easily.
All of this would be neat, especially the IDW and kriging!
This PR adds the `PointCloud` class to facilitate interfacing with this special sub-type of vector data, very common in geospatial analysis.

In short, a point cloud is a subclass of `Vector` containing only 2D point geometries and associated with a main data column (+ optionally other auxiliary data columns). This main data column represents the values `.data` of the point cloud, which can be compared to the `.data` of a raster (which is impossible for a typical vector), or to that of another point cloud if some of their coordinates are the same, or within close tolerance.

Additionally, all functionalities of `Raster` using `.data` that just need coordinates can then be ported to `PointCloud` (array interface with NumPy, subsampling, binning, zonal stats, variography, etc.).

Summary
The idea behind the implementation of the `PointCloud` class is two-fold: have the same implicit `load` mechanism as `Raster`, and support chunked reading.

For this, we add the new optional dependency `laspy` for reading and writing LAS-type formats, and we override some of the behaviour of `Vector` in `PointCloud` to allow for a `load()` functionality.

Details
This PR adds the `PointCloud` class, including:
- a `data_column` attribute containing the data column name of the point cloud,
- a `data` attribute corresponding to the data of the point cloud,
- an `all_columns` attribute containing all data columns of the point cloud (`Vector.columns` also contains `geometry`),
- an `is_loaded` attribute and `load()` method for when data is read from a LAS file,
- a `nb_points` attribute returning the number of points in the point cloud,
- `from_array()`, `from_tuples()` and `from_xyz()` methods to easily create a point cloud from different inputs, and their equivalents `to_array()`, `to_tuples()` and `to_xyz()` (see the sketch below),
- a `pointcloud_equal()` method simply checking `vector_equal()` and that the `data_column` is equal as well,
- a `grid()` functionality as a class method.

The new `PointCloud` class overrides the attributes `ds`, `crs`, `bounds` and `columns` to allow loading only metadata, and implicitly loading the point cloud.
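A minimal usage sketch of these construction methods (the file name, column names and exact `from_xyz()` signature are assumptions for illustration, assuming `PointCloud` is exposed at the package top level like `Raster` and `Vector`):

```python
import numpy as np
import geoutils as gu

# Construction from a LAS file: only metadata is read until the data is needed
pc = gu.PointCloud("points.las", data_column="Z")  # hypothetical file
print(pc.nb_points, pc.data_column, pc.is_loaded)

# Construction from arrays of coordinates and values (signature assumed)
x, y, z = np.random.rand(3, 100)
pc2 = gu.PointCloud.from_xyz((x, y, z), crs=4326, data_column="elevation")

# Grid the point cloud into a raster (arguments assumed)
# rast = pc2.grid(...)
```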
This PR updates the existing input/output of some functions: `Raster.to_pointcloud()` now returns a `PointCloud` instead of a `Vector`.

Left to-do for this PR
Small changes:
- Add a LAS test file to `geoutils-data` to run tests,
- `Raster.from_regular_pointcloud()` to accept a `PointCloud` input,
- `Raster.interp_points()` and `Raster.reduce_points()` to accept a `PointCloud` as input for coordinates,
- `Vector.create_mask()` to accept a `PointCloud`,
- The manipulation of masking and booleans for point clouds,
- Port functionalities of `Raster` using `.data` (same whether 1D or 2D) to point cloud: `subsample()`, `get_stats()` (and functions that work on 1D sets of coordinates like `variogram()`, once moved to GeoUtils),
- `__array_interface__` support for point clouds with the same coordinates, simply pointing to that of `self.ds[data_column]`?

Potential bigger change?
This last change could be to:
- Support arithmetic and comparison between `Raster` and `PointCloud`? (e.g., defaulting to using `interp_points` at the `PointCloud` coordinates; the comparison method could be tweaked in `geoutils.config`).

It could work as:
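A hypothetical sketch of what it could look like from the user side (the implicit behaviour, defaults and config key are assumptions, not something implemented in this PR):

```python
import geoutils as gu

dem = gu.Raster("dem.tif")                          # hypothetical file
pc = gu.PointCloud("points.las", data_column="Z")   # hypothetical file

# Implicit comparison: the raster would be interpolated at the point cloud
# coordinates under the hood, returning a new point cloud of differences
dh = pc - dem

# The comparison method could be tweaked through the global configuration,
# e.g. (hypothetical key name):
# gu.config["raster_pointcloud_interp_method"] = "linear"
```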
The only problem I see is that interpolation is not ideal for a point cloud much denser than a raster (e.g., a lidar point cloud). In that case, it's better to grid the point cloud into a raster and compare the two rasters. But that is computationally intensive, so having it happen under the hood is not ideal.
We could have a criterion based on point cloud density relative to the raster size:
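For instance, a hypothetical criterion could compare the number of points to the number of raster pixels (the threshold and exact definition below are assumptions):

```python
import geoutils as gu


def _pointcloud_denser_than_raster(pc: "gu.PointCloud", rast: "gu.Raster", threshold: float = 1.0) -> bool:
    """Hypothetical criterion: more than `threshold` points per raster pixel on average."""
    nb_pixels = rast.width * rast.height
    return pc.nb_points / nb_pixels > threshold

# If True, grid the point cloud and compare raster-to-raster;
# otherwise interpolate the raster at the point coordinates.
```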
Or we do not add such a functionality and always leave this to the user, which remains fairly short as long as we support `Raster`/`Raster` arithmetic and `PointCloud`/`PointCloud` arithmetic:
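For instance (a hypothetical sketch; `interp_points()` accepting a `PointCloud`, array arithmetic on point clouds, and `grid()` re-using a raster's grid are items from the to-do list above, not yet implemented):

```python
# Point cloud route: interpolate the raster at the point coordinates, then
# difference the values (relies on the point cloud supporting array arithmetic)
dh_points = pc - dem.interp_points(pc)

# Raster route: grid the point cloud onto the raster's grid, then use
# Raster/Raster arithmetic
dh_raster = pc.grid(dem) - dem
```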
And add a public function to calculate relative point density to help users decide when need be?
Future functionalities
- Wrap the `PointCloud` class in a GeoPandas accessor (technically a Pandas accessor, see here: ENH: Add `register_geo(series|dataframe)_accessor` API (#1947) geopandas/geopandas#1952 (comment)).

Resolves #463