Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance for multipolygon contains much worse than for geos #1184

Open
gauteh opened this issue May 18, 2024 · 17 comments
Open

performance for multipolygon contains much worse than for geos #1184

gauteh opened this issue May 18, 2024 · 17 comments

Comments

@gauteh
Copy link

gauteh commented May 18, 2024

Hi,

I'm trying to convert some code for roaring-landmask (https://github.com/gauteh/roaring-landmask) from geos to geo (gauteh/roaring-landmask#27). However, checking if a point is contained by a large multipolygon (with all countries from GSHHG) is far slower with geo than with geos. With geos I used PreparedGeometry, does something similar exist?

geos:

test shapes::tests::benches::test_contains_in_ocean ... bench:       1,586 ns/iter (+/- 45)
test shapes::tests::benches::test_contains_on_land  ... bench:       1,499 ns/iter (+/- 169)

geo:

test shapes::tests::benches::test_contains_in_ocean ... bench:  10,713,374 ns/iter (+/- 356,454)
test shapes::tests::benches::test_contains_on_land  ... bench:   6,709,172 ns/iter (+/- 255,050)
@urschrei
Copy link
Member

No, but you can decompose the multipolygon into polygons and add them to an rstar spatial index: https://docs.rs/rstar/latest/rstar/struct.RTree.html. The rtree is serializable, re-exported by geo (so it should "just work" for bulk loading the multipolygon Vec), and you can associate extra data with each polygon by using https://docs.rs/rstar/latest/rstar/primitives/struct.GeomWithData.html

@gauteh
Copy link
Author

gauteh commented May 18, 2024

Thanks, for the tip. Do you know if that is how geos is doing it?

@urschrei
Copy link
Member

A geos prepared geometry is an rtree behind the scenes, yes. So things like point in multipolygon query the tree transparently.

@gauteh
Copy link
Author

gauteh commented May 19, 2024

I'm trying to add all my Polygons to an RTree, and then check if a point is contained by any of the polygons. But I'm getting this error:

   Compiling roaring-landmask v0.7.3 (/home/gauteh/met/roaring-landmask)
error[E0599]: the method `nearest_neighbor` exists for struct `RTree<Polygon>`, but its trait bounds were not satisfied
  --> src/shapes.rs:74:28
   |
74 |         let po = self.geom.nearest_neighbor(&p);
   |                            ^^^^^^^^^^^^^^^^ method cannot be called on `RTree<Polygon>` due to unsatisfied trait bounds
   |
  ::: /home/gauteh/.cargo/registry/src/index.crates.io-6f17d22bba15001f/geo-types-0.7.13/src/geometry/polygon.rs:71:1
   |
71 | pub struct Polygon<T: CoordNum = f64> {
   | ------------------------------------- doesn't satisfy `geo::Polygon: PointDistance`
   |
   = note: the following trait bounds were not satisfied:
           `geo::Polygon: PointDistance`

gauteh/roaring-landmask@b8e3a21

How can I search through the Polygons and check if the point is contained?

@gauteh
Copy link
Author

gauteh commented May 19, 2024

I think Polygons should implement contains_point, but not PointDistance. Only contains_point and envelope-stuff is necessary for this check? Maybe PointDistance needs to be split.

@gauteh
Copy link
Author

gauteh commented May 19, 2024

I wrapped the polygons and implemented contains (gauteh/roaring-landmask@60bc592)

The performance is better, but still roughly 1000x worse than geos :/

test shapes::tests::benches::test_contains_in_ocean ... bench:   1,403,390 ns/iter (+/- 175,985)
test shapes::tests::benches::test_contains_on_land  ... bench:   1,395,731 ns/iter (+/- 111,216)

@urschrei
Copy link
Member

Can you post a link to your benchmark? I just want to make sure I'm understanding everything here.

@gauteh
Copy link
Author

gauteh commented May 19, 2024

Hi, the benchmark is here: https://github.com/gauteh/roaring-landmask/blob/685ed6f43e944a604545c452273ded4ab901ce42/src/shapes.rs#L201

You run it with: cargo bench shapes --features nightly, requires python at least 3.10. main branch is with geos, the linked PR is with geo.

Thanks

@kylebarron
Copy link
Member

FWIW I believe a GEOS prepared geometry indexes every coordinate pair (Line in geo terminology). This is vastly different than indexing each polygon, especially for large polygons like country/land boundaries. I think what you want is #803.

I think that geo will automatically build in-effect a prepared geometry before doing the intersection, but there's currently no API to allow the user to persist this. So if you're doing repeated intersections I'd expect bad performance compared to GEOS

@gauteh
Copy link
Author

gauteh commented May 21, 2024

Thanks, that seems like something that needs to be done on geo level. It appears to have got some work, but still require some effort?

@gauteh
Copy link
Author

gauteh commented Jun 6, 2024

Would it be enough to persist/cache this intermediate step? Any way to do a quick test if that makes any difference?

@urschrei
Copy link
Member

urschrei commented Jun 6, 2024

You can enable serialise / deserialise support for trees by enabling the serde feature on the rstar dependency. I haven't done it myself, but it should be pretty straightforward using something like https://docs.rs/bincode/latest/bincode/

@gauteh
Copy link
Author

gauteh commented Jun 6, 2024

It's not the rstar, but what is mentioned here: #1184 (comment) . Rstar didn't help enough.

@michaelkirk
Copy link
Member

Any way to do a quick test if that makes any difference?

As discussed in #803, https://github.com/georust/geo/compare/mkirk/prepared-geom?expand=1 could be a good starting point.

@gauteh
Copy link
Author

gauteh commented Aug 16, 2024

Now testing out prepared geometries through #1198, is it better if I create an RTree with prepared polygons, or should I just create one prepared multipolygon? I am testing many points against all the polygons.

@gauteh
Copy link
Author

gauteh commented Aug 18, 2024

Using a RTree of prepared geometries (gauteh/roaring-landmask@e413778), slightly worse:

test shapes::tests::benches::test_contains_in_ocean ... bench:  41,531,670.10 ns/iter (+/- 4,594,429.18)
test shapes::tests::benches::test_contains_on_land  ... bench:  42,088,323.30 ns/iter (+/- 3,958,897.37)

@airbreather
Copy link

airbreather commented Oct 3, 2024

I've worked on NTS, the C# port of JTS (whereas GEOS is the C++ port of JTS). Since it looks like part of this discussion focuses SPECIFICALLY on the "contains point" case, while it's true that JTS / NTS / GEOS use an R-tree for this, there's some nuance that I don't see being discussed here (it's somewhat addressed here, but it's the details of that index that make a huge difference). If I've scanned too quickly and missed it, please ignore me...

For single-point tests specifically, it's NOT a typical 2-dimensional R-tree holding all the line segments. Rather, it uses a 1-dimensional "interval R-tree" of those segments.

Mathematically, calculating "where is (x, y) located relative to this (multi/)polygon?" is done by taking a ray that starts at (x, y), following it to infinity in any direction, and counting how many times it crosses a polygon ring segment. Ignoring the "exactly on the boundary" case for this discussion (not because it's unimportant generally, but because it doesn't affect indexing), if the result is even (including zero), then the point is outside. Otherwise, it's inside.

The math doesn't care which direction we choose, but there are four interesting ones: positive / negative, parallel to the x / y axis. GEOS happens to pick positive X, so I'll talk about that specifically, but it can work with any of the four.

Once you commit to choosing choosing that direction, you can index all of the line segments of the (multi/)polygon using a variant of the R-tree that only looks at their min/max Y values. With that index, a query for point (x, y) can cheaply discard any segments that can't possibly influence the result because they're either fully above or fully below the query point. What's more, you can very cheaply rule out a lot of the remaining segments as well by just checking if they're fully on the other side of the ray from your query point, leaving a very small subset of the segments where you have to run the more expensive test, almost all of which actually do need to influence the result in some way, almost all of the time.

I hope this helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants