Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk loading using space-filling curve algorithm(s) #102

Open
urschrei opened this issue Aug 24, 2022 · 6 comments
Open

Bulk loading using space-filling curve algorithm(s) #102

urschrei opened this issue Aug 24, 2022 · 6 comments

Comments

@urschrei
Copy link
Member

urschrei commented Aug 24, 2022

rstar currently supports bulk-loading using the OMT algorithm. However, using a Hilbert curve to order the nodes can result in a tree with better storage utilisation and less overlap. There is an extant implementation for an R tree at https://github.com/flatgeobuf/flatgeobuf/blob/master/src/rust/src/packed_r_tree.rs#L193 that should be relatively easily portable.

A related issue is then the creation of a new bulk-loading API which allows selection of the bulk loading algorithm, depending on the application.

@urschrei
Copy link
Member Author

Some related issues:

  1. OMT is top-down, and uses a heuristic based on the number of entries to determine height and number of items per node (p71). Hilbert (and other space-filling curve algorithms) are bottom-up, and seem to use a default height of 16, which allows us to similarly calculate node sizes. However, it's not clear where the magic "16" number was arrived at, and under what conditions one might wish to tune that.

  2. It would be great to know how much of the existing bulk-loading machinery (https://github.com/georust/rstar/blob/master/rstar/src/algorithm/bulk_load/cluster_group_iterator.rs, https://github.com/georust/rstar/blob/master/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs#L14-L87) could be re-used for the Hilbert bulk load. Is it enough to calculate the leaf data bbox Hilbert score, sort, then recursively insert using the new order? Some pointers from @Stoeoef about the existing bulk-load functionality would be really useful here.

@kylebarron
Copy link
Member

  1. it's not clear where the magic "16" number was arrived at, and under what conditions one might wish to tune that

Don't have a reference for this but I thought 16 might have been the maximum precision you can get with 32 bit coordinates

@Stoeoef
Copy link
Collaborator

Stoeoef commented Dec 29, 2022

I'm not sure if much of the existing functionality can be adapted to be honest - the top-down approach seems to be "too different" from the bottom up approach.

I don't think that this is a problem though - Hilbert curves are probably easier to implement (?) than the existing implementation anyway. The existing code is somewhat complex as it has to work in arbitrary dimensions. From memory, the current code successively partitions the elements along all axis and then recurses into the final clusters. For example:
For two dimensions with a MAX_SIZE of 6, the algorithm would

  • partition all elements along the x axis into 3 partitions
  • for each partition (called... slab? cluster group? in the code) , re-partition along the y-axis into 2 sub-partitions.

This results in 2 * 3 = 6 sub-partitions with known AABB that will form a parent node in the rtree. Then, each sub-partition is broken down recursively until the number of items is small enough to be put into a leaf node.

For Hilbert curves, higher dimensions don't seem to be a concern for the rtree - it only matters for calculating the Hilbert score. That's why I can well imagine that this implementation is going to be quite different and simpler.

I think the reason why I've decided against using space filling curves is that I've never really understood how to adapt them well to data that is heavily skewed. From my understanding, the Hilbert curve wouldn't be able to assign different Hilbert scores for elements that are very close together, especially in higher dimensions. I didn't look too deeply into that though - maybe this is more a theoretical concern that doesn't happen in practice?

@kylebarron
Copy link
Member

I think maybe it's worthwhile to split this into two issues? 1) a bulk loading API and 2) implementing OMT?

I think bulk loading is very useful for a lot of cases where you don't need the tree to be dynamic. I'm not familiar with the OMT algorithm; from the paper's introduction it seems they claim it's an improvement on STR?

I think the reason why I've decided against using space filling curves is that I've never really understood how to adapt them well to data that is heavily skewed

I've never worked with so heavily skewed data that a hilbert-packed r tree was a bad fit. I have noticed that hilbert-packed r tree implementations seem to first compute the total bounds of the data and constrain the scale of the hilbert values to that space. So if you're working with lon/lat data covering the US you don't need to compute hilbert values over [-180, -90, 180, 90], which helps with precision.

@urschrei
Copy link
Member Author

implementing OMT

For the avoidance of confusion: OMT is the current bulk-loading algorithm – it's just that the API is intertwined with the implementation because in the distant past rstar didn't have a bulk-loading API and I asked for one using OMT, which @Stoeoef duly implemented.

@kylebarron
Copy link
Member

Oh whoops I missed that a bulk loading API already existed 🤦‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants