Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make node constructors public, add RTree::new_from_root #109

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nickguletskii
Copy link
Contributor

  • I agree to follow the project's code of conduct.
  • I added an entry to rstar/CHANGELOG.md if knowledge of this change could be valuable to users.

This PR makes some internal APIs public to allow users to implement custom tree serialization and deserialization. I am currently using this functionality to save and load large rstar RTrees to a custom compact representation in order to deliver them over the network.

@adamreichold
Copy link
Member

adamreichold commented Jan 27, 2023

While there is no unsafe code in this crate, should this document some of the expected invariants for these constructors which should be upheld to avoid unexpected panics down the road?

@nickguletskii
Copy link
Contributor Author

@adamreichold Thanks, completely forgot about that. I've added some warnings to the documentation of RTree::new_from_root to dissuade users from using this method to load custom/modified/spliced trees. I've also adding a warning that the trees may not be compatible between different versions of this crate. This should prevent this method from becoming an additional maintenance burden.

@urschrei
Copy link
Member

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

@adamreichold
Copy link
Member

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer and Deserializer implementations using the existing serde support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.

@urschrei
Copy link
Member

In that case I'm happy to continue to let it bake for now.

@nickguletskii
Copy link
Contributor Author

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer and Deserializer implementations using the existing serde support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.

Sorry for the late reply.

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

@rmanoka
Copy link
Contributor

rmanoka commented Jun 13, 2023

I'm also for merging this as is; it helps a reasonable use-case mentioned above.

@nickguletskii could you provide us more details on the compact representation that required custom serialization?

@adamreichold
Copy link
Member

adamreichold commented Jun 13, 2023

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

My comment was alluding to the serializer, not the Serialize impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existing Serialize impl of RTree. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.

@nickguletskii
Copy link
Contributor Author

I'm also for merging this as is; it helps a reasonable use-case mentioned above.

@nickguletskii could you provide us more details on the compact representation that required custom serialization?

Sure! I'm storing leafs separately from the tree inside an arena, and serializing only the topology of the tree (naively or using a something like a Prüfer code), without the AABBs. During deserialization, I reconstruct the AABBs on-the-fly. This significantly reduces the size of the tree when transferred over the network, and reduces the CPU time and allocations necessary to load it in comparison to performing bulk_load on the source data.

@adamreichold
Copy link
Member

During deserialization, I reconstruct the AABBs on-the-fly.

This sounds like something we'd want to integrate into the generic Serde implementation.

@nickguletskii
Copy link
Contributor Author

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

My comment was alluding to the serializer, not the Serialize impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existing Serialize impl of RTree. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.

I think it should be possible to implement a Deserializer that only implements deserialize_struct and deserialize_u64 to deserialize the RTreeStruct. However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing. Besides, what if someone is using something like rkyv instead of serde? serde support is behind a feature flag in this crate.

This sounds like something we'd want to integrate into the generic Serde implementation.

I'll try to find some time to contribute and benchmark a custom implementation of Serialize and Deserialize for RTree that skip AABBs.

However, I would still appreciate new_from_root being upstreamed because removing AABBs from the serialized trees is still only a part of what I'm doing. The other optimizations are probably too domain-specific and integrating them would be impractical and involve a lot of breaking changes.

@nickguletskii nickguletskii force-pushed the feature/pub-parentnode-constructors branch from cd05425 to 14d859d Compare June 13, 2023 20:10
@adamreichold
Copy link
Member

However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.

I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.

Consider the description of your approach in #109 (comment), do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?

(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)

@nickguletskii
Copy link
Contributor Author

However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.

I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.

Consider the description of your approach in #109 (comment), do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?

(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)

The current API proposed in this PR already doesn't allow the user to provide bogus envelopes for parent nodes, since ParentNode::new_parent calculates its envelope on its own. This part is actually safer than, for example, deserializing a tree from serde right now, because the deserializer copies the serialized, potentially bogus envelope without recalculating it from the children. The warning about AABBs in the documentation warns the user against changing the leaf envelopes, e.g. reusing the tree after rotating the geometry indexed by the RTree. There's no way to verify that this hasn't happened, just like there's no way to verify that the user has implemented RTreeObject::envelope correctly (i.e. it remains immutable after the object is inserted into a tree).

The tricky part is checking the RTree invariants. We could always call something like the sanity check used in the tests (sans the envelope checks), but that would be an extra tree traversal.

Alternatively, we could make a node builder struct that would keep track of node subtree depth and verify the child counts. This is how I would probably do this if I wanted to allow the user to build their own RTrees manually. That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar. If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.

@adamreichold
Copy link
Member

That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar.

I understand that, but we have to consider what happens when such an API is made available, e.g. what kind of bug reports we will be getting by opening this door. There are costs to you-are-holding-it-wrong API and hence I have a strong desire for safer alternatives.

Especially since thinking back when this was an unsafe API, there are already ways to circumvent the type system which do not need upstream support:

struct MyRTreeLookalike {
 root: rstar::ParentNode<MyObject>,
 size: usize,
}

let tree = unsafe { std::mem::transmute::<rstar::RTree<MyObject, rstar::DefaultParams>>(MyRTreeLookalike { root, size }) };

(assuming that new_root and new_parent were public).

Of course, the ergonomics of this are horrible as you now have to ensure that MyRTreeLookalike actually stays layout-compatible with the actual RTree type, but the problem does stay local to the crate in question.

Of course, at the end of day, I will probably just throw in the towel and stop arguing against merging this, because as you point out even if we do add a safe API to build up arbitrary trees, then

If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.

which means that we still have to be very very careful if we ever write any unsafe code in this crate to rely on nothing but the invariants that we can actually check through these API which might still be too high a cost when taking into account that someone has to build them. (I think it would be unfair to ask that of you because this was not what you were trying to achieve.)

@rmanoka
Copy link
Contributor

rmanoka commented Jun 15, 2023

Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:

  1. pre-order traversal of the leaf nodes, to make a vector
  2. store the split indices / equivalent info of the tree-hierarchy to fast-load
  3. provide pub interface to fast-load an r-tree from (Vec, split-indices)

@nickguletskii Do you think you can contribute this / something similar? Would be quite useful for many others I believe. Would also circumevent the issues pointed above with wanting to make internal functions pub.

On the pub issue, how about an unsafe pub wrapper around an internal pub(crate); this way, we'll only need one unsafe usage. May be we could try to just allow(unsafe_code) in one module if that's possible.

@adamreichold
Copy link
Member

Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:

I think what you propose would certainly be nice as an efficency improvement to our internal serialization support, but I think it goes beyond the linked implementations as those just build up the tree structures upon deserialization using the public API which has considerable overhead. I think the equivalent would be using bulk_load for deserialization which is not what is requested here.

Finally, the API taking the vector of leaves and split indices basically has the same issue as the one proposed here if made public. It is not clear if the resulting tree is a valid R* tree and therefore which invariants other code in this crate can assume when operating on this tree.

On the pub issue, how about an unsafe pub wrapper around an internal pub(crate); this way, we'll only need one unsafe usage. May be we could try to just allow(unsafe_code) in one module if that's possible.

I'll just withdraw my objections. As long as we have no unsafe code elsewhere, the possible fallout of this is random panics or incorrect results when someone constructs an invalid tree using this API. When we want to use unsafe code in the implementation, we should probably reconsider whether to mark this function unsafe, but then we should also be able to explicitly describe the properties it must uphold so that our unsafe code can rely on them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants