distance_2 is problematic for integer types #48

Bytekeeper · 2020-08-10T14:57:13Z

Ie. a distance of 1,000,000 fits into i32, but squared it will not. Maybe have an associated type for the return value, defaulting to Point::Scalar?

The text was updated successfully, but these errors were encountered:

jverce · 2020-08-21T00:14:21Z

Having an associated value can help for some cases, but wouldn't that be "postponing" the problem?

Let's say that we do something like this:

let point_a: [i32; 2] = [-2147483648, -2147483648]; // (-2^31, -2^31)
let point_b: [i32; 2] = [2147483647, 2147483647];   // (2^31-1, 2^31-1)
let points = vec![point_a, point_b];
let t: RTree<[i32; 2]> = RTree::bulk_load(points);

// Setting the return type to i64 to fit the intermediate result
let nn: Option<[i64; 2]> = t.nearest_neighbor(&point_a);

The intermediate distance computation wouldn't fit either, and at this point we've run out of options since i64 is the largest integer type we have at our disposal, so we're back to square 1.

Disclaimer from my part: dealing with integers is a pain when you have arbitrarily large values.
Off the top of my head the available options at this point are:

Use an adjustable type for the return value as suggested, and add either static or dynamic checks to use the correct type so as to avoid overflows
Use a numeric library like the one in the num_bigint crate
Remove integers altogether and use f32 and f64

Bytekeeper · 2020-08-23T09:07:21Z

One could use i128, but I see the problem.
Removing integers: You mean for distance_2 or generally?

Btw. maybe my issue is invalid? After all, I could have used i64 to begin with. (And in my use case values never go beyond 64k).

jverce · 2020-08-23T19:18:43Z

Re: removing integers, I meant generally throughout the library. That would allow users to handle arbitrarily large or small numbers without caring for the types of any intermediate calculations. But going through this route can be risky, there's a lot of things to consider.

As you said, the thing users can do now is use the data type that can handle both the data itself and the intermediate computations that the library performs. But this can be improved IMHO (after all, users should not have to deal with the internals of the library), so I wouldn't flag this issue as invalid.

Bytekeeper · 2020-08-24T07:45:10Z

The problem with using even f64 for internal calculations is, that there's a loss of precision:
println!("{}", 499999999000000001_u64 as f64); yields 499999999000000000

I can imagine that for some use cases this bad. And num_bigint might affect performance?
As long as a user can provide their own distance_2 implementation I guess it's not a problem?

jverce · 2020-08-24T23:50:35Z

Yeah, there are several cases in which floating point manifest their weaknesses.

@Stoeoef mentioned in spade that the project wasn't focused on performance but rather "on a rich and high quality feature set". Maybe that's the focus for this library as well? If so, we might be allowed to consider other libraries to handle numeric types and abstract the details away from users.

We could also allow users to provide their own distance_2 implementations. This could be super useful in cases where users are dealing with data points that are not in the Euclidean space for example.

adamreichold · 2020-08-25T08:39:49Z

If so, we might be allowed to consider other libraries to handle numeric types and abstract the details away from users.

In my use case, performance is quite important as our spatially explicit simulations often spend almost a third of their CPU time on spatial index look-ups. Hence, I am personally rather keen on making the choice of the appropriate data type, e.g. double or single precision floating-point in addition to integer or floating point, and would very probably not be able to stomach the performance hit of using arbitrary precision integers internally.

We could also allow users to provide their own distance_2 implementations. This could be super useful in cases where users are dealing with data points that are not in the Euclidean space for example.

We already do via the PointDistance trait. Personally, I am using this crate for a model that works in polar coordinates and implement PointDistance::distance_2 using the law of cosines, e.g.

fn distance_2(
    &self,
    point: &Point
) -> f64 {
    self.r.powi(2) + point.r.powi(2) - 2. * self.r * point.r * (self.phi - point.phi).cos()
}

jverce · 2020-08-31T14:17:12Z

Interesting...
I'd be happy to try different solutions and perform benchmarks on them. Maybe we can get away with both use cases 😄
I'll keep you posted.

Stoeoef · 2021-08-03T21:42:36Z

The support for integer types is something I'd like to keep - they work really well with rtrees those don't require any multiplication or division operations.

However, I do acknowledge that this is problematic for large values.
I'm unfortunately not sure how well rstar can support large integer values. IMO, the only realistic workaround in regards to performance is using f64, despite the issue that not all i64 are representable.

Something like BigInteger really expensive in my experience, especially since the distance is calculated over and over again. Also, not having Copy really hurts maintainability. That's why I'm sceptical about support for that unless the performance is - despite my expectation - actually not too bad (in which case this would become an option)

For now, the most cost efficient solution would IMO be:

Document that using integers comes with the risk of overflows and should only be done if the input value range is limited to a certain range (probably +-2 ^ 32 for i64?)
Attempt to call checked_mul when using integer multiplication and panic on overflows (even in release mode)
Not use any overflow checks for floats.

maboesanman · 2021-09-22T19:41:06Z

Is it enough for distance_2 to return a type whose ord implementation agrees with Pythagorean distance?

X is the bits, I.e. 32 or size
you could have distance_2::<[u|i]X>(a,b) -> (high, middle, low) where high middle and low are all uX.

you can do this without worrying about overflow by doing a single iteration of karatsuba multiplication (like this https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3fa5db12977a74edc7c4d79feb70ef8b)

Then you impl ord by comparing the high, mid, low of self and other in sequence.

this can be done completely generic over the number type

hlinnaka mentioned this issue Sep 19, 2022

R-Tree layer map neondatabase/neon#2317

Merged

urschrei mentioned this issue Sep 26, 2023

Can I use rstar for geographic points & use a great circle distance? #135

Open

adamreichold mentioned this issue Dec 24, 2023

Need to use different types for Point::Scalar and square #145

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distance_2 is problematic for integer types #48

distance_2 is problematic for integer types #48

Bytekeeper commented Aug 10, 2020

jverce commented Aug 21, 2020

Bytekeeper commented Aug 23, 2020

jverce commented Aug 23, 2020

Bytekeeper commented Aug 24, 2020

jverce commented Aug 24, 2020

adamreichold commented Aug 25, 2020

jverce commented Aug 31, 2020

Stoeoef commented Aug 3, 2021

maboesanman commented Sep 22, 2021

distance_2 is problematic for integer types #48

distance_2 is problematic for integer types #48

Comments

Bytekeeper commented Aug 10, 2020

jverce commented Aug 21, 2020

Bytekeeper commented Aug 23, 2020

jverce commented Aug 23, 2020

Bytekeeper commented Aug 24, 2020

jverce commented Aug 24, 2020

adamreichold commented Aug 25, 2020

jverce commented Aug 31, 2020

Stoeoef commented Aug 3, 2021

maboesanman commented Sep 22, 2021