Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distance_2 is problematic for integer types #48

Open
Bytekeeper opened this issue Aug 10, 2020 · 9 comments
Open

distance_2 is problematic for integer types #48

Bytekeeper opened this issue Aug 10, 2020 · 9 comments

Comments

@Bytekeeper
Copy link

Ie. a distance of 1,000,000 fits into i32, but squared it will not. Maybe have an associated type for the return value, defaulting to Point::Scalar?

@jverce
Copy link
Contributor

jverce commented Aug 21, 2020

Having an associated value can help for some cases, but wouldn't that be "postponing" the problem?

Let's say that we do something like this:

let point_a: [i32; 2] = [-2147483648, -2147483648]; // (-2^31, -2^31)
let point_b: [i32; 2] = [2147483647, 2147483647];   // (2^31-1, 2^31-1)
let points = vec![point_a, point_b];
let t: RTree<[i32; 2]> = RTree::bulk_load(points);

// Setting the return type to i64 to fit the intermediate result
let nn: Option<[i64; 2]> = t.nearest_neighbor(&point_a);

The intermediate distance computation wouldn't fit either, and at this point we've run out of options since i64 is the largest integer type we have at our disposal, so we're back to square 1.

Disclaimer from my part: dealing with integers is a pain when you have arbitrarily large values.
Off the top of my head the available options at this point are:

  • Use an adjustable type for the return value as suggested, and add either static or dynamic checks to use the correct type so as to avoid overflows
  • Use a numeric library like the one in the num_bigint crate
  • Remove integers altogether and use f32 and f64

@Bytekeeper
Copy link
Author

One could use i128, but I see the problem.
Removing integers: You mean for distance_2 or generally?

Btw. maybe my issue is invalid? After all, I could have used i64 to begin with. (And in my use case values never go beyond 64k).

@jverce
Copy link
Contributor

jverce commented Aug 23, 2020

Re: removing integers, I meant generally throughout the library. That would allow users to handle arbitrarily large or small numbers without caring for the types of any intermediate calculations. But going through this route can be risky, there's a lot of things to consider.

As you said, the thing users can do now is use the data type that can handle both the data itself and the intermediate computations that the library performs. But this can be improved IMHO (after all, users should not have to deal with the internals of the library), so I wouldn't flag this issue as invalid.

@Bytekeeper
Copy link
Author

The problem with using even f64 for internal calculations is, that there's a loss of precision:
println!("{}", 499999999000000001_u64 as f64); yields 499999999000000000

I can imagine that for some use cases this bad. And num_bigint might affect performance?
As long as a user can provide their own distance_2 implementation I guess it's not a problem?

@jverce
Copy link
Contributor

jverce commented Aug 24, 2020

Yeah, there are several cases in which floating point manifest their weaknesses.

@Stoeoef mentioned in spade that the project wasn't focused on performance but rather "on a rich and high quality feature set". Maybe that's the focus for this library as well? If so, we might be allowed to consider other libraries to handle numeric types and abstract the details away from users.

We could also allow users to provide their own distance_2 implementations. This could be super useful in cases where users are dealing with data points that are not in the Euclidean space for example.

@adamreichold
Copy link
Member

If so, we might be allowed to consider other libraries to handle numeric types and abstract the details away from users.

In my use case, performance is quite important as our spatially explicit simulations often spend almost a third of their CPU time on spatial index look-ups. Hence, I am personally rather keen on making the choice of the appropriate data type, e.g. double or single precision floating-point in addition to integer or floating point, and would very probably not be able to stomach the performance hit of using arbitrary precision integers internally.

We could also allow users to provide their own distance_2 implementations. This could be super useful in cases where users are dealing with data points that are not in the Euclidean space for example.

We already do via the PointDistance trait. Personally, I am using this crate for a model that works in polar coordinates and implement PointDistance::distance_2 using the law of cosines, e.g.

fn distance_2(
    &self,
    point: &Point
) -> f64 {
    self.r.powi(2) + point.r.powi(2) - 2. * self.r * point.r * (self.phi - point.phi).cos()
}

@jverce
Copy link
Contributor

jverce commented Aug 31, 2020

Interesting...
I'd be happy to try different solutions and perform benchmarks on them. Maybe we can get away with both use cases 😄
I'll keep you posted.

@Stoeoef
Copy link
Collaborator

Stoeoef commented Aug 3, 2021

The support for integer types is something I'd like to keep - they work really well with rtrees those don't require any multiplication or division operations.

However, I do acknowledge that this is problematic for large values.
I'm unfortunately not sure how well rstar can support large integer values. IMO, the only realistic workaround in regards to performance is using f64, despite the issue that not all i64 are representable.

Something like BigInteger really expensive in my experience, especially since the distance is calculated over and over again. Also, not having Copy really hurts maintainability. That's why I'm sceptical about support for that unless the performance is - despite my expectation - actually not too bad (in which case this would become an option)

For now, the most cost efficient solution would IMO be:

  1. Document that using integers comes with the risk of overflows and should only be done if the input value range is limited to a certain range (probably +-2 ^ 32 for i64?)
  2. Attempt to call checked_mul when using integer multiplication and panic on overflows (even in release mode)
  3. Not use any overflow checks for floats.

@maboesanman
Copy link

Is it enough for distance_2 to return a type whose ord implementation agrees with Pythagorean distance?

X is the bits, I.e. 32 or size
you could have distance_2::<[u|i]X>(a,b) -> (high, middle, low) where high middle and low are all uX.

you can do this without worrying about overflow by doing a single iteration of karatsuba multiplication (like this https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3fa5db12977a74edc7c4d79feb70ef8b)

Then you impl ord by comparing the high, mid, low of self and other in sequence.

this can be done completely generic over the number type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants