Parallel KD-Tree construction #165

Open
B1ueber2y opened this issue Feb 11, 2022 · 12 comments
@B1ueber2y

Hey thanks for the great project! I am wondering if there is any easy way (or forked project) to enable parallelism for KD-Tree construction with nanoflann? Or do you have plans to integrate this feature in the future?

@AlphaHot

What is the MatrixType & coeff?

@Mintpasokon
Contributor

Hey, I forked the project and made a commit that builds nanoflann::KDTreeBaseClass concurrently. It works by adding an extra parameter to KDTreeSingleIndexAdaptorParams (1 by default). In my tests, concurrent builds with 8-16 threads improve build performance by up to around 3x.
See the docs about the parameter:
https://github.com/Mintpasokon/nanoflann-concurrent-build/tree/concurrent_build#23-kdtreesingleindexadaptorparamsn_thread_build
I was wondering if I could propose a PR for the project. Please let me know if you have any suggestions or expectations. @jlblancoc

@jlblancoc
Owner

I totally love your changes and how you approached it 👍

I checked it with valgrind and the gcc sanitizer and both are happy, so I am too. Well, actually valgrind --tool=helgrind tests/unit_tests complains about "potential" data races, but I think they are false positives due to the use of atomics. Take a look at it first if you want (build with "-g" for debug symbols).

Then, yes, feel free to open a PR, after adding the corresponding line (and your name at the end?) to the CHANGELOG.

Thanks!

@Mintpasokon
Contributor

Mintpasokon commented Mar 31, 2023

Thank you for your reply! I have submitted a PR with some explanation about helgrind. If you have any suggestions or concerns, please let me know. Thanks again!

@jlblancoc
Owner

Closing: this feature was already merged and is released in the latest v1.5.0.

@dokempf
Contributor

dokempf commented Jun 22, 2023

@Mintpasokon Can you share a bit of detail on the dataset/parameters you used to achieve a 3x speedup? I tried it for my application and the results were far off that (including being slower at very large thread counts on an HPC node). I would love to understand whether there is room for improvement - sequential kd-tree construction has become a bottleneck in our application...

@Mintpasokon
Contributor

@dokempf I tested on a Ryzen 7 5800X (8c/16t) with DDR4-3200, using 1-20 million 2D points, to get the 3x speedup. Since you are using a large number of threads, I suspect there may be a problem with memory-access locality, or false sharing, in this line: the indices in vAcc point at effectively random locations in the dataset, so in the worst case your dataset would need a much larger CPU cache than the amount of data actually accessed.

It might help to:

  1. Limit the thread count, maybe to 8 or 16.
  2. Change nanoflann's behavior: make a copy of the original dataset and actually move the data in the dataset, rather than the indices in vAcc, when building the kd-tree (in planeSplit()), which may help with locality.

I have no access to my PC now; I will try the second option when available. Good luck!
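A self-contained sketch of the second suggestion (hypothetical names, not nanoflann's actual code): the first function permutes an index array the way an index-based build does, so every comparison chases a random offset into the full dataset; the second partitions a copy of the points themselves, so later build steps work on contiguous, cache-friendly ranges.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Pt3 { double x, y, z; };

// Index-based split: comparisons dereference pts[i] at scattered offsets,
// which is hard on the cache when the dataset is large.
size_t splitByIndex(const std::vector<Pt3>& pts, std::vector<size_t>& idx,
                    size_t lo, size_t hi, int axis, double cut) {
    auto key = [&](size_t i) {
        return axis == 0 ? pts[i].x : axis == 1 ? pts[i].y : pts[i].z;
    };
    auto mid = std::partition(idx.begin() + lo, idx.begin() + hi,
                              [&](size_t i) { return key(i) < cut; });
    return static_cast<size_t>(mid - idx.begin());
}

// Value-based split: moves the points themselves, so the subranges handed to
// worker threads are contiguous in memory.
size_t splitByValue(std::vector<Pt3>& pts, size_t lo, size_t hi,
                    int axis, double cut) {
    auto key = [axis](const Pt3& p) {
        return axis == 0 ? p.x : axis == 1 ? p.y : p.z;
    };
    auto mid = std::partition(pts.begin() + lo, pts.begin() + hi,
                              [&](const Pt3& p) { return key(p) < cut; });
    return static_cast<size_t>(mid - pts.begin());
}
```

Both return the same split position; the trade-off is one extra copy of the dataset in exchange for sequential memory access during the build.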

@Mintpasokon
Contributor

@dokempf I ran some performance tests varying the number of points and the thread count in kd-tree building. It looks like this:
[benchmark figure: build time vs. number of points and thread count]
It seems that concurrent builds have a performance advantage once there are at least 50,000 points in the dataset, and the speedup ratio grows roughly linearly with the logarithm of the point count. What is the scale of your problem? It may be related.

@jlblancoc
Owner

Thanks for the benchmarking, @Mintpasokon !!
If you feel like it, it would be great if you wanted to open a Pull Request contributing your benchmark to the nanoflann benchmark repo... just in case we have time someday to put everything there in order and produce a decent report :-) In that case, feel free to add your name and/or affiliation to your files, of course.

@dokempf
Contributor

dokempf commented Jun 26, 2023

The scale of my problems is much larger: I tested datasets of 100k-400M 3D double-precision points. The smallest of these do still fit in L3 - yet I only see speed-ups of up to 1.3x, if any.

@jlblancoc
Owner

@dokempf Probably worth investigating an alternative implementation for such large datasets, with T threads:

  1. Split the input dataset into T clusters (to exploit data locality, just split it into T consecutive blocks).
  2. Run the "classic" single-thread build for each cluster.
  3. Merge the resulting tree nodes.
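The three steps above could be sketched roughly like this (a hypothetical, self-contained toy, not nanoflann code; step 3 is approximated here by keeping the T trees as a forest and merging only at query time, taking the best answer over all trees):

```cpp
#include <algorithm>
#include <future>
#include <limits>
#include <memory>
#include <vector>

struct Pt { double x, y; };

struct Node {
    Pt p;
    std::unique_ptr<Node> left, right;
};

// Classic sequential median-split build over pts[lo, hi).
static std::unique_ptr<Node> build(std::vector<Pt>& pts, size_t lo, size_t hi,
                                   int axis) {
    if (lo >= hi) return nullptr;
    const size_t mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
        [axis](const Pt& a, const Pt& b) { return axis == 0 ? a.x < b.x
                                                            : a.y < b.y; });
    auto n = std::make_unique<Node>();
    n->p = pts[mid];
    n->left  = build(pts, lo, mid, 1 - axis);
    n->right = build(pts, mid + 1, hi, 1 - axis);
    return n;
}

static double sq(double v) { return v * v; }

// Standard kd-tree nearest-neighbor descent with pruning.
static void nearest(const Node* n, const Pt& q, int axis, double& best) {
    if (!n) return;
    best = std::min(best, sq(n->p.x - q.x) + sq(n->p.y - q.y));
    const double d = (axis == 0) ? q.x - n->p.x : q.y - n->p.y;
    nearest(d < 0 ? n->left.get() : n->right.get(), q, 1 - axis, best);
    if (sq(d) < best)  // other half-space may still hold a closer point
        nearest(d < 0 ? n->right.get() : n->left.get(), q, 1 - axis, best);
}

// Steps 1+2: build T trees over T consecutive blocks, in parallel.
std::vector<std::unique_ptr<Node>> buildForest(std::vector<Pt>& pts, size_t T) {
    std::vector<std::future<std::unique_ptr<Node>>> futs;
    const size_t block = (pts.size() + T - 1) / T;
    for (size_t t = 0; t < T; ++t) {
        const size_t lo = t * block, hi = std::min(pts.size(), lo + block);
        if (lo >= hi) break;
        // Each task works on a disjoint, contiguous range of pts.
        futs.push_back(std::async(std::launch::async,
            [&pts, lo, hi] { return build(pts, lo, hi, 0); }));
    }
    std::vector<std::unique_ptr<Node>> forest;
    for (auto& f : futs) forest.push_back(f.get());
    return forest;
}

// Step 3 (approximated): query every tree and keep the global best.
double nearestSq(const std::vector<std::unique_ptr<Node>>& forest, const Pt& q) {
    double best = std::numeric_limits<double>::infinity();
    for (const auto& t : forest) nearest(t.get(), q, 0, best);
    return best;
}
```

Queries cost roughly T times a single smaller-tree lookup here; an actual node-level merge (true step 3) would avoid that overhead but is considerably more involved.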

Overall, for n points and T threads, the expected average time complexity would be (if I'm right... just a quick draft):

original: O(n log n) 
parallel: O( n/T · log(n/T)  + T · n/T · log(n/T) ) = O(T · n/T · log(n/T)) = O( n log(n/T) )

Of course, this is just the expected average, and the constants (omitted above) are key in real implementations, but in theory we could obtain some gain.

PS: I would love to, but I don't have the bandwidth at present to try to implement and benchmark it, though... :-(
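As a quick sanity check on the estimate above (taking the O(n log n) and O(n log(n/T)) terms literally and ignoring constants, which is of course a rough assumption), the predicted speedup is log(n) / log(n/T), and it stays small for large n: for n = 4·10^8 and T = 16 it is only about 1.16, the same ballpark as the 1.3x reported earlier in this thread.

```cpp
#include <cmath>

// Predicted speedup from the back-of-envelope estimate above:
// O(n log n) sequential vs. O(n log(n/T)) parallel, constants ignored.
double predictedSpeedup(double n, double T) {
    return std::log2(n) / std::log2(n / T);
}
```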

@jlblancoc jlblancoc reopened this Jun 26, 2023
@jlblancoc
Owner

PS: @dokempf if your data is not so sparse that memory becomes a showstopper, you could try a grid-based representation instead; e.g., here I have an implementation
