|
10 | 10 | //! 3. Iterative edge refinement |
11 | 11 | //! |
12 | 12 | //! Generic over data type `T` and distance function. |
| 13 | +//! |
| 14 | +//! While https://github.com/cmuparlay/ParlayANN is not specifically for MST construction, |
| 15 | +//! it gives some idea of how far ANN graph construction can be scaled. |
| 16 | +//! |
| 17 | +//! One problem with very large graphs is cache misses. To address this, a common approach |
| 18 | +//! is to partition the data/graph into smaller buckets and solve the k-NN problem within each bucket. |
| 19 | +//! To make this actually work, one needs to rearrange the data points according to the partitioning |
| 20 | +//! to improve cache locality. Quite a strong partitioning can be obtained via masked sorting of geo-filters. |
| 21 | +//! For large chunks of the sorted data, a partial MST can be computed and then merged with the previously |
| 22 | +//! computed approximate MSTs. This approach is probably more efficient than the windowing approach currently |
| 23 | +//! implemented in Blackbird, since many different masks are needed to get accurate results. |
| 24 | +//! |
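The chunk-and-merge scheme above could be sketched roughly as follows. This is a hedged illustration, not the crate's actual implementation: it uses its own minimal union-find rather than the crate's `union_find` module (whose API is not shown here), represents edges as `(u, v, weight)` triples, and `merge_forests` is a hypothetical name. Merging is just Kruskal's algorithm run over the union of the two edge sets.

```rust
/// Minimal union-find with path compression (illustrative only; the
/// crate has its own `union_find` module with an unknown API).
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let p = self.parent[x];
            let root = self.find(p);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    /// Returns true iff `a` and `b` were in different components.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        self.parent[ra] = rb;
        true
    }
}

/// Merge two approximate spanning forests over `n` points by running
/// Kruskal over the union of their edge sets: cheapest edges first,
/// keeping an edge only if it connects two distinct components.
fn merge_forests(
    n: usize,
    a: &[(usize, usize, f32)],
    b: &[(usize, usize, f32)],
) -> Vec<(usize, usize, f32)> {
    let mut edges: Vec<(usize, usize, f32)> = a.iter().chain(b.iter()).copied().collect();
    edges.sort_by(|x, y| x.2.partial_cmp(&y.2).unwrap());
    let mut uf = UnionFind::new(n);
    edges.into_iter().filter(|&(u, v, _)| uf.union(u, v)).collect()
}

fn main() {
    // Two partial spanning forests over 4 points (weights are illustrative).
    let a = [(0usize, 1usize, 1.0f32), (2, 3, 1.0)];
    let b = [(1, 2, 5.0), (0, 3, 2.0)];
    let mst = merge_forests(4, &a, &b);
    assert_eq!(mst.len(), 3); // a spanning tree over 4 nodes has 3 edges
    println!("merged MST edges: {:?}", mst);
}
```

Because each partial MST already discards most candidate edges, the merged edge list stays small, so the merge cost is dominated by the per-chunk MST construction rather than by Kruskal's sort.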
| 25 | +//! The default sorting implementation of Rayon is not sufficient for very large data sets, though. |
| 26 | +//! The problem is that our data points are fairly large, so copying them is costly, but leaving |
| 27 | +//! them in place results in cache misses. Therefore, one has to group the data into manageable chunks whose data |
| 28 | +//! all lives in one contiguous block of memory, so that the actual sorting can be done via references. |
| 29 | +//! The simplest solution to chunking is to first sample a random subset of the data. Those sample points |
| 30 | +//! are collected into a contiguous block and sorted via indirection; the data can then be rearranged if needed. |
| 31 | +//! In the second phase, all data is partitioned according to the sample points. This can be done efficiently |
| 32 | +//! in parallel. The intermediate output of this step is the partition each data point belongs to. |
| 33 | +//! At the end of this step, the data points can be copied to their locations within their partitions. |
| 34 | +//! In the last phase, each partition can be sorted individually via indirection and again be rearranged at the end if needed. |
| 35 | +//! Note that all of these steps are cache-friendly and highly parallelisable. |
13 | 36 |
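The three phases described above might look roughly like the following sequential sketch. All names (`sample_sort`, the generic parameters) are assumptions for illustration, and the real implementation would run each phase in parallel, e.g. with Rayon; the point of the sketch is that records are compared via indices and physically moved only once per phase, which matters when `T` is large.

```rust
/// Illustrative sequential sample sort via indirection (assumed name,
/// not the crate's API). Phase 1 sorts a small sample to obtain pivots,
/// phase 2 classifies and scatters points into contiguous partitions,
/// phase 3 sorts each partition through an index permutation and
/// rearranges the records once at the end.
fn sample_sort<T: Clone, K: Ord, F: Fn(&T) -> K>(data: &mut Vec<T>, key: F, parts: usize) {
    let n = data.len();
    if n < 2 || parts < 2 {
        data.sort_by(|a, b| key(a).cmp(&key(b)));
        return;
    }
    // Phase 1: pick `parts - 1` evenly spaced sample keys as pivots and
    // sort the (small) sample.
    let step = (n / parts).max(1);
    let mut pivots: Vec<K> = (1..parts).map(|i| key(&data[(i * step).min(n - 1)])).collect();
    pivots.sort();
    // Phase 2: classify every point independently (embarrassingly
    // parallel), then scatter each point into its partition's range.
    let part_of: Vec<usize> = data
        .iter()
        .map(|x| pivots.partition_point(|p| *p <= key(x)))
        .collect();
    let mut offsets = vec![0usize; parts + 1];
    for &p in &part_of {
        offsets[p + 1] += 1;
    }
    for i in 0..parts {
        offsets[i + 1] += offsets[i]; // prefix sum: partition boundaries
    }
    let mut cursor = offsets.clone();
    let mut scattered: Vec<Option<T>> = (0..n).map(|_| None).collect();
    for (x, &p) in data.iter().zip(&part_of) {
        scattered[cursor[p]] = Some(x.clone());
        cursor[p] += 1;
    }
    let mut out: Vec<T> = scattered.into_iter().map(|o| o.unwrap()).collect();
    // Phase 3: sort each contiguous partition via an index permutation
    // (cheap swaps of `usize`), then move the records once.
    for i in 0..parts {
        let chunk = &mut out[offsets[i]..offsets[i + 1]];
        let mut idx: Vec<usize> = (0..chunk.len()).collect();
        idx.sort_by(|&a, &b| key(&chunk[a]).cmp(&key(&chunk[b])));
        let sorted: Vec<T> = idx.iter().map(|&j| chunk[j].clone()).collect();
        chunk.clone_from_slice(&sorted);
    }
    *data = out;
}

fn main() {
    // Records with a large payload: only the key participates in comparisons.
    let mut v: Vec<(u32, [u8; 32])> = (0..100).map(|i| ((i * 37) % 100, [0u8; 32])).collect();
    sample_sort(&mut v, |x| x.0, 4);
    assert!(v.windows(2).all(|w| w[0].0 <= w[1].0));
    println!("sorted {} records", v.len());
}
```

Since every key in partition `i` is strictly less than every key in partition `i + 1`, sorting the partitions independently yields a globally sorted sequence, so the per-partition work in phases 2 and 3 needs no cross-partition communication.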
|
14 | 37 | mod nn_descent; |
15 | 38 | mod union_find; |
@@ -802,7 +825,7 @@ mod tests { |
802 | 825 | fn test_large_scale_vs_exact() { |
803 | 826 | use rand::distributions::{Distribution, Uniform}; |
804 | 827 |
|
805 | | - const N: usize = 1_000_000; |
| 828 | + const N: usize = 10_000_000; |
806 | 829 | const DIM: usize = 10; |
807 | 830 |
|
808 | 831 | println!("Generating {} random {}-dimensional points...", N, DIM); |
|