NeighborList vs Index + Distance Matrices #25

JiminyCricket411 · 2025-02-04T01:34:57Z

Hi Aaron,

I was curious regarding the choice of the struct

using NeighborList = knncolle::NeighborList<Index_, Float_>;

Which is a std::vector<std::vector<std::pair<Index_, Float_>>>

Over a simple std::pair<Matrix<Index_>, Matrix<Float_>> of appropriate dimensions.
Maybe it was as result of porting the code, but I would expect the latter option to have better cache locality and maybe even slightly better memory usage.

The text was updated successfully, but these errors were encountered:

LTLA · 2025-02-04T03:11:48Z

IIRC the reason was just ergonomics. In some downstream applications, I had to re-sort by indices instead of distance, and it was easier to do so with a vector of pairs. Also it's easier to ensure that users supply a valid index and distance pair, whereas with separate containers, you always have to validate that each index has an associated distance.

Come to think of it, a few releases ago, I had switched to std::vector<std::pair<std::vector<Index_>, std::vector<Float_> > > in the hope that I would improve memory usage because there wouldn't be any padding when Index_ and Float_ were not of the same size. However, it turned out to be a pain to work with, so I switched back.

I'm not sure how much improvement to cache locality there might be. In most situations, you would want both the index and the distance, so you end up having to pull them both into cache anyway. With a pair, you'd get them both at once, whereas it might incur two separate look-ups if they were in separate containers. (But I don't really know for sure.)

Anyway, if you think that a pair of matrices is better for your application, you can easily just roll your own loop. I do that myself in the R package bindings, at least when k is constant for all observations.

Edit: just realized this is an issue on the umappp repo, in which case you might as well know that the resort-by-index occurs in combine_neighbor_sets(). Also in qdtsne where the re-sorting happens in the symmetrize() function. In both cases, we will typically end up with different numbers of neighbors for each observation so it wouldn't fit into a matrix anyway.

JiminyCricket411 · 2025-02-04T05:51:28Z

Ergonomics makes a lot of sense looking at combine_neighbor_sets.
Is this the equivalent of the fuzzy_simplical_set function from the original python repo?
Would be nice to have a less memory-fragmented way of dealing with the index, distance pairs; but perhaps that isn't possible based on what needs to be done in combine_neighbor_sets.

LTLA · 2025-02-04T07:35:42Z

Is this the equivalent of the fuzzy_simplical_set function from the original python repo?

I think so, but I translated it from the uwot C++ code, which in turn was translated from the original Python code; I've never actually looked at the latter myself. If I had to guess, it would have been translated from general_sset_union() but it was so long ago that nothing really rings a bell TBH.

Also, FWIW, the NeighborList is only used during initialization of the Status object; inside the Status, everything is effectively converted into a compressed sparse matrix for the optimization iterations, which should have much better memory locality. So switching out the NeighborList with something else is unlikely to improve performance because the optimization takes the longest time anyway. (Also, the random negative sampling means that the algorithm is constantly dancing across the dataset, which is probably a greater source of suboptimality if you're thinking about memory access.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NeighborList vs Index + Distance Matrices #25

NeighborList vs Index + Distance Matrices #25

JiminyCricket411 commented Feb 4, 2025

LTLA commented Feb 4, 2025 •

edited

Loading

JiminyCricket411 commented Feb 4, 2025

LTLA commented Feb 4, 2025 •

edited

Loading

NeighborList vs Index + Distance Matrices #25

NeighborList vs Index + Distance Matrices #25

Comments

JiminyCricket411 commented Feb 4, 2025

LTLA commented Feb 4, 2025 • edited Loading

JiminyCricket411 commented Feb 4, 2025

LTLA commented Feb 4, 2025 • edited Loading

LTLA commented Feb 4, 2025 •

edited

Loading

LTLA commented Feb 4, 2025 •

edited

Loading