Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NeighborList vs Index + Distance Matrices #25

Open
JiminyCricket411 opened this issue Feb 4, 2025 · 3 comments
Open

NeighborList vs Index + Distance Matrices #25

JiminyCricket411 opened this issue Feb 4, 2025 · 3 comments

Comments

@JiminyCricket411
Copy link

Hi Aaron,

I was curious regarding the choice of the struct

using NeighborList = knncolle::NeighborList<Index_, Float_>;

Which is a std::vector<std::vector<std::pair<Index_, Float_>>>

Over a simple std::pair<Matrix<Index_>, Matrix<Float_>> of appropriate dimensions.
Maybe it was as result of porting the code, but I would expect the latter option to have better cache locality and maybe even slightly better memory usage.

@LTLA
Copy link
Collaborator

LTLA commented Feb 4, 2025

IIRC the reason was just ergonomics. In some downstream applications, I had to re-sort by indices instead of distance, and it was easier to do so with a vector of pairs. Also it's easier to ensure that users supply a valid index and distance pair, whereas with separate containers, you always have to validate that each index has an associated distance.

Come to think of it, a few releases ago, I had switched to std::vector<std::pair<std::vector<Index_>, std::vector<Float_> > > in the hope that I would improve memory usage because there wouldn't be any padding when Index_ and Float_ were not of the same size. However, it turned out to be a pain to work with, so I switched back.

I'm not sure how much improvement to cache locality there might be. In most situations, you would want both the index and the distance, so you end up having to pull them both into cache anyway. With a pair, you'd get them both at once, whereas it might incur two separate look-ups if they were in separate containers. (But I don't really know for sure.)

Anyway, if you think that a pair of matrices is better for your application, you can easily just roll your own loop. I do that myself in the R package bindings, at least when k is constant for all observations.

Edit: just realized this is an issue on the umappp repo, in which case you might as well know that the resort-by-index occurs in combine_neighbor_sets(). Also in qdtsne where the re-sorting happens in the symmetrize() function. In both cases, we will typically end up with different numbers of neighbors for each observation so it wouldn't fit into a matrix anyway.

@JiminyCricket411
Copy link
Author

Ergonomics makes a lot of sense looking at combine_neighbor_sets.
Is this the equivalent of the fuzzy_simplical_set function from the original python repo?
Would be nice to have a less memory-fragmented way of dealing with the index, distance pairs; but perhaps that isn't possible based on what needs to be done in combine_neighbor_sets.

@LTLA
Copy link
Collaborator

LTLA commented Feb 4, 2025

Is this the equivalent of the fuzzy_simplical_set function from the original python repo?

I think so, but I translated it from the uwot C++ code, which in turn was translated from the original Python code; I've never actually looked at the latter myself. If I had to guess, it would have been translated from general_sset_union() but it was so long ago that nothing really rings a bell TBH.

Also, FWIW, the NeighborList is only used during initialization of the Status object; inside the Status, everything is effectively converted into a compressed sparse matrix for the optimization iterations, which should have much better memory locality. So switching out the NeighborList with something else is unlikely to improve performance because the optimization takes the longest time anyway. (Also, the random negative sampling means that the algorithm is constantly dancing across the dataset, which is probably a greater source of suboptimality if you're thinking about memory access.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants