
Curse of dimensionality with Nearest Neighbour approaches? #13

Open · mr-september opened this issue May 3, 2022 · 2 comments

Comments

@mr-september

Hi Authors,

Sorry this is less of a technical issue than a conceptual question.

You have alluded to the curse of dimensionality in the original paper, one major component of which is the dilution of "distances" (e.g. *On the Surprising Behavior of Distance Metrics in High Dimensional Space*, or one of the more accessible summaries of it).

Step 1 of EMBEDR relies on calculating NNs; the code appears to rely on the standard scipy-style distance metrics, e.g. "euclidean", "l2", "sqeuclidean", …, "sokalsneath", "yule".

Do you have recommendations among these distance metrics to mitigate the "curse"? Or are there other parts of the algorithm that help with it?
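
For concreteness, here is a minimal sketch of what swapping the NN metric looks like and one way to measure its effect. This assumes scikit-learn's `NearestNeighbors` (EMBEDR's own NN code may differ); the data and names are invented for the illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))  # 500 points in 100 dimensions

def knn_indices(X, k, metric):
    # Ask for k+1 neighbours, then drop each point's self-match in column 0.
    nbrs = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nbrs.kneighbors(X)
    return idx[:, 1:]

k = 15
nn_euclid = knn_indices(X, k, "euclidean")
nn_cosine = knn_indices(X, k, "cosine")

# Mean per-point Jaccard overlap between the two neighbour sets.
overlap = np.mean([
    len(set(a) & set(b)) / len(set(a) | set(b))
    for a, b in zip(nn_euclid, nn_cosine)
])
print(f"mean k-NN overlap, euclidean vs cosine: {overlap:.2f}")
```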

@ejohnson643 (Owner)

Hello!

Thanks for your question! Yes, the curse of dimensionality definitely impacts the accuracy of NN identification; however, the "fuzzy" similarities used by t-SNE and UMAP help to circumvent this problem. EMBEDR uses these same similarities to assess the quality of the embeddings. Furthermore, EMBEDR repeats the embedding process several times, which ideally will average out some of these NN errors.
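
To make the "fuzzy similarities" point concrete, here is a minimal numpy sketch of the standard perplexity-calibrated Gaussian affinities that t-SNE builds on the k-NN graph. This illustrates the general technique, not EMBEDR's own implementation; the function name and defaults are invented for the example:

```python
import numpy as np

def tsne_affinities(sq_dists, perplexity=30.0, tol=1e-5, max_iter=64):
    """Row-stochastic p_{j|i} from each point's squared distances to its
    k nearest neighbours (self excluded). Each bandwidth sigma_i is set by
    binary search so the entropy of row i equals log(perplexity); note that
    perplexity must be below k for the search to converge."""
    P = np.zeros_like(sq_dists, dtype=float)
    target = np.log(perplexity)
    for i, d in enumerate(sq_dists):
        lo, hi = 0.0, np.inf
        beta = 1.0                               # beta_i = 1 / (2 * sigma_i^2)
        for _ in range(max_iter):
            p = np.exp(-beta * (d - d.min()))    # shift for numerical stability
            p /= p.sum()
            entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
            if abs(entropy - target) < tol:
                break
            if entropy > target:                 # too flat: narrow the kernel
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (lo + hi) / 2.0
            else:                                # too peaked: widen the kernel
                hi = beta
                beta = (lo + hi) / 2.0
        P[i] = p
    return P
```

Because each neighbour gets a graded weight rather than a hard in-or-out membership, a few mis-ranked neighbours in high dimensions shift the row weights only slightly instead of flipping entries of the similarity graph outright.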

More concretely, in testing we did not find qualitative differences in the algorithm's output when we changed the metric used to find NNs. Depending on the data you are analyzing, a different metric may be more appropriate (the Jaccard distance for analyzing documents, for example), but in general, using the same metric as your dimensionality reduction method of choice is the best option.
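
As a hypothetical sketch of that advice with umap-learn (its `metric` and `n_neighbors` parameters are part of its public API; the values here are illustrative, not a recommendation from this study):

```python
import umap

# Pick one metric and use it for both the NN graph and the embedding.
metric = "cosine"  # whatever suits your data
reducer = umap.UMAP(metric=metric, n_neighbors=15, random_state=42)
# embedding = reducer.fit_transform(X)  # X: your (n_samples, n_features) array
```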

@mr-september (Author)

Thank you for the great response! Sorry to be a further bother, but do you have any recommended readings that discuss how t-SNE/UMAP/other DR tools mitigate the curse of dimensionality in NN identification? Or perhaps any brief comparisons using the data shared in this study? I know this would be quite a bit of work, and I completely understand if the lab has other priorities.
