clustering methods #1
Comments
That I couldn't give a meaningful answer to. An earlier iteration of this tool used DBSCAN. I switched to hierarchical clustering with an Annoy index as an experiment to reduce memory usage when grouping larger datasets. I understood that this would mean slightly less accuracy, but figured that wouldn't be an issue, since in this case we're working with approximately similar embeddings rather than exactly similar ones anyway. If there is a performance benefit to this change in approach, it's not one I anticipated. In any case, I plan to put together another update that switches this tool to FAISS in place of hierarchical clustering on an Annoy index. In theory, that change should scale better to larger datasets while maintaining the same quality of grouping results as my current approach. So, if you're looking to build a similar tool, I'd recommend checking out FAISS.
Nice job. After glancing at the code, the Annoy-related code is not providing you any benefit. You still compute the full distance matrix, and then linkage computes its own nearest neighbors (using its own metric!). To make this work with very large databases, you'd need to find a clustering algorithm that works directly with an approximate nearest neighbor search. Note, this is the code that still computes all pairwise distances. Nothing approximate here!
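To illustrate what "works directly with a nearest neighbor search" could look like, here is a minimal sketch, not the code from this repository. It links each item only to its `k` nearest neighbors within a distance threshold and takes the connected components of that graph as clusters, so memory grows with roughly n·k instead of n². The names `cluster_by_neighbor_graph`, `neighbors_fn`, and `brute_force_neighbors` are hypothetical. The brute-force function is only a stand-in so the example runs on the standard library; in practice `neighbors_fn` would presumably wrap an index query such as Annoy's `get_nns_by_item(..., include_distances=True)` or FAISS's `index.search`.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_neighbors(vectors, i, k):
    """Stand-in for an ANN query (assumption): the k nearest items to item i
    as (index, distance) pairs. A real index would avoid scanning every item."""
    dists = [(j, cosine_distance(vectors[i], v))
             for j, v in enumerate(vectors) if j != i]
    dists.sort(key=lambda pair: pair[1])
    return dists[:k]


def cluster_by_neighbor_graph(n_items, neighbors_fn, k, threshold):
    """Union items with any of their k nearest neighbors closer than
    `threshold`; connected components become clusters. Only n_items * k
    distances are ever examined, never the full pairwise matrix."""
    parent = list(range(n_items))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n_items):
        for j, dist in neighbors_fn(i, k):
            if dist <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    labels, root_to_label = [], {}
    for i in range(n_items):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Toy 2-D "embeddings": two tight pairs and one outlier.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [-1.0, 0.0]]
labels = cluster_by_neighbor_graph(
    len(vectors),
    lambda i, k: brute_force_neighbors(vectors, i, k),
    k=2,
    threshold=0.05,
)
print(labels)  # [0, 0, 1, 1, 2]
```

Merging through neighbor links behaves much like single-linkage clustering cut at `threshold`, restricted to the k-nearest-neighbor graph, so it can chain loosely related items together. Whether that matches the grouping quality of the current approach would need to be checked on real embeddings.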
Good idea and nice job! It works well with CLIP embeddings.
I also tried clustering CLIP embeddings with DBSCAN, but the results were not as good as with hierarchical clustering. Could you explain the reason for using hierarchical clustering and why it works? Thanks a lot.