clustering methods #1

Open
qiankl opened this issue May 12, 2023 · 2 comments

Comments

@qiankl

qiankl commented May 12, 2023

Good idea and nice job! It works well with CLIP embeddings.

I also tried clustering CLIP embeddings with DBSCAN, but the results were not as good as with hierarchical clustering. Could you explain why you chose hierarchical clustering and why it works? Thanks a lot.

@LexCybermac
Owner

That I couldn't give a meaningful answer to. An earlier iteration of this tool used DBSCAN. I switched to hierarchical clustering with an Annoy index as an experiment to reduce memory usage when grouping larger datasets. I understood that it would mean slightly less accuracy, but figured that wouldn't be an issue, since in this case we're working with approximately similar embeddings rather than exactly similar ones anyway. If there is a performance benefit to this change in approach, it's not one I anticipated.

In any case, I plan to put together another update that switches this tool to FAISS in place of hierarchical clustering on an Annoy index. This change should, in theory, scale better to larger datasets while maintaining the same quality of grouping results as my current approach. So if you're looking to build a similar tool, I'd recommend checking out FAISS.

@violapaul

Nice job. From a glance at the code, the Annoy-related code is not providing any benefit. You still compute the full distance matrix, and then linkage computes its own nearest neighbors (using its own metric!).
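To put a rough number on what "the full distance matrix" costs, here is a small back-of-the-envelope sketch. It counts the n(n-1)/2 pairwise distances and estimates their storage, assuming 8 bytes per distance (an assumption for illustration, not something measured on this tool). The function names are hypothetical.

```python
def condensed_size(n):
    """Number of unique pairs (i, j) with i < j among n items."""
    return n * (n - 1) // 2


def estimated_gib(n, bytes_per_distance=8):
    """Approximate storage for all pairwise distances, in GiB.

    bytes_per_distance=8 is an assumption (one 64-bit float per pair).
    """
    return condensed_size(n) * bytes_per_distance / 2**30


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"n={n}: {condensed_size(n)} pairs, ~{estimated_gib(n):.1f} GiB")
```

The pair count grows quadratically, so a dataset 10 times larger needs roughly 100 times as many distances, regardless of how each individual distance is looked up.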

To make this work with very large databases you'd need to find a clustering algorithm that works directly with an approximate nearest neighbor search.

Note, this is the code that still computes all pairwise distances. Nothing approximate here!

# Compute the distance matrix of the embeddings using the Annoy index
def compute_distance_matrix(all_embeddings, annoy_index):
    n = len(all_embeddings)
    distances = []

    for i in range(n):
        for j in range(i + 1, n):
            distance = annoy_index.get_distance(i, j)
            distances.append(distance)

    return distances
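For contrast, here is a minimal sketch of what clustering directly from nearest-neighbor queries could look like. This is not the tool's method and not a recommendation of a specific algorithm: it simply merges each item with those of its k nearest neighbors that fall under a distance threshold (connected components of a thresholded k-NN graph, via union-find). The `nearest_neighbors` helper is a hypothetical brute-force stand-in so the example runs on its own; in practice it would be replaced by a query against an approximate index, for example Annoy's `get_nns_by_item(i, k, include_distances=True)`. The `k` and `threshold` values are arbitrary assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(embeddings, i, k):
    """Hypothetical stand-in for an ANN query.

    Brute force here, purely so the sketch is self-contained; a real
    approximate index would answer this without scanning every item.
    Returns up to k (distance, index) pairs, closest first.
    """
    scored = [
        (cosine_distance(embeddings[i], other), j)
        for j, other in enumerate(embeddings)
        if j != i
    ]
    scored.sort()
    return scored[:k]


def cluster_from_neighbors(embeddings, k=5, threshold=0.1):
    """Group items by linking each one to its close-enough neighbors."""
    n = len(embeddings)
    parent = list(range(n))  # union-find forest, one tree per group

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only n * k neighbor lookups are made, never all n * (n - 1) / 2 pairs.
    for i in range(n):
        for dist, j in nearest_neighbors(embeddings, i, k):
            if dist <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

With an index-backed `nearest_neighbors`, the work is proportional to n times k queries instead of every pair. The trade-off is that groups are only as good as the neighbor lists: a true neighbor that the index misses is never linked.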


3 participants