clustering methods #1

qiankl · 2023-05-12T07:46:36Z

Good idea and nice job! Works well with CLIP embeddings.

I also tried clustering CLIP embeddings with DBSCAN, but the results are not as good as with hierarchical clustering. Could you explain why you chose hierarchical clustering and why it works? Thanks a lot.

LexCybermac · 2023-05-17T13:16:59Z

That I couldn't give a meaningful answer to. I switched from DBSCAN in an earlier iteration of this tool to hierarchical clustering with an Annoy index as an experiment to reduce memory usage when grouping larger datasets. I understood that it would mean slightly less accuracy, but figured that wouldn't be an issue, since in this case we're working with approximately similar embeddings rather than exactly similar ones anyway. If there is a performance benefit to this change in approach, then it's not one I anticipated.

In any case, I plan to get around to putting together another update that switches this tool to FAISS in place of hierarchical clustering on an Annoy index. This change should, in theory, scale better to larger datasets while maintaining the same quality of grouping results as my current approach. So, if you're looking to build a similar tool, I'd recommend checking out FAISS.

violapaul · 2025-01-29T15:51:20Z

Nice job. After glancing at the code, it looks like the Annoy-related code is not providing you any benefit. You still compute the full distance matrix, and then linkage computes its own nearest neighbors (using its own metric!).

To make this work with very large databases, you'd need a clustering algorithm that works directly with an approximate nearest neighbor search.

Note, this is the code that still computes all pairwise distances. Nothing approximate here!

# Compute the distance matrix of the embeddings using the Annoy index
def compute_distance_matrix(all_embeddings, annoy_index):
    n = len(all_embeddings)
    distances = []

    for i in range(n):
        for j in range(i + 1, n):
            distance = annoy_index.get_distance(i, j)
            distances.append(distance)

    return distances
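
For contrast, here is a minimal sketch of what "working directly with a nearest neighbor search" could look like. It is not the tool's method: it swaps in connected components on a thresholded k-nearest-neighbor graph (via union-find), so it asks for only k neighbors per item instead of all n(n-1)/2 pairs. The names `cluster_from_neighbors`, `neighbors_fn`, `brute_force_neighbors`, and the `k` and `max_distance` values are all hypothetical. With Annoy, `neighbors_fn` could plausibly be `lambda i, k: annoy_index.get_nns_by_item(i, k, include_distances=True)`; the brute-force stand-in below exists only so the sketch runs on its own.

```python
import math


def cluster_from_neighbors(n_items, neighbors_fn, k=10, max_distance=0.5):
    """Label items by linking each one to its close neighbors (union-find).

    neighbors_fn(i, k) is assumed to return (ids, distances) for the k
    nearest neighbors of item i, e.g. from an approximate index.
    """
    parent = list(range(n_items))

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n_items):
        ids, dists = neighbors_fn(i, k)
        for j, d in zip(ids, dists):
            # Only link pairs that are close enough (threshold is an assumption).
            if j != i and d <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Items sharing a root belong to the same group.
    return [find(i) for i in range(n_items)]


def brute_force_neighbors(points):
    """Exact stand-in for an ANN index; hypothetical, for demonstration only."""
    def neighbors_fn(i, k):
        ranked = sorted(
            range(len(points)),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        return ranked, [math.dist(points[i], points[j]) for j in ranked]
    return neighbors_fn


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = cluster_from_neighbors(
    len(points), brute_force_neighbors(points), k=3, max_distance=0.5
)
```

The first three points end up sharing one label and the last two another. Note that this behaves like single-linkage grouping with a distance cutoff, so it can chain loosely connected items together; whether that is acceptable depends on the data.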
