
hdbscan and sparse precomputed distance matrix #636

Open
KukumavMozolo opened this issue May 24, 2024 · 3 comments

@KukumavMozolo

Hi!

I am working on the following problem: I have millions of sparse, very high-dimensional data points.

Using a sparse precomputed distance matrix seems like one way to feed this data into hdbscan.

My current idea is to store only those distances that are below a certain threshold (or a fixed number of distances per point) and then ensure that there are no disconnected components in the resulting graph.
How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? E.g., given a fixed min_cluster_size, min_samples and cluster_selection_epsilon, how would these constrain the threshold (or the number of distances per point) so that the resulting clustering is no different from the one obtained with the full distance matrix?
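For illustration, the thresholding idea can be sketched on a toy dense array with scipy. This is my own minimal sketch (the function name `thresholded_distance_matrix` is made up, and `cdist` on the full matrix obviously does not scale to millions of points; the real computation would have to be done blockwise or with a jitted kernel as below):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

def thresholded_distance_matrix(X: np.ndarray, threshold: float) -> csr_matrix:
    """Pairwise Euclidean distances, keeping only entries below `threshold`.

    Illustrative only: builds the full dense matrix first, so it is
    useless at the scale discussed in this issue.
    """
    dists = cdist(X, X, metric="euclidean")
    mask = dists < threshold
    np.fill_diagonal(mask, False)  # do not store explicit diagonal zeros
    rows, cols = np.nonzero(mask)
    return csr_matrix((dists[rows, cols], (rows, cols)), shape=dists.shape)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
D = thresholded_distance_matrix(X, threshold=1.0)
print(D.nnz)  # 2: only the two close points keep an edge (both directions)
```

Note that the isolated third point has no stored distances at all, which is exactly the disconnected-components problem raised above.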

@sa2329

sa2329 commented Aug 21, 2024

@KukumavMozolo any success with this? My (limited) experience is that hdbscan does not support sparse matrices for the distance matrix.

@KukumavMozolo

KukumavMozolo commented Aug 22, 2024

Unfortunately I gave up on it for now and just uniformly down-sampled the data :(

@KukumavMozolo

KukumavMozolo commented Sep 23, 2024

However, I revisited the problem and managed to precompute a sparse distance matrix semi-efficiently using numba (and numba-progress for verbosity), storing only distances smaller than a threshold. I also put an epsilon value on the diagonal of the distance matrix; I am not sure this is necessary. Below is example code for data encoded in scipy's sparse CSR format. One has to hand in data, indices and indptr directly, since csr_matrix is not well supported by numba. Once the distance matrix is computed, set hdbscan's metric to "precomputed":

import numba
import numpy as np
from numba import prange

# tup_type is a numba tuple type describing (distance, column index),
# e.g. numba.types.Tuple((numba.types.float32, numba.types.int64)).
@numba.jit(nogil=True, parallel=True)
def _pairwise_distances_sparse_dist(n_points: int, indptr: np.ndarray, indices: np.ndarray, data: np.ndarray, metric: str, progress, threshold: float, epsilon: float, tup_type):
    if metric == "euclidean_distance":
        func = sparse_euclidean
    else:
        raise NotImplementedError

    # One typed list of (distance, column) tuples per row; every slot is
    # overwritten inside the parallel loop below.
    results = [numba.typed.List.empty_list(tup_type)] * n_points
    for i in prange(n_points):
        i_start = indptr[i]
        i_end = indptr[i + 1]
        i_indices = indices[i_start:i_end]
        i_data = data[i_start:i_end].astype(np.float32)
        data_col = numba.typed.List.empty_list(tup_type)
        # epsilon on the diagonal keeps every point present in its own row
        data_col.append((epsilon, int(i)))
        # only the upper triangle (j > i) is computed and stored
        for j in range(i + 1, n_points):
            j_start = indptr[j]
            j_end = indptr[j + 1]
            j_indices = indices[j_start:j_end]
            j_data = data[j_start:j_end].astype(np.float32)
            res = func(i_indices, i_data, j_indices, j_data)
            if res < threshold:
                data_col.append((res, j))
        results[i] = data_col
        progress.update(1)

    # flatten the per-row lists into CSR components (data, indices, indptr)
    data = []
    indices = []
    row_ptr = [0]
    for idx, res in enumerate(results):
        row_ptr.append(row_ptr[idx] + len(res))
        for (d, c) in res:
            data.append(d)
            indices.append(c)
        results[idx] = numba.typed.List.empty_list(tup_type)  # free memory early

    return np.array(data), np.array(indices), np.array(row_ptr)
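The code above calls a `sparse_euclidean` helper that is not shown. A plain-NumPy sketch of what such a helper presumably computes is below: a merge over the two sorted CSR index arrays, accumulating squared differences. This version is my own guess at the missing function; the real one would have to be decorated with `@numba.njit` so it can be called from jitted code.

```python
import numpy as np

def sparse_euclidean(i_indices, i_data, j_indices, j_data):
    """Euclidean distance between two CSR rows given as (indices, values).

    Both index arrays are assumed sorted, as in a canonical CSR matrix.
    """
    total = 0.0
    a, b = 0, 0
    # merge the two sorted index arrays
    while a < len(i_indices) and b < len(j_indices):
        if i_indices[a] == j_indices[b]:
            diff = i_data[a] - j_data[b]
            total += diff * diff
            a += 1
            b += 1
        elif i_indices[a] < j_indices[b]:
            total += i_data[a] * i_data[a]  # feature missing from row j
            a += 1
        else:
            total += j_data[b] * j_data[b]  # feature missing from row i
            b += 1
    # leftover tail of either row
    while a < len(i_indices):
        total += i_data[a] * i_data[a]
        a += 1
    while b < len(j_indices):
        total += j_data[b] * j_data[b]
        b += 1
    return np.sqrt(total)
```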

This potentially produces a disconnected graph in some cases, so one needs to join the disconnected components.
Here I just picked the first node of each component and connected these representatives pairwise with a uniform max_distance value. This might affect the solution hdbscan computes, so one might want to do this differently (e.g. connect components at their closest pair of points, using the actual distance).

import time
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from tqdm import tqdm

def _connect_components(distance_matrix: csr_matrix, max_distance: float) -> csr_matrix:
    n_components, labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    unique_labels = np.unique(labels)
    start = time.time()
    # map each component label to one representative node (the first one seen)
    label_idx_dict = {}
    for idx, l in enumerate(labels):
        if l not in label_idx_dict:
            label_idx_dict[l] = idx
    end = time.time()
    print(f"Computing label_idx_dict took {end - start} seconds")
    others = unique_labels.copy()
    rows = []
    cols = []
    data = []
    # connect the representative nodes of every pair of components, symmetrically
    for l in tqdm(unique_labels, total=n_components):
        others = others[1:]
        l_idx = label_idx_dict[l]
        for other in others:
            other_idx = label_idx_dict[other]
            rows.append(l_idx)
            cols.append(other_idx)
            data.append(max_distance)
            rows.append(other_idx)
            cols.append(l_idx)
            data.append(max_distance)

    new_mat = csr_matrix((data, (rows, cols)), shape=distance_matrix.shape)
    start = time.time()
    distance_matrix = distance_matrix + new_mat
    end = time.time()
    print(f"Computing merged distance_matrix took {end - start} seconds")
    new_n_components, new_labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    print(new_n_components)  # should now be 1

    return distance_matrix
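To make the bridging step concrete, here is a tiny self-contained example of the same idea on a made-up 4-point matrix (the value 10.0 stands in for max_distance; both are invented for illustration, and only scipy is needed to run it):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy symmetric distance matrix with two components: {0, 1} and {2, 3}.
rows = [0, 1, 2, 3]
cols = [1, 0, 3, 2]
data = [0.5, 0.5, 0.7, 0.7]
D = csr_matrix((data, (rows, cols)), shape=(4, 4))
print(connected_components(D, directed=False, return_labels=False))  # 2

# Bridge the components' representative nodes (0 and 2) with a uniform
# max_distance, as _connect_components above does for arbitrarily many
# components.
bridge = csr_matrix(([10.0, 10.0], ([0, 2], [2, 0])), shape=(4, 4))
D = D + bridge
print(connected_components(D, directed=False, return_labels=False))  # 1
```

Once connected, the matrix could presumably be fed to `hdbscan.HDBSCAN(metric="precomputed").fit(D)`; note that the artificial bridge edges may still influence the resulting hierarchy, as discussed above.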
