
hdbscan and sparse precomputed distance matrix #636

Open
KukumavMozolo opened this issue May 24, 2024 · 3 comments

@KukumavMozolo

Hi!

I am working on the following problem: I have millions of sparse, very high-dimensional data points.

Using a sparse precomputed distance matrix seems like one way to feed this data into hdbscan.

My current idea is to store only those distances that are below a certain threshold (or a fixed number of distances per point) and then ensure that there are no disconnected components in the resulting graph.
How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? E.g., given a fixed min_cluster_size, min_samples and cluster_selection_epsilon, how would these constrain the threshold (or the number of distances per point) so that the resulting clustering is no different from the one obtained with the full distance matrix?
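For illustration, the thresholding idea can be sketched on a toy dense array with scipy. This is my own minimal sketch (the function name `thresholded_distance_matrix` is made up, and `cdist` on the full matrix obviously does not scale to millions of points; the real computation would have to be done blockwise or with a jitted kernel as below):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

def thresholded_distance_matrix(X: np.ndarray, threshold: float) -> csr_matrix:
    """Pairwise Euclidean distances, keeping only entries below `threshold`.

    Illustrative only: builds the full dense matrix first, so it is
    useless at the scale discussed in this issue.
    """
    dists = cdist(X, X, metric="euclidean")
    mask = dists < threshold
    np.fill_diagonal(mask, False)  # do not store explicit diagonal zeros
    rows, cols = np.nonzero(mask)
    return csr_matrix((dists[rows, cols], (rows, cols)), shape=dists.shape)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
D = thresholded_distance_matrix(X, threshold=1.0)
print(D.nnz)  # 2: only the two close points keep an edge (both directions)
```

Note that the isolated third point has no stored distances at all, which is exactly the disconnected-components problem raised above.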

@sa2329

sa2329 commented Aug 21, 2024

@KukumavMozolo any success with this? My (limited) experience is that hdbscan does not support sparse matrices for the distance matrix.

@KukumavMozolo

KukumavMozolo commented Aug 22, 2024

Unfortunately I gave up on it for now and just uniformly down-sampled the data :(

@KukumavMozolo

KukumavMozolo commented Sep 23, 2024

However, I revisited the problem and managed to precompute a sparse distance matrix semi-efficiently using numba (and numba-progress for verbosity), storing only distances smaller than a threshold. I also put an epsilon value on the diagonal of the distance matrix; I am not sure this is necessary. Below is example code for data encoded in scipy's sparse CSR format. One has to hand in data, indices and indptr directly, since csr_matrix is not well supported by numba. Once the distance matrix is computed, set hdbscan's metric to "precomputed":

import numba
import numpy as np
from numba import prange

# tup_type is a numba tuple type describing (distance, column index),
# e.g. numba.types.Tuple((numba.types.float32, numba.types.int64)).
@numba.jit(nogil=True, parallel=True)
def _pairwise_distances_sparse_dist(n_points: int, indptr: np.ndarray, indices: np.ndarray, data: np.ndarray, metric: str, progress, threshold: float, epsilon: float, tup_type):
    if metric == "euclidean_distance":
        func = sparse_euclidean
    else:
        raise NotImplementedError

    # One typed list of (distance, column) tuples per row; every slot is
    # overwritten inside the parallel loop below.
    results = [numba.typed.List.empty_list(tup_type)] * n_points
    for i in prange(n_points):
        i_start = indptr[i]
        i_end = indptr[i + 1]
        i_indices = indices[i_start:i_end]
        i_data = data[i_start:i_end].astype(np.float32)
        data_col = numba.typed.List.empty_list(tup_type)
        # epsilon on the diagonal keeps every point present in its own row
        data_col.append((epsilon, int(i)))
        # only the upper triangle (j > i) is computed and stored
        for j in range(i + 1, n_points):
            j_start = indptr[j]
            j_end = indptr[j + 1]
            j_indices = indices[j_start:j_end]
            j_data = data[j_start:j_end].astype(np.float32)
            res = func(i_indices, i_data, j_indices, j_data)
            if res < threshold:
                data_col.append((res, j))
        results[i] = data_col
        progress.update(1)

    # flatten the per-row lists into CSR components (data, indices, indptr)
    data = []
    indices = []
    row_ptr = [0]
    for idx, res in enumerate(results):
        row_ptr.append(row_ptr[idx] + len(res))
        for (d, c) in res:
            data.append(d)
            indices.append(c)
        results[idx] = numba.typed.List.empty_list(tup_type)  # free memory early

    return np.array(data), np.array(indices), np.array(row_ptr)
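The code above calls a `sparse_euclidean` helper that is not shown. A plain-NumPy sketch of what such a helper presumably computes is below: a merge over the two sorted CSR index arrays, accumulating squared differences. This version is my own guess at the missing function; the real one would have to be decorated with `@numba.njit` so it can be called from jitted code.

```python
import numpy as np

def sparse_euclidean(i_indices, i_data, j_indices, j_data):
    """Euclidean distance between two CSR rows given as (indices, values).

    Both index arrays are assumed sorted, as in a canonical CSR matrix.
    """
    total = 0.0
    a, b = 0, 0
    # merge the two sorted index arrays
    while a < len(i_indices) and b < len(j_indices):
        if i_indices[a] == j_indices[b]:
            diff = i_data[a] - j_data[b]
            total += diff * diff
            a += 1
            b += 1
        elif i_indices[a] < j_indices[b]:
            total += i_data[a] * i_data[a]  # feature missing from row j
            a += 1
        else:
            total += j_data[b] * j_data[b]  # feature missing from row i
            b += 1
    # leftover tail of either row
    while a < len(i_indices):
        total += i_data[a] * i_data[a]
        a += 1
    while b < len(j_indices):
        total += j_data[b] * j_data[b]
        b += 1
    return np.sqrt(total)
```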

This potentially produces a disconnected graph in some cases, so one needs to join the disconnected components.
Here I just picked the first node of each component and connected these representatives pairwise with a uniform max_distance value. This might affect the solution hdbscan computes, so one might want to do this differently (e.g. connect components at their closest pair of points, using the actual distance).

import time
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from tqdm import tqdm

def _connect_components(distance_matrix: csr_matrix, max_distance: float) -> csr_matrix:
    n_components, labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    unique_labels = np.unique(labels)
    start = time.time()
    # map each component label to one representative node (the first one seen)
    label_idx_dict = {}
    for idx, l in enumerate(labels):
        if l not in label_idx_dict:
            label_idx_dict[l] = idx
    end = time.time()
    print(f"Computing label_idx_dict took {end - start} seconds")
    others = unique_labels.copy()
    rows = []
    cols = []
    data = []
    # connect the representative nodes of every pair of components, symmetrically
    for l in tqdm(unique_labels, total=n_components):
        others = others[1:]
        l_idx = label_idx_dict[l]
        for other in others:
            other_idx = label_idx_dict[other]
            rows.append(l_idx)
            cols.append(other_idx)
            data.append(max_distance)
            rows.append(other_idx)
            cols.append(l_idx)
            data.append(max_distance)

    new_mat = csr_matrix((data, (rows, cols)), shape=distance_matrix.shape)
    start = time.time()
    distance_matrix = distance_matrix + new_mat
    end = time.time()
    print(f"Computing merged distance_matrix took {end - start} seconds")
    new_n_components, new_labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    print(new_n_components)  # should now be 1

    return distance_matrix
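To make the bridging step concrete, here is a tiny self-contained example of the same idea on a made-up 4-point matrix (the value 10.0 stands in for max_distance; both are invented for illustration, and only scipy is needed to run it):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy symmetric distance matrix with two components: {0, 1} and {2, 3}.
rows = [0, 1, 2, 3]
cols = [1, 0, 3, 2]
data = [0.5, 0.5, 0.7, 0.7]
D = csr_matrix((data, (rows, cols)), shape=(4, 4))
print(connected_components(D, directed=False, return_labels=False))  # 2

# Bridge the components' representative nodes (0 and 2) with a uniform
# max_distance, as _connect_components above does for arbitrarily many
# components.
bridge = csr_matrix(([10.0, 10.0], ([0, 2], [2, 0])), shape=(4, 4))
D = D + bridge
print(connected_components(D, directed=False, return_labels=False))  # 1
```

Once connected, the matrix could presumably be fed to `hdbscan.HDBSCAN(metric="precomputed").fit(D)`; note that the artificial bridge edges may still influence the resulting hierarchy, as discussed above.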
