
Subsample by observations grouping #987

Open
1 of 5 tasks
chsher opened this issue Jan 12, 2020 · 12 comments

Comments

@chsher

chsher commented Jan 12, 2020

  • Additional function parameters / changed functionality / changed defaults?
  • New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
  • New plotting function: A kind of plot you would like to see in sc.pl?
  • External tools: Do you know an existing package that should go into sc.external.*?
  • Other?

Related to scanpy.pp.subsample, it would be useful to have a subsampling tool that subsamples based on the key of an observations grouping. E.g., if I have an observation key 'MyGroup' with possible values ['A', 'B'], there are 10,000 cells of type 'A' and 2,000 cells of type 'B', and I want at most 5,000 cells of each type, then this function would subsample 5,000 cells of type 'A' but retain all 2,000 cells of type 'B'.

@LuckyMD
Contributor

LuckyMD commented Jan 14, 2020

Something like this should work (note: untested).

target_cells = 5000

adatas = [adata[adata.obs[cluster_key].isin([clust])] for clust in adata.obs[cluster_key].cat.categories]

for dat in adatas:
    if dat.n_obs > target_cells:
        sc.pp.subsample(dat, n_obs=target_cells)

adata_downsampled = adatas[0].concatenate(*adatas[1:])

Hope that helps.

@chsher
Author

chsher commented Dec 3, 2020

Thank you @LuckyMD, it worked!

@chsher chsher closed this as completed Dec 3, 2020
@giovp
Member

giovp commented Nov 18, 2021

I'll reopen this because I think it's still quite relevant, and it could be fairly straightforward to implement with sklearn's resample.

Also, there is an entire package for subsampling strategies which is probably quite relevant: https://github.com/scikit-learn-contrib/imbalanced-learn

Line here for reference: https://github.com/theislab/scanpy/blob/48cc7b38f1f31a78902a892041902cc810ddfcd3/scanpy/preprocessing/_simple.py#L857
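[Editor's sketch] To make the sklearn.utils.resample route concrete, here is a rough, untested-against-AnnData sketch; the `labels` Series and the `downsample_by_group` helper are hypothetical stand-ins for `adata.obs[cluster_key]` and a would-be scanpy function:

```python
# Sketch: per-group downsampling via sklearn.utils.resample.
# `labels` is a hypothetical stand-in for adata.obs[cluster_key].
import numpy as np
import pandas as pd
from sklearn.utils import resample

def downsample_by_group(labels, target, seed=0):
    """Return sorted row positions keeping at most `target` observations per group."""
    keep = []
    values = labels.to_numpy()
    for group in labels.unique():
        idx = np.flatnonzero(values == group)
        if len(idx) > target:
            # draw `target` rows without replacement from the over-represented group
            idx = resample(idx, replace=False, n_samples=target, random_state=seed)
        keep.extend(idx)
    return np.sort(np.array(keep))

labels = pd.Series(["A"] * 10000 + ["B"] * 2000)
kept = downsample_by_group(labels, target=5000)
# 'A' is capped at 5000 positions; all 2000 'B' positions are retained
```

With an AnnData object this would presumably be applied as `adata_downsampled = adata[kept].copy()`.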

@giovp giovp reopened this Nov 18, 2021
@giovp
Member

giovp commented Feb 17, 2022

back here reminding myself that this would be a very useful feature to have...

@ivirshup
Member

@bio-la also expressed some interest here on MM

@giovp, did you have a particular strategy in mind for resampling?

@giovp
Member

giovp commented Feb 17, 2022

So assuming that we are only interested in downsampling, then I'd say NearMiss and related methods are straightforward and scalable (they just need to compute a kmeans, which is really fast)

@giovp
Member

giovp commented Feb 28, 2022

also, the fact that reshuffling is performed is not in the docs and should be documented. @bio-la, do you plan to work on this?

@ivirshup
Member

then I'd say NearMiss and related methods are straightforward and scalable (they just need to compute a kmeans, which is really fast)

For sampling from datasets, I would want to go with either something extremely straightforward or something that has been shown to work. Maybe we could start with using provided labels to downsample by?

reshuffling is performed

Reshuffling meaning that the order is changed?


@chansigit

clust

In scanpy 1.8, this works:

target_cells = 3000

adatas = [adata_train[adata_train.obs[cluster_key].isin([clust])] for clust in adata_train.obs[cluster_key].cat.categories]

for dat in adatas:
    if dat.n_obs > target_cells:
        sc.pp.subsample(dat, n_obs=target_cells, random_state=0)

adata_train_downsampled1 = adatas[0].concatenate(*adatas[1:])

@stefanpeidli

stefanpeidli commented Jan 18, 2023

This function at least subsamples all classes in an obs column to the same number of cells. It would be straightforward to modify into what you probably have in mind.

import numpy as np

def obs_key_wise_subsampling(adata, obs_key, N):
    '''
    Subsample each class to the same cell number (N). Classes are given by obs_key pointing to a categorical in adata.obs.
    '''
    counts = adata.obs[obs_key].value_counts()
    # subsample indices per group defined by obs_key
    indices = [np.random.choice(adata.obs_names[adata.obs[obs_key] == group], size=N, replace=False) for group in counts.index]
    selection = np.concatenate(indices)
    return adata[selection].copy()

@royfrancis

royfrancis commented Jan 19, 2023

@stefanpeidli's code gives this error:

ValueError: Cannot take a larger sample than population when 'replace=False'

If a group has fewer than the required number of observations, it shouldn't be subsampled:

target_cells = 1000
cluster_key = "cell_type"

grouped = adata.obs.groupby(cluster_key)
downsampled_indices = []

for _, group in grouped:
    if len(group) > target_cells:
        downsampled_indices.extend(group.sample(target_cells).index)
    else:
        downsampled_indices.extend(group.index)

adata_downsampled = adata[downsampled_indices]
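[Editor's sketch] The same cap-at-target logic can also be written as a single pandas groupby/apply; a sketch on a toy obs table (the `obs` DataFrame and its cell counts are made up for illustration):

```python
import pandas as pd

target_cells = 1000
# toy stand-in for adata.obs
obs = pd.DataFrame({"cell_type": ["A"] * 3000 + ["B"] * 400})

# sample min(len(group), target_cells) rows from each group,
# keeping the original row index so it can be used to slice adata
downsampled = (
    obs.groupby("cell_type", group_keys=False)
       .apply(lambda g: g.sample(min(len(g), target_cells), random_state=0))
)
# with an AnnData object: adata_downsampled = adata[downsampled.index]
```

Groups above the cap are sampled down to `target_cells`; smaller groups pass through whole, so the `ValueError` above cannot occur.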
