Add rare cell denoising #27

Open
jkobject opened this issue Feb 24, 2025 · 10 comments
Labels
enhancement New feature or request

Comments

@jkobject
Contributor

jkobject commented Feb 24, 2025

Hello,

In our preprint on scPRINT, we mention the task of denoising. While we do not use the same metrics, the methods compared and the objectives are very similar: https://arc.net/l/quote/pcwqccbf

We argue that denoising is quite similar to in-silico library size augmentation. In this context, denoising common cells is not as interesting as denoising rare cell subgroups or subclusters, so looking at a model's ability on these subclusters would be very informative.

I would propose to duplicate the metrics and, using the cell type annotation of each dataset, report the denoising ability on the rarest cell type of each dataset (or on the set of rarest cell types with fewer than N (= 200?) cells).
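To make the proposal concrete, here is a minimal sketch of how the rare subset could be selected. The N = 200 threshold is the one suggested above; the column name and helper function are illustrative assumptions, not part of the benchmark.

```python
import numpy as np
import pandas as pd

def rare_cell_mask(cell_types: pd.Series, max_cells: int = 200) -> np.ndarray:
    """Boolean mask over cells whose annotated type has fewer than
    `max_cells` members in the dataset (hypothetical helper)."""
    counts = cell_types.value_counts()
    rare_types = counts[counts < max_cells].index
    return cell_types.isin(rare_types).to_numpy()

# e.g., with an AnnData object:
#   mask = rare_cell_mask(adata.obs["cell_type"])
#   adata_rare = adata[mask]
```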

Let me know what you think.

Best,

@jkobject jkobject added the enhancement New feature or request label Feb 24, 2025
@lazappi
Member

lazappi commented Feb 27, 2025

This sounds fairly reasonable to me but I'm not so familiar with this task. It's probably more a question for the maintainers (@wes-lewis, @scottgigante)

@wes-lewis
Contributor

wes-lewis commented Mar 12, 2025

Thank you @lazappi for the tag! To the OP: I would suggest adding the proposed function as an additional metric in your own branch of the repo, following the documentation. Metrics that use only the rarest cell types may still cover a very different number of cells depending on the dataset, so I would suggest carefully choosing the datasets/subsets used for this purpose.

I also wonder whether scPRINT is essentially different from the other methods in that it uses pre-training on a large external dataset. I think it would be great to add it in the meantime, and we may wish to update the output figures to track which tools are supervised vs. unsupervised in the future. I also wonder if there are other supervised methods to compare against for the denoising task.

(edited from an earlier comment)

@wes-lewis
Contributor

wes-lewis commented Mar 12, 2025

@jkobject I see that you made several commits and have a PR in progress. I know that @lazappi is more familiar with the viash framework than I am, but if you have questions related to the denoising task or would like to make larger changes I'd be happy to review or discuss.

@jkobject
Contributor Author

Hello @wes-lewis ,

It is something I can look into adding myself as a second metric (which is what I had in mind).

scPRINT is definitely different from the others, and we would certainly expect KNN-based methods to perform worse on this second metric. But the ability to denoise (or increase library size / impute missing zeros ...) is not as interesting if one already has many closely related cells. I believe it becomes more meaningful when a dataset only has a few cells of some cell type A; there one might want to impute some zeros or increase the library size.
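As a sketch of what reporting on the rare subset could look like (MSE is used here only as a stand-in error measure; the function names are hypothetical and not the task's actual metric code):

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between two (cells x genes) matrices."""
    return float(np.mean((pred - target) ** 2))

def scores_by_subset(denoised: np.ndarray, target: np.ndarray,
                     rare_mask: np.ndarray) -> dict:
    """Report the same error metric twice: over all cells, and restricted
    to the rare-cell subset (hypothetical helper for illustration)."""
    return {
        "mse_overall": mse(denoised, target),
        "mse_rare": mse(denoised[rare_mask], target[rare_mask]),
    }
```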

Best,

@rcannood
Member

Thanks for your interest @jkobject, and thanks for providing a link to the docs!

Related to the docs: Here is the specific page related to adding a metric: https://openproblems.bio/documentation/create_component/add_a_metric. I did notice a rendering artifact in the documentation, so in case it helps here is a link to the old documentation https://openproblems.netlify.app/documentation/create_component/add_a_metric.

Then again, this is not your first contribution so perhaps the information is redundant at this point ^^

For your info, there is always a weekly co-working meeting on Discord. If you have technical questions I'm always happy to help!

@jkobject
Contributor Author

That is great. It seems that right now only a couple of datasets are used. Why isn't this task using all the datasets that other tasks like batch correction and label prediction have? I think I would need these additional datasets to get meaningful results (there is quite some variance in results across sequencing technologies and cell types). How could we make it run on all those datasets?

Also, it seems that right now only 3000 cells are kept per dataset. I don't understand why.

@wes-lewis
Contributor

Hi @jkobject

Apologies for the delay.

Per your comment about rare cell types being interesting for denoising, I think I see your point about how scPRINT should offer greater accuracy than KNN-based methods. I would think that all unsupervised methods will perform worse at denoising rare cells than methods that use pre-training or other supervision to provide external information, which might help them treat nuanced and rare cell types differently.

We should maybe keep track of whether methods are supervised or unsupervised and include this in our output figure(s) for the website. I wonder if this could be done with a "method type" column (supervised/unsupervised), and whether we could add a filter based on method type to determine which are viewed. This could be important in the long run, since other supervised models could be added in the future.
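As a sketch of the kind of filtering such a column would enable (every method name, label, and score below is a made-up dummy value for illustration, not a real result):

```python
import pandas as pd

# Hypothetical results table with a "method_type" column; all values are
# illustrative placeholders, not benchmark output.
results = pd.DataFrame({
    "method": ["knn_smoothing", "magic", "scprint"],
    "method_type": ["unsupervised", "unsupervised", "supervised"],
    "score": [0.5, 0.6, 0.7],
})

# The kind of filter a website toggle could apply:
unsupervised_only = results[results["method_type"] == "unsupervised"]
```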

It looks like 4000 cells are kept per dataset currently, mediated by the parameter 'n_obs_limit'. I forget what the reason for imposing a maximum number of cells was, but it could have just been an early implementation choice we stuck with. I think it would be fine to get rid of this, especially if we were to consider more datasets for this task.

I agree with you that more datasets would be ideal. I'm not sure what the protocol for adding datasets is at the moment. I wonder if @rcannood or @lazappi could offer a bit of clarity on how to choose specific datasets to include or exclude now that we've moved to viash. I don't see an explicit snippet in the denoising repo where this is done, and in 1.0 this was pretty explicitly organized with .py files with imports/preprocessing for each dataset.

@lazappi
Member

lazappi commented Mar 17, 2025

A lot of the dataset processing was moved to a separate step so that it isn't repeated for each task, but some tasks do have additional processing or use specialised datasets (@rcannood knows more of the details).

My guess is that because this task hasn't had many updates recently, it hasn't been re-run with the newer datasets, which would explain why there are only a few small datasets in the current results.

@jkobject
Contributor Author

In preprocess_dataset/script.py there seems to be an n_obs_limit option that is set to 4000 obs per dataset.
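Such a cap usually amounts to a random subsample during preprocessing. As a sketch (a guess at the behaviour for illustration, not the actual openproblems code; only the parameter name n_obs_limit comes from the script):

```python
import numpy as np

def subsample_obs(n_obs: int, n_obs_limit: int = 4000, seed: int = 0) -> np.ndarray:
    """Indices of at most `n_obs_limit` cells, drawn without replacement
    (hypothetical reimplementation of such a cap)."""
    if n_obs <= n_obs_limit:
        return np.arange(n_obs)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_obs, size=n_obs_limit, replace=False))
```

Note that a uniform subsample like this can shrink already-rare cell types further or drop them entirely, which interacts directly with any rare-cell metric.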

@wes-lewis
Contributor

wes-lewis commented Mar 17, 2025

@jkobject Yup, n_obs_limit is specific to this task and it can be removed. I will add a PR to get rid of it.
