Add rare cell denoising #27

Open
jkobject opened this issue Feb 24, 2025 · 10 comments
Labels
enhancement New feature or request

Comments

@jkobject
Contributor

jkobject commented Feb 24, 2025

Hello,

In our preprint on scPRINT, we mention the task of denoising. While we do not use the same metrics, the methods compared and the objectives are very similar: https://arc.net/l/quote/pcwqccbf

We argue that denoising is quite similar to in-silico library size augmentation. In this context, denoising common cells is not as interesting as denoising rare cell subgroups or subclusters, so looking at a model's ability on these subclusters would be very informative.

I would propose to duplicate the metrics and, using the cell type annotation of each dataset, report the denoising ability on the rarest cell type of each dataset (or on the set of rarest cell types with fewer than N (= 200?) cells).
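To make the proposal concrete, here is a minimal sketch of how the rare subset could be selected. The N = 200 threshold is the one suggested above; the column name and helper function are illustrative assumptions, not part of the benchmark.

```python
import numpy as np
import pandas as pd

def rare_cell_mask(cell_types: pd.Series, max_cells: int = 200) -> np.ndarray:
    """Boolean mask over cells whose annotated type has fewer than
    `max_cells` members in the dataset (hypothetical helper)."""
    counts = cell_types.value_counts()
    rare_types = counts[counts < max_cells].index
    return cell_types.isin(rare_types).to_numpy()

# e.g., with an AnnData object:
#   mask = rare_cell_mask(adata.obs["cell_type"])
#   adata_rare = adata[mask]
```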

Let me know what you think.

Best,

@jkobject jkobject added the enhancement New feature or request label Feb 24, 2025
@lazappi
Member

lazappi commented Feb 27, 2025

This sounds fairly reasonable to me but I'm not so familiar with this task. It's probably more a question for the maintainers (@wes-lewis, @scottgigante)

@wes-lewis
Contributor

wes-lewis commented Mar 12, 2025

Thank you @lazappi for the tag! To the OP: I would suggest adding the proposed function as an additional metric in your own branch of the repo, following the documentation. Metrics that use only the rarest cell types may still cover a very different number of cells depending on the dataset, so I would suggest carefully choosing the datasets/subsets used for this purpose.

I also wonder whether scPRINT is essentially different from the other methods in that it uses pre-training on a large external dataset. I think it would be great to add it in the meantime, and we may wish to update the output figures to track which tools are supervised vs. unsupervised in the future. I also wonder if there are other supervised methods to compare against for the denoising task.

(edited from an earlier comment)

@wes-lewis
Contributor

wes-lewis commented Mar 12, 2025

@jkobject I see that you made several commits and have a PR in progress. I know that @lazappi is more familiar with the viash framework than I am, but if you have questions related to the denoising task or would like to make larger changes I'd be happy to review or discuss.

@jkobject
Contributor Author

Hello @wes-lewis ,

It is something I can look into adding myself as a second metric (which is what I had in mind).

scPRINT is definitely different from the others, and we would certainly expect KNN-based methods to perform worse on this second metric. But the ability to denoise (or increase library size / impute missing zeros ...) is not as interesting if one already has many closely related cells. I believe it becomes more meaningful when a dataset only has a few cells of some cell type A; there one might want to impute some zeros or increase the library size.
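As a sketch of what reporting on the rare subset could look like (MSE is used here only as a stand-in error measure; the function names are hypothetical and not the task's actual metric code):

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between two (cells x genes) matrices."""
    return float(np.mean((pred - target) ** 2))

def scores_by_subset(denoised: np.ndarray, target: np.ndarray,
                     rare_mask: np.ndarray) -> dict:
    """Report the same error metric twice: over all cells, and restricted
    to the rare-cell subset (hypothetical helper for illustration)."""
    return {
        "mse_overall": mse(denoised, target),
        "mse_rare": mse(denoised[rare_mask], target[rare_mask]),
    }
```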

Best,

@rcannood
Member

Thanks for your interest @jkobject, and thanks for providing a link to the docs!

Related to the docs: Here is the specific page related to adding a metric: https://openproblems.bio/documentation/create_component/add_a_metric. I did notice a rendering artifact in the documentation, so in case it helps here is a link to the old documentation https://openproblems.netlify.app/documentation/create_component/add_a_metric.

Then again, this is not your first contribution so perhaps the information is redundant at this point ^^

For your info, there is always a weekly co-working meeting on Discord. If you have technical questions I'm always happy to help!

@jkobject
Contributor Author

That is great. It seems that right now only a couple of datasets are used. Why isn't this task using all the datasets that other tasks like batch correction and label prediction have? I think I would need these additional datasets to get meaningful results (there is quite some variance in results across sequencing technologies and cell types). How could we make it run on all those datasets?

Also, it seems that right now only 3000 cells are kept per dataset. I don't understand why.

@wes-lewis
Contributor

Hi @jkobject

Apologies for the delay.

Per your comment about rare cell types being interesting for denoising, I think I see your point about how scPRINT should offer greater accuracy than KNN-based methods. I would think that all unsupervised methods will perform worse at denoising rare cells than methods that use pre-training or other supervision to provide external information, which might help them treat nuanced and rare cell types differently.

We should maybe keep track of whether methods are supervised or unsupervised and include this in our output figure(s) for the website. I wonder if this could be done with a "method type" column (supervised/unsupervised), and whether we could add a filter based on method type to determine which are viewed. This could be important in the long run, since other supervised models could be added in the future.
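As a sketch of the kind of filtering such a column would enable (every method name, label, and score below is a made-up dummy value for illustration, not a real result):

```python
import pandas as pd

# Hypothetical results table with a "method_type" column; all values are
# illustrative placeholders, not benchmark output.
results = pd.DataFrame({
    "method": ["knn_smoothing", "magic", "scprint"],
    "method_type": ["unsupervised", "unsupervised", "supervised"],
    "score": [0.5, 0.6, 0.7],
})

# The kind of filter a website toggle could apply:
unsupervised_only = results[results["method_type"] == "unsupervised"]
```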

It looks like 4000 cells are kept per dataset currently, mediated by the parameter 'n_obs_limit'. I forget what the reason for imposing a maximum number of cells was, but it could have just been an early implementation choice we stuck with. I think it would be fine to get rid of this, especially if we were to consider more datasets for this task.

I agree with you that more datasets would be ideal. I'm not sure what the protocol for adding datasets is at the moment. I wonder if @rcannood or @lazappi could offer a bit of clarity on how to choose specific datasets to include or exclude now that we've moved to viash. I don't see an explicit snippet in the denoising repo where this is done, and in 1.0 this was pretty explicitly organized with .py files with imports/preprocessing for each dataset.

@lazappi
Member

lazappi commented Mar 17, 2025

A lot of the dataset processing was moved to a separate step so that it isn't repeated for each task, but some tasks do have additional processing or use specialised datasets (@rcannood knows more of the details).

My guess is that because this task hasn't had many updates recently, it hasn't been re-run with the newer datasets, which would explain why there are only a few small datasets in the current results.

@jkobject
Contributor Author

In preprocess_dataset/script.py there seems to be an n_obs_limit option that is set to 4000 obs per dataset.
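Such a cap usually amounts to a random subsample during preprocessing. As a sketch (a guess at the behaviour for illustration, not the actual openproblems code; only the parameter name n_obs_limit comes from the script):

```python
import numpy as np

def subsample_obs(n_obs: int, n_obs_limit: int = 4000, seed: int = 0) -> np.ndarray:
    """Indices of at most `n_obs_limit` cells, drawn without replacement
    (hypothetical reimplementation of such a cap)."""
    if n_obs <= n_obs_limit:
        return np.arange(n_obs)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_obs, size=n_obs_limit, replace=False))
```

Note that a uniform subsample like this can shrink already-rare cell types further or drop them entirely, which interacts directly with any rare-cell metric.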

@wes-lewis
Contributor

wes-lewis commented Mar 17, 2025

@jkobject Yup, n_obs_limit is specific to this task and it can be removed. I will add a PR to get rid of it.
