Can I train scVI from scratch on my own cancer dataset, or do I need a reference? #3344
-
Hi everyone, I'm Alberto Atencia, a bioinformatics PhD student. I'm transitioning away from classical pipelines like Seurat and exploring scVI, which I find extremely interesting. However, there's one key aspect I don't fully understand and would appreciate clarification on: Can I train scVI directly from scratch on my own dataset (e.g., ovarian cancer scRNA-seq), or is it preferable, or even necessary, to train on or include a reference dataset? If training from scratch is acceptable, what guarantees or considerations should I keep in mind regarding performance, especially for tasks like clustering or batch correction? Thanks a lot for your time and for building such a great tool. Best regards,
-
Hello Alberto, and thank you for using the tool. You can train an scVI model either from scratch on your own data or starting from a reference model. We usually inspect UMAPs to assess the quality of the batch correction qualitatively, and use scib-metrics to assess it quantitatively, ideally comparing against other methods or embeddings (PCA as the most basic baseline). If you plan to use a reference model, choose the most relevant data you can: if your query is ovarian, use an ovarian reference, ideally generated on the same platform. The further the reference data is from your query data, the less likely the transfer learning is to work well (we use the scArches method). Can you rely on another annotation? This is a broader question, not necessarily related to scVI. It depends on whether you trust it and whether you have other options, such as being able to annotate the data yourself. For more information, see our documentation and tutorials, which include many examples that can help you: https://docs.scvi-tools.org/en/latest/index.html
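To make the from-scratch route concrete, here is a minimal sketch, not a definitive recipe. It assumes `adata.X` holds raw counts and that the batch labels live in an `adata.obs` column named `batch` (a hypothetical column name; use whatever your data has). The function name `train_scvi_from_scratch` is also just for illustration.

```python
def train_scvi_from_scratch(adata, batch_key="batch", max_epochs=None):
    """Train scVI on a single AnnData object, without any reference model.

    Assumes adata.X contains raw counts and adata.obs[batch_key] exists;
    "batch" is a hypothetical column name.
    """
    # Imported lazily so the heavy dependencies are only needed when called.
    import scanpy as sc
    import scvi

    # Register the count matrix and the batch covariate with scvi-tools.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)

    # Build and train the model; max_epochs=None keeps the library default.
    model = scvi.model.SCVI(adata)
    model.train(max_epochs=max_epochs)

    # Store the batch-corrected latent space and use it for the
    # neighbor graph, UMAP, and clustering.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
    return model
```

After calling it, coloring the UMAP by the batch column and by the Leiden clusters gives the qualitative check described above: batches should mix where the biology is shared, while distinct cell populations should stay separated.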
Another basic check is to monitor your loss during training: the ELBO loss should converge to a minimum on the training set without overfitting, which you can detect by tracking the same loss on a held-out validation set.
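As a rough illustration of that check, the sketch below takes a per-epoch validation loss and reports the epoch with the lowest value and whether the loss has failed to improve for a number of epochs. The helper `summarize_validation_loss` and the `patience` threshold are hypothetical and not part of scvi-tools. In scvi-tools, I would expect to get such a series from `model.history["elbo_validation"]` after training with `check_val_every_n_epoch=1`, but please verify that against the documentation for your version.

```python
def summarize_validation_loss(val_loss, patience=5):
    """Return (best_epoch, overfitting) for a per-epoch validation loss.

    best_epoch is the 0-based index of the lowest loss. overfitting is True
    when the loss has not improved during the last `patience` epochs.
    """
    if not val_loss:
        raise ValueError("val_loss must contain at least one epoch")

    # Index of the first epoch that reaches the minimum validation loss.
    best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])

    # Number of epochs trained after the best one.
    epochs_since_best = len(val_loss) - 1 - best_epoch
    return best_epoch, epochs_since_best >= patience


# Toy loss values, made up purely to show the call.
toy_loss = [10.0, 8.0, 7.0, 6.5, 6.6, 6.8, 7.0, 7.2, 7.5]
best_epoch, overfitting = summarize_validation_loss(toy_loss, patience=5)
```

If the training loss keeps falling while the validation loss has stalled or started rising, that is the overfitting pattern to watch for; stopping earlier or using early stopping is the usual response.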