Can I train scVI from scratch on my own cancer dataset, or do I need a reference? #3344

AlbertoTh98 · 2025-05-21T10:51:03Z

Hi everyone,

I'm Alberto Atencia, a bioinformatics PhD student.

I'm transitioning away from classical pipelines like Seurat and exploring scVI, which I find extremely interesting. However, there's one key aspect I don’t fully understand and would appreciate clarification on:

Can I train scVI directly from scratch on my own dataset (e.g., ovarian cancer scRNA-seq), or is it preferable—or even necessary—to train on or include a reference dataset?

If training from scratch is acceptable, what guarantees or considerations should I keep in mind regarding performance, especially for tasks like clustering or batch correction?
On the other hand, if pretraining on another dataset is beneficial, what type of dataset would be most appropriate (same tissue, same platform, etc.)? And is it viable to rely on someone else's annotated dataset?

Thanks a lot for your time and for building such a great tool.
And I’m sorry if this is a very basic question—I just want to ensure I’m using scVI correctly from the start.

Best regards,
Alberto Atencia

Answered by ori-kron-wis

ori-kron-wis · 2025-05-21T12:54:14Z

Maintainer

Hello Alberto, and thank you for using the tool.

You can train an scVI model either from scratch on your own dataset or starting from a reference model; both are supported.
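As a rough illustration of the from-scratch route, here is a minimal sketch. It assumes an AnnData object whose raw counts live in a layer named `counts` and whose batch labels are in `adata.obs["batch"]`; both names, the function name, and the `X_scVI` key are placeholders for this example, not requirements of scvi-tools.

```python
def train_scvi_from_scratch(adata, batch_key="batch", counts_layer="counts", n_latent=10):
    """Train scVI on a single dataset, with no reference model involved.

    Assumptions (illustrative only): raw counts are stored in
    adata.layers[counts_layer] and batch labels in adata.obs[batch_key].
    """
    import scvi  # imported lazily so the sketch can be read without scvi-tools installed

    # Register the count layer and the batch covariate with scvi-tools.
    scvi.model.SCVI.setup_anndata(adata, layer=counts_layer, batch_key=batch_key)

    # Build and train the model; validating every epoch records a validation ELBO.
    model = scvi.model.SCVI(adata, n_latent=n_latent)
    model.train(check_val_every_n_epoch=1)

    # Store the batch-corrected latent space for clustering and UMAP.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model
```

The latent representation stored in `adata.obsm["X_scVI"]` can then be passed to the usual neighbors, clustering, and UMAP steps in place of PCA.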

We usually use UMAPs to assess the quality of the batch correction visually, and scib-metrics to assess it quantitatively, ideally comparing against other methods or embeddings (PCA being the most basic baseline).
Another basic check is to look at your loss curves during training: the ELBO loss should converge to a minimum, and the loss on the validation set should not start rising, which would indicate overfitting.
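The loss check described above can be sketched as a small helper. This is only an illustrative heuristic, not part of scvi-tools: the function name, the window size, and the tolerance are all assumptions chosen for the example. It takes per-epoch training and validation ELBO losses (lower is better) as plain lists.

```python
def summarize_elbo(train_elbo, val_elbo, window=5, rel_tol=0.01):
    """Flag convergence and overfitting from per-epoch ELBO losses.

    Heuristic only: 'converged' means the mean training loss changed by less
    than rel_tol between the last two windows; 'overfitting' means the recent
    validation loss sits more than rel_tol above its earlier minimum.
    """
    if len(train_elbo) < 2 * window or len(val_elbo) < 2 * window:
        raise ValueError("need at least 2 * window epochs of history")

    def mean(values):
        return sum(values) / len(values)

    # Compare the last window of training loss with the window before it.
    recent_train = mean(train_elbo[-window:])
    previous_train = mean(train_elbo[-2 * window:-window])
    converged = abs(previous_train - recent_train) <= rel_tol * abs(previous_train)

    # Find the best validation epoch and see how far the loss has drifted since.
    best_epoch = min(range(len(val_elbo)), key=val_elbo.__getitem__)
    best_val = val_elbo[best_epoch]
    recent_val = mean(val_elbo[-window:])
    overfitting = (
        best_epoch < len(val_elbo) - window
        and (recent_val - best_val) > rel_tol * abs(best_val)
    )

    return {"converged": converged, "overfitting": overfitting, "best_val_epoch": best_epoch}
```

With a trained scVI model, the inputs would typically come from `model.history["elbo_train"]` and `model.history["elbo_validation"]`, which are single-column DataFrames and can be converted with something like `.iloc[:, 0].tolist()`. The validation history is only recorded when validation is enabled during training.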

If you plan to use a reference model, it is recommended to use the most relevant kind of data: for an ovarian query, an ovarian reference, ideally generated on the same platform. The further the reference data is from your query data, the less likely the transfer learning (we use the scArches method) is to work well.
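For the reference route, a minimal sketch of scArches-style query mapping might look like the following. It assumes `ref_model` is an already trained `scvi.model.SCVI` reference and `adata_query` holds raw counts for the new data; the function name, the epoch count, and the `X_scVI` key are illustrative choices rather than recommendations from this thread.

```python
def map_query_to_reference(adata_query, ref_model, max_epochs=200):
    """Fine-tune a trained scVI reference model on a new query dataset.

    Assumptions (illustrative only): ref_model is a trained scvi.model.SCVI
    instance and adata_query contains raw counts.
    """
    import scvi  # imported lazily so the sketch can be read without scvi-tools installed

    # Align the query genes with the genes the reference model was trained on.
    scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)

    # Create a query model that starts from the reference weights.
    query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)

    # Short fine-tuning run; weight decay is disabled here, which is an
    # assumption of this sketch for keeping the update close to the reference.
    query_model.train(max_epochs=max_epochs, plan_kwargs={"weight_decay": 0.0})

    # Query cells embedded in the reference latent space.
    adata_query.obsm["X_scVI"] = query_model.get_latent_representation()
    return query_model
```

How well this works depends on how the reference model was configured and trained, so it is worth checking the scArches section of the scvi-tools documentation before relying on it.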

Can you rely on someone else's annotation? This is a broader question, not specific to scVI. It depends on whether you trust that annotation and on what alternatives you have, such as the ability to annotate the data yourself.

For more information, see our documentation and tutorials with many examples that can help you: https://docs.scvi-tools.org/en/latest/index.html
Please post questions in our Discourse forum, which is the main place for Q&A, as others might look there: https://discourse.scverse.org/c/help/scvi-tools/7

1 reply

AlbertoTh98 May 21, 2025
Author

Thanks a lot for your time and effort! It is much clearer now!
