
Using Taxometer on 200 metagenomes #390

Closed
bhagavadgitadu22 opened this issue Jan 22, 2025 · 2 comments

@bhagavadgitadu22

Thanks for the great tool!

I want to try using Taxometer on 200 metagenomes from a similar environment (stream biofilms, with about 200 million 150 bp paired-end reads per sample). I have 40 million contigs >2000 bp (about 200,000 per sample) that I annotated with MMseqs2 against the GTDB database.

Because these are contigs, I cannot really cluster them across samples except for complete genomes (viruses, etc.): even if the same organism occurs in two samples, there is no reason for it to be fragmented at the same positions.

If I do not cluster, though, primary read mappings will be diluted between similar samples, and I fear this will significantly affect the abundances. If the same organism is present in two samples, nearly identical DNA will appear twice in the contig catalogue and reads will map randomly to one copy or the other: the abundance of the organism in each sample would be divided by two, and the problem would scale with the number of samples in which the organism is present.
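To make the concern concrete, here is a toy sketch (purely illustrative; the function name and numbers are made up) of how primary mappings get split when the same sequence occurs in several samples' contig catalogues:

```python
import random

def simulate_primary_mapping(n_reads: int, n_copies: int, seed: int = 0) -> list[int]:
    """Toy model: each read maps equally well to every copy of a sequence,
    and the aligner picks one copy at random as the primary alignment."""
    rng = random.Random(seed)
    counts = [0] * n_copies
    for _ in range(n_reads):
        counts[rng.randrange(n_copies)] += 1
    return counts

# An organism with 10,000 reads, present in 1, 2, 5 and 10 samples:
for k in (1, 2, 5, 10):
    print(k, simulate_primary_mapping(10_000, k))  # each copy gets roughly 10_000 / k reads
```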

Do you think I can run Taxometer despite this problem? Or could I allow multi-mapping to partly solve it (if Taxometer does not only consider primary reads)?

Also, will Taxometer be able to process that many contigs on one GPU in less than 3 days? Beyond that, I would run into problems with available resources ;)

Thanks,

@jakobnissen
Member

Taxometer can either use pycoverm to estimate abundances, or take precomputed abundances from a TSV file. Pycoverm does not use multi-mapping reads, so it does indeed dilute the signal as you describe. However, in our experiments, adding multi-mapping reads had no effect, so this dilution does not appear to be a problem, at least below one million contigs. That said, we have not tested anywhere close to 40M contigs.
Another option is to use strobealign --aemb, which will also be faster. It does use multi-mapping reads, splitting them deterministically rather than randomly.
My advice is to give it a try with strobealign --aemb; a rough sketch of the workflow follows below.
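A minimal sketch of that workflow, assuming strobealign --aemb is run once per sample against the combined contig catalogue and writes a headerless two-column TSV (contig name, abundance) to stdout. The merging helper below is just an illustration, not part of Vamb, and the exact abundance-TSV layout Taxometer expects should be checked against the Vamb documentation:

```python
# Sketch only: assumes each sample was aligned with something like
#   strobealign --aemb contigs.fna sampleX_R1.fq.gz sampleX_R2.fq.gz > sampleX.aemb.tsv
# and that each output is a headerless two-column TSV: contig name, estimated abundance.
from pathlib import Path
import pandas as pd

def merge_aemb_outputs(aemb_dir: str, out_tsv: str) -> None:
    """Join per-sample aemb tables into one contig-by-sample abundance matrix."""
    frames = []
    for path in sorted(Path(aemb_dir).glob("*.aemb.tsv")):
        sample = path.name.removesuffix(".aemb.tsv")
        frames.append(
            pd.read_csv(path, sep="\t", header=None,
                        names=["contig", sample], index_col="contig")
        )
    merged = pd.concat(frames, axis=1).fillna(0.0)
    merged.to_csv(out_tsv, sep="\t")

# Hypothetical paths for illustration:
merge_aemb_outputs("aemb_outputs", "abundance.tsv")
```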

I'm not sure it can do 40M contigs on one GPU within that time frame. I would guess yes, but you'll have to give it a go. Good luck!

@jakobnissen
Member

I'm going to close this as addressed.
