
Using Taxometer on 200 metagenomes #390

Closed
bhagavadgitadu22 opened this issue Jan 22, 2025 · 2 comments

@bhagavadgitadu22

Thanks for the great tool!

I want to try using Taxometer on 200 metagenomes from a similar environment (stream biofilms, with about 200 million 150 bp paired-end reads per sample). I have 40 million contigs >2000 bp (about 200,000 per sample) that I annotated with MMseqs2 against the GTDB database.

Because these are contigs, I cannot really cluster them across samples except for complete genomes (viruses, etc.): even if the same organism occurs in two samples, there is no reason for it to be fragmented at the same positions.

If I do not cluster, though, primary read mappings will be diluted between similar samples, and I fear this will significantly affect the abundances. If the same organism is present in two samples, nearly identical DNA will appear twice in the contig catalogue and reads will map randomly to one copy or the other: the abundance of the organism in each sample would be divided by two, and the problem would scale with the number of samples in which the organism is present.
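To make the concern concrete, here is a toy sketch (purely illustrative; the function name and numbers are made up) of how primary mappings get split when the same sequence occurs in several samples' contig catalogues:

```python
import random

def simulate_primary_mapping(n_reads: int, n_copies: int, seed: int = 0) -> list[int]:
    """Toy model: each read maps equally well to every copy of a sequence,
    and the aligner picks one copy at random as the primary alignment."""
    rng = random.Random(seed)
    counts = [0] * n_copies
    for _ in range(n_reads):
        counts[rng.randrange(n_copies)] += 1
    return counts

# An organism with 10,000 reads, present in 1, 2, 5 and 10 samples:
for k in (1, 2, 5, 10):
    print(k, simulate_primary_mapping(10_000, k))  # each copy gets roughly 10_000 / k reads
```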

Do you think I can run Taxometer despite this problem? Or could I allow multi-mapping to partly solve it (if Taxometer does not only consider primary reads)?

Also, will Taxometer be able to process that many contigs on one GPU in less than 3 days? Beyond that, I would run into problems with available resources ;)

Thanks,

@jakobnissen
Member

Taxometer can either use pycoverm to estimate abundances, or take precomputed abundances from a TSV file. Pycoverm does not use multi-mapping reads, so it does indeed dilute the signal as you describe. However, in our experiments, adding multi-mapping reads had no effect, so this dilution does not appear to be a problem, at least below one million contigs. That said, we have not tested anywhere close to 40M contigs.
Another option is to use strobealign --aemb, which will also be faster. It does use multi-mapping reads, splitting them deterministically rather than randomly.
My advice is to give it a try with strobealign --aemb; a rough sketch of the workflow follows below.
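A minimal sketch of that workflow, assuming strobealign --aemb is run once per sample against the combined contig catalogue and writes a headerless two-column TSV (contig name, abundance) to stdout. The merging helper below is just an illustration, not part of Vamb, and the exact abundance-TSV layout Taxometer expects should be checked against the Vamb documentation:

```python
# Sketch only: assumes each sample was aligned with something like
#   strobealign --aemb contigs.fna sampleX_R1.fq.gz sampleX_R2.fq.gz > sampleX.aemb.tsv
# and that each output is a headerless two-column TSV: contig name, estimated abundance.
from pathlib import Path
import pandas as pd

def merge_aemb_outputs(aemb_dir: str, out_tsv: str) -> None:
    """Join per-sample aemb tables into one contig-by-sample abundance matrix."""
    frames = []
    for path in sorted(Path(aemb_dir).glob("*.aemb.tsv")):
        sample = path.name.removesuffix(".aemb.tsv")
        frames.append(
            pd.read_csv(path, sep="\t", header=None,
                        names=["contig", sample], index_col="contig")
        )
    merged = pd.concat(frames, axis=1).fillna(0.0)
    merged.to_csv(out_tsv, sep="\t")

# Hypothetical paths for illustration:
merge_aemb_outputs("aemb_outputs", "abundance.tsv")
```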

I'm not sure it can do 40M contigs on one GPU within that time frame. I would guess yes, but you'll have to give it a go. Good luck!

@jakobnissen
Member

I'm going to close this as addressed.
