Thanks for the great tool!

I want to try Taxometer on 200 metagenomes from a similar environment (stream biofilms, with about 200 million 150 bp paired-end reads per sample). I have 40 million contigs >2000 bp (about 200,000 per sample) that I annotated with MMseqs2 against the GTDB database.
Because they are contigs, I cannot really cluster them across samples except for complete genomes (viruses, etc.): even if the same organism is present in two samples, I see no reason why it would be fragmented at the same positions.
If I do not cluster, however, primary read mappings will be diluted between similar samples, and I fear this will significantly affect the abundances: if the same organism is present in two samples, nearly identical sequence appears twice in the reference and reads map randomly to one copy or the other, so the apparent abundance of that organism in each sample is halved. The problem scales with the number of samples in which the organism is present.
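To make the arithmetic concrete, here is a toy simulation (purely illustrative; the numbers are made up) of what I expect to happen when reads pick one of several near-identical copies at random as their primary mapping:

```python
import random

# One organism present in n_samples samples, so its near-identical
# contigs appear n_samples times in the joint reference. Each read
# picks one copy uniformly at random as its primary mapping.
n_samples = 4
n_reads = 100_000  # reads from this organism in ONE sample

counts = [0] * n_samples
for _ in range(n_reads):
    counts[random.randrange(n_samples)] += 1

# Each copy receives ~ n_reads / n_samples primary mappings, so the
# apparent abundance is divided by n_samples (~0.25 each here).
print([round(c / n_reads, 3) for c in counts])
```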
Do you think I can run Taxometer despite this problem? Or could I allow multimapping to partly solve it (if Taxometer does not consider primary reads only)?
Also, will Taxometer be able to process that many contigs on one GPU in less than three days? Beyond that I would run into a resource problem ;)
Thanks,
Taxometer can either use pycoverm to estimate abundances, or take abundances from a precomputed TSV file. Pycoverm does not use multi-mapping, so it does indeed dilute the reads as you describe. However, in our experiments, adding multi-mapping reads had no effect, so the dilution does not appear to be a problem, at least below one million contigs. We have not tested anywhere near 40M contigs, though.
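For reference, here is a minimal sketch of what the pycoverm path looks like. The function name and keyword arguments are from memory of pycoverm's Python API, so double-check them against your installed version:

```python
# Sketch only: abundance estimation with pycoverm, similar to what
# Taxometer/Vamb does internally. Keyword arguments are from memory --
# verify against your installed pycoverm version.
import pycoverm

# Sorted BAMs of each sample's reads mapped to the joint catalogue.
bam_paths = ["sample1.bam", "sample2.bam"]  # hypothetical paths

# Returns contig names and a (n_contigs x n_samples) coverage matrix;
# only primary alignments are counted, hence the dilution you describe.
names, coverages = pycoverm.get_coverages_from_bam(
    bam_paths,
    threads=8,
    min_identity=0.95,
)
print(names[:3])
print(coverages[:3])
```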
Another option is to use strobealign --aemb, which will also be faster. It uses multi-mapping, splitting the reads deterministically rather than randomly.
My advice is to give it a try with strobealign --aemb.
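A sketch of that workflow: run strobealign --aemb once per sample against the joint contig catalogue, then merge the per-sample outputs into one abundance TSV. The TSV layout used here (a contig-name column plus one column per sample) is my assumption; check the Vamb documentation for the exact expected header:

```python
# Sketch: run strobealign --aemb per sample and merge the per-sample
# outputs into a single abundance TSV. Sample names and paths are
# hypothetical placeholders.
import subprocess

samples = ["sample1", "sample2"]  # hypothetical sample names
ref = "contigs.fasta"             # the joint contig catalogue

abundances = {}  # contig -> list of abundances, one per sample
for i, s in enumerate(samples):
    # --aemb prints one line per contig: "<contig>\t<estimated abundance>"
    out = subprocess.run(
        ["strobealign", "--aemb", "-t", "16", ref,
         f"{s}_R1.fastq.gz", f"{s}_R2.fastq.gz"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        contig, value = line.split("\t")
        abundances.setdefault(contig, [0.0] * len(samples))[i] = float(value)

with open("abundance.tsv", "w") as f:
    print("contigname", *samples, sep="\t", file=f)
    for contig, values in abundances.items():
        print(contig, *values, sep="\t", file=f)
```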
I'm not sure it can do 40M contigs in three days on one GPU. I would guess yes, but you'll have to give it a go. Good luck!