Hi, I'm trying to run MetaQUAST on a mock community metagenome of 8 organisms. The assembly and combined reference genomes are ~70 Mbp each, and I have 66 GB of uncompressed sequence reads in my forward and reverse FASTQ files. Everything runs well for most of the pipeline, but disk usage gets out of control when it begins running QUAST per reference. With all of the SAM and BAM files, I see ~250 GB for each reference it's running in parallel, which is currently 4, so 1 TB of space is being used.
Would it make more sense to instead run QUAST sequentially for each reference, giving the single QUAST command all the threading capacity specified in the MetaQUAST command? Even if it is slightly less efficient, I think this model would scale much better than the current parallel model when metagenomic samples contain tens or hundreds of organisms.
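To make the trade-off concrete, here is a small illustrative sketch of the two scheduling models. The function names and numbers are my own (hypothetical), using my case above: 250 GB of intermediates per in-flight reference and 4 parallel jobs.

```python
# Illustrative comparison of the two scheduling models (hypothetical
# helper functions; the figures mirror the numbers in my report above).

def parallel_model(total_threads, disk_per_ref_gb, jobs=4):
    # Current model: `jobs` references at once, threads split between
    # them, and peak disk usage stacks up across all in-flight jobs.
    return {"threads_per_ref": total_threads // jobs,
            "peak_disk_gb": jobs * disk_per_ref_gb}

def sequential_model(total_threads, disk_per_ref_gb):
    # Proposed model: one reference at a time gets every thread, and
    # peak disk usage is bounded by a single reference's intermediates.
    return {"threads_per_ref": total_threads,
            "peak_disk_gb": disk_per_ref_gb}

print(parallel_model(16, 250))    # peak disk: 4 * 250 = 1000 GB
print(sequential_model(16, 250))  # peak disk: 250 GB
```

The key point is that peak disk usage in the sequential model stays constant as the number of references grows, whereas in the parallel model it scales with the number of concurrent jobs.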
Alternatively, a flag (--cleanup?) could be added to remove intermediate files (e.g. sam, all.correct.sam, bam) as it runs. This would of course restrict the ability to resume a failed run, but at least there wouldn't be TBs of SAM files sitting around.
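A rough sketch of what I mean by cleaning up as it runs, using placeholder files in a temp directory (the file names and loop structure are illustrative, not MetaQUAST's actual layout): once the per-reference report is written, the alignment intermediates are deleted before moving on to the next reference.

```shell
#!/bin/sh
# Hypothetical per-reference cleanup loop. Each reference directory and
# file below is a stand-in; only the keep-report / drop-intermediates
# pattern is the point.
set -e
workdir=$(mktemp -d)
for ref in refA refB; do
    refdir="$workdir/$ref"
    mkdir -p "$refdir"
    # Stand-ins for the large alignment intermediates:
    : > "$refdir/reads.sam"
    : > "$refdir/all.correct.sam"
    : > "$refdir/reads.bam"
    # Stand-in for the per-reference output we actually want to keep:
    : > "$refdir/report.txt"
    # Cleanup step: delete intermediates as soon as the report exists,
    # so disk usage never accumulates across references.
    rm -f "$refdir"/*.sam "$refdir"/*.bam
done
ls "$workdir/refA"
```

With this pattern the peak footprint is one reference's worth of intermediates rather than all of them at once.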
Thanks!