Disk usage with Metaquast #106

Open
cmorganl opened this issue Jul 10, 2019 · 0 comments

@cmorganl

Hi, I'm trying to run MetaQUAST on a mock community metagenome of 8 organisms. The assembly and the combined reference genomes are ~70 Mbp each, and I have 66 GB of uncompressed sequence reads in my forward and reverse FASTQ files. Most of the pipeline runs fine, but disk usage gets out of control once it begins running QUAST per reference: with all of the SAM and BAM files, each reference being processed in parallel takes ~250 GB, and with 4 references running at once that is ~1 TB of disk space in use.

Would it make more sense to instead run QUAST sequentially for each reference, giving each single QUAST command all of the threads specified in the MetaQUAST command? I think this model would scale much better than the current parallel one when metagenomic samples contain tens or hundreds of organisms, even if it is slightly less efficient.
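
For example, something roughly like this, where the reference paths and output directories are just placeholders and I'm assuming quast.py's usual `-r`/`-t`/`-o` options:

```python
import subprocess

# Hypothetical per-reference FASTA files produced earlier in the MetaQUAST run;
# the paths here are illustrative only.
references = ["references/ref1.fasta", "references/ref2.fasta"]
assembly = "assembly.fasta"
threads = 16  # the full thread budget passed to MetaQUAST

# Run QUAST once per reference, one after another, giving each run all of the
# threads instead of splitting references across parallel QUAST processes.
for ref in references:
    out_dir = "runs_per_reference/" + ref.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    subprocess.run(
        ["quast.py", "-r", ref, "-t", str(threads), "-o", out_dir, assembly],
        check=True,
    )
```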

Alternatively, a flag (--cleanup?) could be added to remove intermediate files (e.g. .sam, all.correct.sam, .bam) as it runs. This would of course limit the ability to resume a failed run, but at least there wouldn't be TBs of SAM files sitting around.
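
A rough sketch of the cleanup idea, with the file patterns and directory layout assumed from the intermediates named above rather than taken from MetaQUAST's actual output structure:

```python
import glob
import os

def cleanup_alignment_intermediates(ref_output_dir):
    """Delete large alignment intermediates once a per-reference run finishes.

    The *.sam / *.bam patterns and recursive layout are assumptions based on
    the files mentioned above, not MetaQUAST's documented output structure.
    """
    for pattern in ("*.sam", "*.bam"):
        for path in glob.glob(os.path.join(ref_output_dir, "**", pattern),
                              recursive=True):
            os.remove(path)
```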

Thanks!
