Bottleneck at pre-processing #34

hsiaoyi0504 · 2017-08-14T17:18:52Z

/usr/lib/python3.5/site-packages/quast-4.5-py3.5.egg/EGG-INFO/scripts/quast.py assembly_metrics/sample_data/BCM-After-Atlas/Contigs/Clec_Bbug02212013.contigs.fa.gz

Version: 4.5

System information:
  OS: Linux-3.10.0-327.28.2.el7.x86_64-x86_64-with-centos-7.2.1511-Core (linux_64)
  Python version: 3.5.3
  CPUs number: 4

Started: 2017-08-15 12:59:09

Logging to /home/huei820504/quast_results/results_2017_08_15_12_59_09/quast.log
NOTICE: Maximum number of threads is set to 1 (use --threads option to set it manually)

CWD: /home/huei820504
Main parameters:
  Threads: 1, minimum contig length: 500, ambiguity: one, threshold for extensive misassembly size: 1000

Contigs:
  Pre-processing...
  assembly_metrics/sample_data/BCM-After-Atlas/Contigs/Clec_Bbug02212013.contigs.fa.gz ==> Clec_Bbug02212013.contigs

2017-08-15 13:00:25
Running Basic statistics processor...
  Contig files:
    Clec_Bbug02212013.contigs
  Calculating N50 and L50...
    Clec_Bbug02212013.contigs, N50 = 23541, L50 = 5952, Total length = 513170376, GC % = 34.82, # N's per 100 kbp =  0.00
  Drawing Nx plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/Nx_plot.pdf
  Drawing cumulative plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/cumulative_plot.pdf
  Drawing GC content plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/GC_content_plot.pdf
  Drawing Clec_Bbug02212013.contigs GC content plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/Clec_Bbug02212013.contigs_GC_content_plot.pdf
Done.

NOTICE: Genes are not predicted by default. Use --gene-finding option to enable it.

2017-08-15 13:01:19
Creating large visual summaries...
This may take a while: press Ctrl-C to skip this step..
  1 of 2: Creating Icarus viewers...
  2 of 2: Creating PDF with all tables and plots...
Done

2017-08-15 13:01:34
RESULTS:
  Text versions of total report are saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.txt, report.tsv, and report.tex
  Text versions of transposed total report are saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
  HTML version (interactive tables and plots) saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.html
  PDF version (tables and plots) is saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.pdf
  Icarus (contig browser) is saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/icarus.html
  Log saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/quast.log

Finished: 2017-08-15 13:01:34
Elapsed time: 0:02:25.053717
NOTICEs: 2; WARNINGs: 0; non-fatal ERRORs: 0

Thank you for using QUAST!

Look at the running time of pre-processing: it runs from 2017-08-15 12:59:09 to 2017-08-15 13:00:25.
That is 00:01:16, over half of the total running time of 0:02:25.
That's weird.
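The arithmetic can be checked directly from the timestamps in the log. This is only a small sketch: the variable names are made up for illustration, and it uses the whole-second timestamps, so the total comes out as 0:02:25 rather than the 0:02:25.053717 QUAST reports.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log above.
started = datetime.strptime("2017-08-15 12:59:09", FMT)             # "Started:"
preprocessing_done = datetime.strptime("2017-08-15 13:00:25", FMT)  # first stamp after "Pre-processing..."
finished = datetime.strptime("2017-08-15 13:01:34", FMT)            # "Finished:"

preprocessing = preprocessing_done - started
total = finished - started
share = preprocessing / total  # timedelta / timedelta gives a float

print(preprocessing, total, "{:.1f}%".format(share * 100))
```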

I see there is a write operation at quast/quast_libs/qutils.py, line 129 in e0e6212:

fastaparser.write_fasta(corrected_fpath, modified_fasta_entries)

which could perhaps be avoided.


alexeigurevich · 2017-08-16T13:19:59Z

Hi,
I agree that this looks weird, but this example is not a common case. The pipeline usually contains much more time-consuming steps, so pre-processing takes just a fraction of the total running time. E.g. the contig alignment step (if a reference genome is specified with -R) or gene prediction (if --gene-finding is specified) takes the majority of the time.
In a simple case like the one you showed here, one can use the --no-check option to skip the additional correction of input contigs (the correction is needed to prevent third-party tools such as gene prediction software and sequence aligners from failing).
However, thank you for pointing out this issue. Maybe we need to automatically detect "simple runs" and use --no-check by default in these cases.

hsiaoyi0504 · 2017-08-17T08:14:42Z

OK, I understand your point, but I think it can still be sped up without making --no-check the default, which would be somewhat dangerous. Many metrics can be calculated directly from the corrected version of the FASTA in memory, without writing it to disk and reading it back again.
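To make the suggestion concrete, here is a rough sketch of what I mean. It is not QUAST's actual code: read_fasta, correct_entries, open_fasta and n50_l50 are hypothetical helpers, and the "correction" shown (trimming the name and upper-casing the sequence) is only a placeholder for whatever the real correction step does. The point is that the corrected entries stay in memory and the basic statistics are computed from them directly.

```python
import gzip
import io
import re


def open_fasta(path):
    """Open a plain or gzipped FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def correct_entries(entries):
    """Placeholder correction (an assumption, not QUAST's real rules):
    keep the first word of the name, replace odd characters, upper-case."""
    for name, seq in entries:
        words = name.split()
        first = words[0] if words else "unnamed"
        yield re.sub(r"\W", "_", first), seq.upper()


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths, (0, 0) if empty."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0


# Tiny in-memory demo; a real run would pass open_fasta(path) instead.
demo = io.StringIO(">c1 first\nacgtacgt\n>c2\nACGT\n>c3\nAC\n")
entries = list(correct_entries(read_fasta(demo)))
min_contig = 3  # stands in for the "minimum contig length" parameter
lengths = [len(seq) for _, seq in entries if len(seq) >= min_contig]
print(n50_l50(lengths))
```

The corrected file would then only need to be written when a later step actually hands it to an external tool, for example when -R or --gene-finding is given.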

alexeigurevich · 2017-08-17T08:22:57Z

Hi, I totally agree that this could be improved. However, we are currently working on other parts of the project (new functionality), so if you have time to do this, it would be great! User patches are very welcome.
