Bottleneck at pre-processing #34

Open
hsiaoyi0504 opened this issue Aug 14, 2017 · 3 comments


hsiaoyi0504 commented Aug 14, 2017

/usr/lib/python3.5/site-packages/quast-4.5-py3.5.egg/EGG-INFO/scripts/quast.py assembly_metrics/sample_data/BCM-After-Atlas/Contigs/Clec_Bbug02212013.contigs.fa.gz

Version: 4.5

System information:
  OS: Linux-3.10.0-327.28.2.el7.x86_64-x86_64-with-centos-7.2.1511-Core (linux_64)
  Python version: 3.5.3
  CPUs number: 4

Started: 2017-08-15 12:59:09

Logging to /home/huei820504/quast_results/results_2017_08_15_12_59_09/quast.log
NOTICE: Maximum number of threads is set to 1 (use --threads option to set it manually)


CWD: /home/huei820504
Main parameters:
  Threads: 1, minimum contig length: 500, ambiguity: one, threshold for extensive misassembly size: 1000

Contigs:
  Pre-processing...
  assembly_metrics/sample_data/BCM-After-Atlas/Contigs/Clec_Bbug02212013.contigs.fa.gz ==> Clec_Bbug02212013.contigs

2017-08-15 13:00:25
Running Basic statistics processor...
  Contig files:
    Clec_Bbug02212013.contigs
  Calculating N50 and L50...
    Clec_Bbug02212013.contigs, N50 = 23541, L50 = 5952, Total length = 513170376, GC % = 34.82, # N's per 100 kbp =  0.00
  Drawing Nx plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/Nx_plot.pdf
  Drawing cumulative plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/cumulative_plot.pdf
  Drawing GC content plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/GC_content_plot.pdf
  Drawing Clec_Bbug02212013.contigs GC content plot...
    saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/basic_stats/Clec_Bbug02212013.contigs_GC_content_plot.pdf
Done.

NOTICE: Genes are not predicted by default. Use --gene-finding option to enable it.

2017-08-15 13:01:19
Creating large visual summaries...
This may take a while: press Ctrl-C to skip this step..
  1 of 2: Creating Icarus viewers...
  2 of 2: Creating PDF with all tables and plots...
Done

2017-08-15 13:01:34
RESULTS:
  Text versions of total report are saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.txt, report.tsv, and report.tex
  Text versions of transposed total report are saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
  HTML version (interactive tables and plots) saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.html
  PDF version (tables and plots) is saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/report.pdf
  Icarus (contig browser) is saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/icarus.html
  Log saved to /home/huei820504/quast_results/results_2017_08_15_12_59_09/quast.log

Finished: 2017-08-15 13:01:34
Elapsed time: 0:02:25.053717
NOTICEs: 2; WARNINGs: 0; non-fatal ERRORs: 0

Thank you for using QUAST!

Look at the running time of pre-processing:
2017-08-15 12:59:09 to 2017-08-15 13:00:25
That is, it takes 00:01:16, which is over half of the total running time of 0:02:25.
That's weird.
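
The share can be checked directly from the timestamps in the log above. This is only a quick arithmetic sketch; the helper name `seconds_between` is made up for illustration and is not part of QUAST.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def seconds_between(start, end):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# Timestamps copied from the log: start of the run, start of the
# "Basic statistics" step (i.e. end of pre-processing), and finish.
pre = seconds_between("2017-08-15 12:59:09", "2017-08-15 13:00:25")
total = seconds_between("2017-08-15 12:59:09", "2017-08-15 13:01:34")
share = pre / total

print(f"pre-processing: {pre:.0f} s of {total:.0f} s ({share:.0%})")
# prints "pre-processing: 76 s of 145 s (52%)"
```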

I see there is a write operation at

fastaparser.write_fasta(corrected_fpath, modified_fasta_entries)

which could perhaps be avoided.

@alexeigurevich
Contributor

Hi,
I agree that this looks weird, but this example is not a common case. The pipeline usually contains much more time-consuming steps, so pre-processing takes just a fraction of the total running time. E.g., the contig alignment step (if a reference genome is specified with -R) or gene prediction (if --gene-finding is specified) takes the majority of the time.
In a simple case like the one you showed here, one can use the --no-check option to skip the additional correction of input contigs (the correction is needed to prevent failures of third-party tools such as gene prediction software and sequence aligners).
However, thank you for pointing out this issue. Maybe we need to automatically detect "simple runs" and use --no-check by default in these cases.
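
For reference, the simple run from this issue with the check skipped could be launched roughly as follows. This is a sketch that only builds and prints the command line; it assumes `quast.py` is on the `PATH`, and the helper `quast_command` is hypothetical. Only the `--no-check` and `--threads` options mentioned in this thread are used.

```python
import shlex

def quast_command(contigs, no_check=True, threads=None):
    """Build a QUAST command line as a list of arguments (hypothetical helper)."""
    cmd = ["quast.py"]  # assumed to be on PATH
    if no_check:
        cmd.append("--no-check")  # skip the correction of input contigs
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd.append(contigs)
    return cmd

cmd = quast_command(
    "assembly_metrics/sample_data/BCM-After-Atlas/Contigs/Clec_Bbug02212013.contigs.fa.gz",
    threads=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to actually start the run.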

@hsiaoyi0504
Author

hsiaoyi0504 commented Aug 17, 2017

OK, I understand your point, but I think it can still be accelerated without making --no-check the default. Making --no-check the default is somewhat dangerous. For the calculation of many metrics, QUAST could directly use the corrected version of the FASTA in memory, without writing it to disk and reading it back again.
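
A rough sketch of that idea is below. It is not QUAST's actual code: `read_fasta`, `correct_entries`, `n50_l50` and `gc_percent` are hypothetical names, and the "correction" here (uppercasing and dropping short contigs) is only a stand-in for whatever the real check does. The point is that the basic statistics are computed from the corrected entries held in memory, so the corrected file would only need to be written when a third-party tool requires it.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def correct_entries(entries, min_contig=500):
    """Stand-in for the correction step: uppercase, drop short contigs."""
    return [(n, s.upper()) for n, s in entries if len(s) >= min_contig]

def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    return 0, 0

def gc_percent(seqs):
    """Return the GC percentage over all (uppercase) sequences."""
    total = sum(len(s) for s in seqs)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return 100.0 * gc / total if total else 0.0

# Tiny in-memory example; a real input would be a file handle.
raw = io.StringIO(">c1\nACGTACGTAC\n>c2\nacgtac\n>c3\nGG\n")
corrected = correct_entries(read_fasta(raw), min_contig=4)

# Statistics come straight from memory, with no write-then-read round trip.
seqs = [s for _, s in corrected]
n50, l50 = n50_l50([len(s) for s in seqs])
gc = gc_percent(seqs)
print(n50, l50, gc)  # prints "10 1 50.0"
```

For a gzipped input, a handle from `gzip.open(path, "rt")` could be passed to `read_fasta` in the same way.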

@alexeigurevich
Contributor

Hi, I totally agree that this could be improved. However, we are currently working on other parts of the project (new functionality), so if you have time to do this, it would be great! User patches are very welcome.
