nf-core/genomeassembler

Introduction

nf-core/genomeassembler is a bioinformatics pipeline that carries out genome assembly, polishing and scaffolding from long reads (ONT or PacBio). Assembly can be done with flye or hifiasm; polishing can be carried out with medaka (ONT only) or pilon (requires short reads); and scaffolding can be done with LINKS, Longstitch, or RagTag (if a reference is available). Quality control includes BUSCO, QUAST and merqury (requires short reads). Currently, this pipeline does not implement phasing of polyploid genomes or Hi-C scaffolding.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,ontreads,hifireads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,ontreads.fa.gz,hifireads.fa.gz,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true

Each row represents one genome to be assembled. sample contains the name of the sample, ontreads a path to ONT reads (fastq.gz), hifireads a path to HiFi reads (fastq.gz), and ref_fasta and ref_gff the reference genome FASTA and annotation files. shortread_F and shortread_R contain paths to short-read data, and paired indicates whether the short reads are paired. Columns that contain no data can be omitted, with one exception: shortread_R must be present whenever shortread_F is, even if it is empty.
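
A row with the wrong number of fields is an easy mistake to make, so a quick check before launching may help. The following is a minimal sketch and not part of the pipeline: it writes an example samplesheet (all file names are placeholders) and uses awk to report any row whose field count differs from the header. The helper name check_samplesheet is hypothetical, and the check assumes that no field contains a comma.

```shell
# Write an example samplesheet; all file names are placeholders.
cat > samplesheet.csv <<'EOF'
sample,ontreads,hifireads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,ontreads.fa.gz,hifireads.fa.gz,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true
EOF

# check_samplesheet is a hypothetical helper, not provided by the pipeline.
# It prints every row whose number of comma-separated fields differs from
# the header, and returns non-zero if at least one such row exists.
check_samplesheet() {
    awk -F',' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        END     { exit (bad ? 1 : 0) }
    ' "$1"
}

check_samplesheet samplesheet.csv && echo "samplesheet.csv: column counts OK"
```

This only checks the shape of the file; it does not verify that the listed paths exist or that the column names are ones the pipeline recognises.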

Now, you can run the pipeline using:

nextflow run nf-core/genomeassembler \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
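
As an illustration of the -params-file route, the sketch below writes a small YAML file and shows, as a comment, how it would be passed to Nextflow. The parameter names are taken from this README; the values, the file name params.yaml and the output directory name are assumptions chosen for an ONT-only flye run, not recommended settings, and a real run may need further parameters.

```shell
# Write a params file. Parameter names come from this README;
# the values are illustrative assumptions only.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
outdir: "results"
ont: true
assembler: "flye"
polish_medaka: true
busco: true
EOF

# Launch with the params file (requires Nextflow; replace docker with the
# profile that matches your system):
#   nextflow run nf-core/genomeassembler -profile docker -params-file params.yaml
```

Keeping parameters in a file like this makes a run easier to repeat than a long list of command-line flags.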

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pre-set profiles

To ease configuration, the pipeline provides pre-defined profiles for several combinations of read sources and assemblers, named readtype_assembler:

| ONT | HiFi | Assembly strategy | Profile name |
| --- | --- | --- | --- |
| Yes | No | flye | ont_flye |
| No | Yes | flye | hifi_flye |
| No | Yes | hifiasm | hifi_hifiasm |
| Yes | Yes | hifiasm --ul | hifiont_hifiasm |
| Yes | Yes | Scaffolding of ONT assemblies onto HiFi assemblies | hifiont_flyehifiasm |
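
The table can be read as a lookup from the available reads and the assembly strategy to a profile name. The sketch below encodes it in a hypothetical shell helper, pick_profile, which is not part of the pipeline. The argument encoding (yes/no flags followed by a strategy keyword) is an assumption made for this example, as is the use of flye_on_hifiasm to stand for the scaffolding strategy in the last row.

```shell
# pick_profile ONT HIFI STRATEGY  (hypothetical helper, not part of the pipeline)
#   ONT, HIFI : "yes" or "no" (are these reads available?)
#   STRATEGY  : flye, hifiasm, or flye_on_hifiasm
# Prints the matching pre-set profile name, or returns 1 if none matches.
pick_profile() {
    case "$1:$2:$3" in
        yes:no:flye)             echo "ont_flye" ;;
        no:yes:flye)             echo "hifi_flye" ;;
        no:yes:hifiasm)          echo "hifi_hifiasm" ;;
        yes:yes:hifiasm)         echo "hifiont_hifiasm" ;;
        yes:yes:flye_on_hifiasm) echo "hifiont_flyehifiasm" ;;
        *) echo "no pre-set profile for ONT=$1 HiFi=$2 strategy=$3" >&2
           return 1 ;;
    esac
}

# Example: ONT reads only, assembled with flye.
profile=$(pick_profile yes no flye)

# Print the resulting launch command rather than running it. Nextflow accepts
# several profiles separated by commas; docker is used here as an example.
echo "nextflow run nf-core/genomeassembler -profile docker,${profile} --input samplesheet.csv --outdir <OUTDIR>"
```

Combinations that are not listed in the table, such as ONT reads only with hifiasm, fall through to the error branch.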

Pipeline specific parameters

| Parameter | Description | Type | Default | Required | Hidden |
| --- | --- | --- | --- | --- | --- |
| ont | ONT reads available? | boolean | | | |
| hifi | HiFi reads available? | boolean | | | |
| short_reads | Short reads available? | boolean | | | |
| collect | Collect ONT reads into a single file | boolean | | | |
| porechop | Run porechop on ONT reads | boolean | | | |
| lima | Run lima on HiFi reads? | boolean | | | |
| pacbio_primers | File containing PacBio primers for trimming with lima | string | | | |
| trim_short_reads | Trim short reads with trimgalore | boolean | | | |
| assembler | Assembler to use. Valid choices are 'hifiasm', 'flye', or 'flye_on_hifiasm'. flye_on_hifiasm scaffolds the flye assembly (ONT) onto the hifiasm assembly (HiFi) using ragtag | string | | | |
| kmer_length | k-mer length to be used for jellyfish | integer | | | |
| read_length | Read length for genomescope (ONT only) | string | | | |
| dump | Dump jellyfish output | boolean | | | |
| meryl_k | k-mer length for meryl | integer | | | |
| use_ref | Use reference genome | boolean | | | |
| genome_size | Expected genome size | string | | | |
| flye_mode | flye mode | string | "--nano-hq" | | |
| flye_args | Additional arguments for flye | string | "" | | |
| qc_reads | Long reads that should be used for QC when both ONT and HiFi reads are provided. Options are 'ONT' or 'HIFI' | string | "ONT" | | |
| hifiasm_ont | Use HiFi and ONT reads with hifiasm --ul | boolean | | | |
| hifiasm_args | Extra arguments passed to hifiasm | string | "" | | |
| polish_pilon | Polish assembly with pilon? | boolean | | | |
| polish_medaka | Polish assembly with medaka (ONT only) | boolean | | | |
| medaka_model | Model to use with medaka | string | 'r1041_e82_400bps_hac_v4.2.0' | | |
| scaffold_ragtag | Scaffold with ragtag (requires reference)? | boolean | | | |
| scaffold_links | Scaffold with LINKS? | boolean | | | |
| scaffold_longstitch | Scaffold with longstitch? | boolean | | | |
| lift_annotations | Lift over annotations (requires reference)? | boolean | | | |
| busco | Run BUSCO? | boolean | | | |
| busco_db | Path to BUSCO db | string | '' | | |
| busco_lineage | BUSCO lineage to use | string | "brassicales_odb10" | | |
| quast | Run QUAST | boolean | | | |
| skip_assembly | Skip assembly steps and perform only QC | boolean | | | |
| skip_alignments | Skip alignments during QC | boolean | | | |
| jellyfish | Run jellyfish and genomescope on ONT reads to compute the k-mer distribution and estimate genome size | boolean | | | |
| yak | Run QC via yak | boolean | | | |

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Credits

nf-core/genomeassembler was originally written by Niklas Schandry (@nschan).

We thank the following people for their extensive assistance in the development of this pipeline:

  • Mahesh Binzer-Panchal (@mahesh-panchal)
  • Matthias Hörtenhuber (@mashehu)
  • Daniel Straub (@d4straub)

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #genomeassembler channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.