A Snakemake workflow for the identification of variants in bacterial genomes using nanopore long-read sequencing.
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
This workflow provides a simple and easy-to-use framework for the identification of structural and small nucleotide variants in bacterial genomes using nanopore long-read sequencing data.
The snakemake-ont-bacterial-variants
workflow is built using snakemake and consists of the following steps:
- Quality check of sequencing data (
FastQC
) - Filtering of input sequencing data by read length and quality (
Filtlong
) - Mapping to reference genome (
NGMLR
) - Calling of structural and single nucleotide variants (SVs:
cuteSV
andSniffles2
; SNVs:Clair3
andMedaka
) - Filtering of identified variants (e.g., by variant quality or genomic regions;
BCFtools
andVCFtools
) - Generate report with final results (
R markdown
,igv-reports
, andMultiQC
)
If you would like to contribute, report issues, or suggest features, please get in touch on GitHub.
Step 1: Install snakemake with conda
in a new conda environment.
conda create -n <ENV> snakemake pandas
Step 2: Activate conda environment with snakemake
conda activate <ENV>
Important note:
All other dependencies for the workflow are automatically pulled as conda
environments by snakemake, when running the workflow with the --use-conda
parameter (recommended).
When run without automatically built conda
environments, all packages need to be installed manually:
NanoPlot
MultiQC
Filtlong
NGMLR
minimap2
samtools
bedtools
Medaka
Clair3
cuteSV
Sniffles2
bcftools
VCFtools
r-tidyverse
r-rmarkdown
r-dt
igv-reports
The workflow requires the following files to be located in the data
directory:
- Whole-genome sequencing data in
*.fastq.gz
format indata/fastq
- Reference genome(s) in
*.fa
format indata/reference
Optionally, users can provide:
- Reference genome annotation in
*.gff
format indata/annotation
(for feature annotation in IGV report) - A
*.bed
file with genomic regions to ignore for variant calling indata/masked_region
Please ensure that the chromosome names in *.gff
and *.bed
files are the same as in the reference genome.
Input data files are provided in the samples.tsv
table, whose location is inidcated in the config.yml
file. The samplesheet must comply with the following structure:
sample
defines the sample name that will be used throughout the workflow and thus needs to be unique.fastq
provides the path to the sample's*.fastq.gz
file.reference
provides the path to the reference genome*.fa
file (may be the same for several / all samples).annotation
provides the path to the optional reference genome annotation in*.gff
file (may be the same for several / all samples). If no annotation is provided, you must entern/a
!masked_regions
provides the path to the optional*.bed
file for filtering genomic regions (may be the same for several / all samples). If no*.bed
file is provided, you must entern/a
!
sample | fastq | reference | annotation | masked_regions |
---|---|---|---|---|
<sample1> | data/fastq/<fastq1>.fastq.gz | data/reference/<ref1>.fa | data/annotation/<anno1>.gff | data/masked_region/<region1>.bed |
<sample2> | data/fastq/<fastq2>.fastq.gz | data/reference/<ref2>.fa | data/annotation/<anno2>.gff | data/masked_region/<region2>.bed |
... | ... | ... | ... | ... |
<sampleN> | data/fastq/<fastqN>.fastq.gz | data/reference/<refN>.fa | data/annotation/<annoN>.gff | data/masked_region/<regionN>.bed |
Before executing the workflow, you may want to adjust several options and parameters in the default config file config/config.yml
:
- Directories:
indir
: Input directory for all input files,data
by default (see above)outdir
: Output directory (relative to working directory),results
by default
- Sample information:
samples
: Path to samplesheet (relative to working directory),samplesheet/samples.tsv
by defaultlibprepkit
: Kit from ONT used for library preparation, e.g.SQK-NBD114.24
basecalling_model
: Model used for basecalling of raw sequencing data (required for variant calling usingMedaka
), currently supported models are:r1041_e82_400bps_sup_v4.2.0
r1041_e82_400bps_sup_v4.3.0
- Tool parameters:
- The number of cores can be adjusted here for the following tools:
NGMLR
,NanoPlot
,MultiQC
,Medaka
,Clair3
,Sniffles2
, andcuteSV
- You may further adjust the run parameters for the following tools (please refer to the reference provided for more details on run parameters):
Filtlong
: By default, reads are filtered for a minimum length of 500 bp and a mean accuracy of at least 90% (Q10), with 90% of the longest and highest-quailty reads to be kept.Clair3
: Variants are called on all contigs in a haploid-sensitive, ONT-specific mode using--include_all_ctgs --haploid_sensitive --platform ont
.cuteSV
: Variants are called with the suggested parameters for ONT data (--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3
) and the genotyping option enabled (--genotype
).
- The number of cores can be adjusted here for the following tools:
- Filtering of variants:
- The variant quality thresholds can be adjusted here for all four variant callers
remove_common_variants
: IfTrue
, variants which have been identified in all samples with the same reference genome by one tool are filtered out. This is helpful in case all samples derive from a strain, whose genome sequence already differs from the used reference sequence. IfFalse
, all variants are reported.
- Reporting options:
igv_region_length
: Neighboring variants with a maximum bp distance indicated here [1 by default] are reported in one region in the IGV variant report. Increasing this parameter will reduce the file size of the resulting IGV HTML report, if hotspots / regions with many variants exist in a sample.
To run the workflow from the command line, change the working directory:
cd /path/to/snakemake-ont-bacterial-variants
Before running the entire workflow, you may perform a dry run using:
snakemake --dry-run
To run the complete workflow with your files using conda
, execute the following command (the definition of the number of cores is mandatory):
snakemake --cores 10 --use-conda
You may also run the workflow on the provided test data using:
snakemake --cores 10 --use-conda --directory .test
The most important output files and reports are found in variant_reports
for each sample in a separate sub-directory variant_reports/<sample>/
:
<sample>_overview.html
: Custom overview HTML report comprising read and mapping statistics as well as identified variants.<sample>_IGV.html
: HTML report with alignment data for all variants listed in<sample>_overview.html
(generated withigv-reports
).<sample>.<tool>.vcf
: File with all variants for each tool used (after filtering).
Further, variants_reports
contains MultiQC.html
, an interactive HTML report generated by MultiQC
with read statistics on raw, filtered and aligned reads.
Log files from the generation of above reports can be found in variant_reports/logs/
:
<sample>.collect_vcfs.log
: Copying of variant files to above destinations<sample>.igv_reports.log
: Generation of IGV HTML report<sample>.prepare_vcfs.log
: Merging of neighboring variants in genomic regions forigv-reports
<sample>.report.log
: Generation of final overview HTML report using R Markdown
In addition, the workflow generates the following output files in the corresponding directories:
filtered_reads
<sample>.fastq.gz
: Length- and quality-filtered readslogs/<sample>.log
: Filtering log file(s)
mapping
<sample>.bam
: Alignment file of filtered reads to sample-specific reference genome (produced byNGMLR
)<sample>.bam.bai
: Alignment index filelogs/<sample>.*.log
: Log files forNGMLR
,samtools
, andbedtools
(calculation of genome coverage and read end distributions [as temporary files only])
qc
Read statistics on all raw, filtered and aligned reads are found here (generated with NanoPlot
):
raw_reads/<sample>/
: Directory with output files with read statistics on all raw reads (log files infiltered_reads/logs/<sample>.log
).filtered_reads/<sample>/
: Directory with output files with read statistics on all filtered reads (log files infiltered_reads/logs/<sample>.log
).aligned_reads/<sample>/
: Directory with output files with read statistics on all aligned reads (log files inaligned_reads/logs/<sample>.log
).
The raw output from MultiQC
- based on above NanoPlot
results - is located in the directory multiqc
, containing the interactive HTML report multiqc/multiqc_report.html
.
SNV
For both tools (medaka
and clair3
):
<tool>/<sample>/
: Directory with all raw output files from variant calling tool<tool>/<sample>.vcf
: File with identified variants<tool>/<sample>.filtered.vcf
: File with variants filtered by variant quality, overlap with other samples from same reference genome (only ifremove_common_variants: True
inconfig.yml
), and by genomic region if provided (comparemasked_region
in samplesheet)<tool>/logs/<sample>.*
: Standard error (stderr
) and standard output (stdout
) files for the identification of variants using eithermedaka
orclair3
<tool>/logs/<sample>_filtering.log
: Log file produced byVCFtools
for final filtering step to generate<tool>/<sample>.filtered.vcf
If remove_common_variants: True
in config.yml
, the following files are additionally produced:
<tool>/common_variants.vcf
: File with variants called by the tool which are shared by all samples with the same reference genome<tool>/common_variants.txt
: Tab-delimited file deduced from<tool>/common_variants.vcf
listing contig and variant start position only<tool>/logs/<sample>_indexing.*
: Standard error (stderr
) and standard output (stdout
) for indexing of*.vcf
files usingbcftools
<tool>/logs/bcftools.*
: Standard error (stderr
) and standard output (stdout
) for identification of shared variants across samples uisngbcftools
For clair3
, data from the downloaded model is additionally found in the directory clair3/model/
.
SV
For both tools (cutesv
and sniffles2
):
<tool>/<sample>.vcf
: File with identified variants<tool>/<sample>.filtered.vcf
: File with variants filtered by variant quality, overlap with other samples from same reference genome (only ifremove_common_variants: True
inconfig.yml
), and by genomic region if provided (comparemasked_region
in samplesheet)<tool>/logs/<sample>.log
: Log file for the identification of variants using eithercutesv
orsniffles2
<tool>/logs/<sample>_filtering.log
: Log file produced byVCFtools
for final filtering step to generate<tool>/<sample>.filtered.vcf
If remove_common_variants: True
in config.yml
, the following files are additionally produced:
<tool>/common_variants.vcf
: File with variants called by the tool which are shared by all samples with the same reference genome<tool>/common_variants.txt
: Tab-delimited file deduced from<tool>/common_variants.vcf
listing contig and variant start position only<tool>/logs/<sample>_indexing.*
: Standard error (stderr
) and standard output (stdout
) for indexing of*.vcf
files usingbcftools
<tool>/logs/bcftools.*
: Standard error (stderr
) and standard output (stdout
) for identification of shared variants across samples uisngbcftools
For cutesv
, all raw output files from the initial variant calling can be found in the directory cutesv/<sample>/
.
Dr. Thomas Fabian Wulff
- Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
- ORCID: https://orcid.org/0000-0001-7166-0899
Please also visit the MPUSP GitHub page at https://github.com/MPUSP for further information on this workflow and other projects!
The following software and tools have been used in this workflow:
BCFtools
Danecek, P., Bonfield, J.K., Liddle, J. et al. Twelve years of SAMtools and BCFtools. GigaScience 10(2), giab008, 2021. (https://doi.org/10.1093/gigascience/giab008)
bedtools
Quinlan, A.R., Hall, I.M., BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841-842, 2010. ( https://doi.org/10.1093/bioinformatics/btq033)
Clair3
Zheng, Z., Li, S., Su, J. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science 2, 797–803, 2022. (https://doi.org/10.1038/s43588-022-00387-x)
cuteSV
Jiang, T., Liu, Y., Jiang, Y. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biology 21, 189, 2020. (https://doi.org/10.1186/s13059-020-02107-y)
Filtlong
igv-reports
MultiQC
Ewels, P., Magnusson, M., Lundin, S., et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19) 3047–3048, 2016. (https://doi.org/10.1093/bioinformatics/btw354)
NanoPlot
Coster, W.D., Rademakers, R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics 39(5), btad311, 2023. (https://doi.org/10.1093/bioinformatics/btad311)
NGMLR
Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods 15, 461–468, 2018. (https://doi.org/10.1038/s41592-018-0001-7)
SAMtools
Li, H., Handsaker, B., Wysoker, A. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079, 2009. (https://doi.org/10.1093/bioinformatics/btp352)
Snakemake
Mölder, F., Jablonski, K.P., Letcher, B. et al. Sustainable data analysis with Snakemake. F1000Research 10:33, 2021. (https://doi.org/10.12688/f1000research.29032.2)
Sniffles2
Smolka, M., Paulin, L.F., Grochowski, C.M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology 2024. (https://doi.org/10.1038/s41587-023-02024-y)
VCFtools
Danecek, P., Auton, A., Abecasis, G. et al. The variant call format and VCFtools. Bioinformatics 27(15), 2156–2158, 2011. (https://doi.org/10.1093/bioinformatics/btr330)