Assembly pipeline used by the Brotman Baty Institute to build consensus genomes for sequenced samples.
This pipeline was originally written for assembly of Seattle Flu Study flu genomes, but has been modified to support assembly of SARS-CoV-2, RSV, and flu genomes sequenced with molecular inversion probe (MIP) technology (CRIseq).
Adapted from Louise Moncla's Illumina pipeline for influenza SNP calling.
Install dependencies via Conda by running:
conda env create -f envs/seattle-flu-environment.yaml
Once installed, the environment needs to be activated by running:
conda activate seattle-flu
The pipeline expects 2 or more FASTQ files for each individual sample, with Read 1 (R1) and Read 2 (R2) from 1 or more lanes. Merging across lanes will be performed automatically.
FASTQs should already be trimmed of their 5’ MIP targeting arms, which can be performed during demultiplexing with bcl2fastq.
Expected filename format: 318375_S12_L001_R1_001.fastq.gz, where 318375 is the NWGC sample ID.
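For example, a sample sequenced on 2 lanes would provide 4 files:
318375_S12_L001_R1_001.fastq.gz
318375_S12_L001_R2_001.fastq.gz
318375_S12_L002_R1_001.fastq.gz
318375_S12_L002_R2_001.fastq.gz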
Currently, config/config.criseq.json has the most up-to-date configuration needed to run assembly.
Recommended file structure:
Create a work directory, which contains a subdirectory with input fastq files.
The output directory can be specified in the config file and will be created as another subdirectory in the work directory.
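For example, assuming the input and output subdirectories are named fastq and assembly (both names are placeholders set in the config):
work_dir/
    fastq/      # input_subdir: contains the input FASTQ files
    assembly/   # assembly_subdir: created by Snakemake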
Edit the following parameters in the config file:
- work_dir: path to the work directory, which will contain both input and output subdirectories
- input_subdir: relative path from the work directory to the subdirectory containing input FASTQs
- assembly_subdir: relative path from the work directory to the output subdirectory (will be created by Snakemake if it does not already exist)
- ignored_samples: an object containing keys that specify samples to be ignored (optional)
- sample_reference_pairs: an object containing specific sample/reference pairs, with the keys referring to the samples and an array of references as the values (optional; if left empty, the pipeline will attempt to create an assembly for every reference in reference_viruses for each sample)
- reference_viruses: an object containing keys that specify the references to be used
- params: parameters for the various tools used in the pipeline
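A minimal sketch of how these keys fit together (all paths, sample IDs, and reference names below are placeholders; see config/config.criseq.json for the actual schema and params values):
{
    "work_dir": "/path/to/work_dir",
    "input_subdir": "fastq",
    "assembly_subdir": "assembly",
    "ignored_samples": {},
    "sample_reference_pairs": {
        "318375": ["h1n1pdm"]
    },
    "reference_viruses": {
        "h1n1pdm": {}
    },
    "params": {}
}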
The Snakemake pipeline generates one consensus FASTA for each assembly. To run the Snakemake pipeline:
snakemake --configfile config/config.criseq.json -k --cores <number_of_cores> --rerun-incomplete --snakefile Snakefile-cri
To combine all assemblies into one FASTA per pathogen and gather whole genome sequence metrics with Picard, run these 3 scripts manually after Snakemake. The first 2 steps should be run in the seattle-flu conda environment, and the last step should be run in base conda, with Picard installed locally.
# conda activate seattle-flu (if not already activated)
bash scripts/fasta_to_single_line.sh
bash scripts/sort_index_bams.sh
conda deactivate
bash scripts/picard.sh
Uses bowtie2-build to build a Bowtie2 index for each reference genome. These indexes are later used in the mapping step.
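For example (the reference path and index prefix are placeholders):
bowtie2-build references/h1n1pdm.fasta references/h1n1pdm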
Concatenates FASTQ files for each sample into 2 files (R1 and R2).
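This is equivalent to something like the following (filenames are illustrative; gzipped FASTQs can be concatenated directly):
cat 318375_S12_L001_R1_001.fastq.gz 318375_S12_L002_R1_001.fastq.gz > 318375_R1.fastq.gz
cat 318375_S12_L001_R2_001.fastq.gz 318375_S12_L002_R2_001.fastq.gz > 318375_R2.fastq.gz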
Trim raw FASTQ files with Trimmomatic, which cuts out adapter/illumina-specific sequences, trims the reads based on quality scores, and removes short reads. Trim MIP 3’ overhangs with NGmerge.
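A representative invocation, a sketch only (the adapter file and thresholds here are illustrative stand-ins for the values under params in the config, not the pipeline's actual settings):
# paired-end trimming: adapter clipping, sliding-window quality trimming, length filter
trimmomatic PE 318375_R1.fastq.gz 318375_R2.fastq.gz \
    318375_R1.trimmed.fastq.gz 318375_R1.unpaired.fastq.gz \
    318375_R2.trimmed.fastq.gz 318375_R2.unpaired.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:5:30 MINLEN:50
# -a runs NGmerge in adapter-removal mode, clipping 3' overhangs based on read overlap
NGmerge -a -1 318375_R1.trimmed.fastq.gz -2 318375_R2.trimmed.fastq.gz -o 318375_noarms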
Generates summary statistics using FastQC. Allows for some quality control checks on raw sequence data coming from high throughput sequencing pipelines.
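For example (the output directory is a placeholder and must already exist):
fastqc 318375_R1.trimmed.fastq.gz 318375_R2.trimmed.fastq.gz -o fastqc_reports/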
Map trimmed reads to reference genome using bowtie2. Outputs a BAM file that represents aligned sequences and a log file that contains the alignment summary.
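A representative invocation (index prefix and filenames are placeholders); the alignment summary that bowtie2 writes to stderr is captured as the log:
bowtie2 -x references/h1n1pdm \
    -1 318375_R1.trimmed.fastq.gz -2 318375_R2.trimmed.fastq.gz \
    2> 318375.h1n1pdm.log | samtools view -b - > 318375.h1n1pdm.bam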
Sort the BAM file into "genome order" using samtools sort.
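For example:
samtools sort 318375.h1n1pdm.bam -o 318375.h1n1pdm.sorted.bam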
Generates statistics for coverage using BAMStats.
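BAMStats is a Java tool; an invocation might look like the following (the jar path and version are assumptions about your local install):
java -jar BAMStats-1.25.jar -i 318375.h1n1pdm.sorted.bam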
Checkpoints allow for data-dependent conditional execution, forcing Snakemake to re-evaluate the DAG at this point.
Determines whether all segments within the reference genome have a coverage summary generated by BAMStats.
Then calculates the minimum number of reads required to meet the minimum coverage depth set for varscan mpileup2snp, and checks the number of mapped reads from the bowtie2 alignment summary. The minimum number of reads required is calculated as:
min_reads = (reference_genome_length * min_cov) / (raw_read_length)
If not all segments are represented or if mapped_reads is less than min_reads, the process stops for this sample/reference pair.
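For example, with a hypothetical 13,500 bp reference genome, a minimum coverage depth of 20x, and 100 bp raw reads: min_reads = (13500 * 20) / 100 = 2700, so the pair would be stopped here if fewer than 2,700 reads mapped.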
This rule is necessary for the checkpoint above to work, because it must send the data down one of two paths. This is the "dead end" where a consensus genome is not generated.
Also a placeholder for how to handle sample/reference pairs that do not meet the min_align_rate threshold. Currently it just outputs the overall align rate of the sample/reference pair so that we can check that it did not meet the minimum.
Generates a pileup for the BAM file using samtools mpileup.
Important flags:
- -a to print out all positions, including those with zero depth (this is necessary for generating the low-coverage BED file later)
- -A to not discard anomalous read pairs
- -Q to set the minimum base quality to consider a read (set to match the minimum in varscan mpileup2snp so that their coverage depths match)
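Putting these flags together, a representative invocation (the -Q value and filenames stand in for the configured settings):
samtools mpileup -a -A -Q 30 -f references/h1n1pdm.fasta 318375.h1n1pdm.sorted.bam > 318375.h1n1pdm.pileup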
Creates a BED file for positions that need to be masked in the consensus genome. Positions need to be masked if they are below the minimum coverage depth.
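One way to derive such a BED file from the pileup, as a sketch assuming a configured minimum depth of 20 (the pipeline's own implementation may differ):
# pileup column 4 is depth; BED coordinates are 0-based, half-open
awk -v min=20 '$4 < min {print $1 "\t" $2-1 "\t" $2}' 318375.h1n1pdm.pileup > 318375.h1n1pdm.low_coverage.bed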
Calls variants (SNPs/indels) from the pileup based on parameters set in the config file using varscan mpileup2cns.
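For example (the thresholds are placeholders for the values under params in the config):
varscan mpileup2cns 318375.h1n1pdm.pileup \
    --min-coverage 20 --min-avg-qual 30 --min-var-freq 0.5 \
    --output-vcf 1 > 318375.h1n1pdm.vcf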
Compress VCF using bgzip, which allows indexes to be built against the file and allows the file to be used without decompressing it.
Creates index for compressed VCF using bcftools index. This index is necessary for creating the consensus genome using the compressed VCF.
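For example:
bgzip 318375.h1n1pdm.vcf          # produces 318375.h1n1pdm.vcf.gz
bcftools index 318375.h1n1pdm.vcf.gz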
Create consensus genome by applying VCF variants to the reference genome using bcftools consensus.
This does not account for coverage, so we pass it the BED file of low-coverage sites with the --mask option to mask these sites with N.
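For example, continuing with the placeholder filenames above:
bcftools consensus -f references/h1n1pdm.fasta \
    --mask 318375.h1n1pdm.low_coverage.bed \
    318375.h1n1pdm.vcf.gz > 318375.h1n1pdm.consensus.fasta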
Edits the FASTA headers to fit the pattern needed for downstream analysis.
Example FASTA header: >SampleID|SampleID-PB2|H1N1pdm|PB2
- Replaces the reference sequence name with the NWGC sample ID, using Perl to perform "lookaround" regex matches
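A simplified sketch of the idea, using a lookbehind to swap the name after > for the sample ID (the pipeline's actual expressions are more involved):
# rewrite the first |-delimited field of each FASTA header in place
perl -pi -e 's/(?<=>)[^|]+/318375/ if /^>/' 318375.h1n1pdm.consensus.fasta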
This is the last rule of the pipeline; it prints out the final result of each sample/reference pair.
The input of this rule differs based on the result of the checkpoint, so this rule dictates the final outcome of each sample/reference pair.