The tools are developed in Perl, bash, Python3, Java and work on the Linux system and require:
Tools | Website | Version |
---|---|---|
Bamtools | https://github.com/pezmaster31/bamtools | bamtools/2.4.0 |
BWA | http://bio-bwa.sourceforge.net | bwa/0.7.12 |
Cutadapt | https://cutadapt.readthedocs.io/en/stable/ | cutadapt/2.10 |
FastQC | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | FastQC/0.11.7 |
GATK V4 | https://software.broadinstitute.org/gatk/ | GenomeAnalysisTK/4.1.9.0 |
GATK V3 | https://software.broadinstitute.org/gatk/ | GenomeAnalysisTK/3.7-0 |
Picard Tools | https://broadinstitute.github.io/picard/ | picard-tools/2.7.0 |
sambamba | https://lomereiter.github.io/sambamba/ | sambamba/0.6.6 |
Samtools | https://github.com/samtools/samtools | samtools/1.2 |
STAR | https://github.com/alexdobin/STAR | STAR/2.5.0b |
VCFHunter | https://github.com/SouthGreenPlatform/VcfHunter | |
Vcftools | https://vcftools.github.io/index.html | vcftools/0.1.14 |
For the installation of the differents software, in is under construction. Be carefull, those pipeline are part of a cluster architecture, so change the path of the software inside the scipt, this is a futur improvement of the pipeline.
Sun Grid Engine (SGE) and SLURM job scheduler concepts are quite similar. All the script are developped to use them in SGE systeme HPC. In order to adapt those script to SLURM , one script is added, but not finish in the SLURM folder.
Table of content
- How to cite
- Introduction
- Genomic Complexity Reduction
- DARTseq
- GBSseq
- RADseq
- RnaSeq
- Markers Schema
- Workflow - Quality control, cleaning, mapping, SNPCalling
- Input raw data
- Read Quality check
- Step a : Mapping reads on the reference
- DNA Data
- RNA Data
- Step b : Variant discovery
- Workflow - Molecular karyotype analysis
- Merge datasets
- Filter SNP dataset
- Split VCF by chromosome
- Generate molecular karyotype
- Authors and acknowledgments
- Contact
If you use the second workflow.
Study using the tools :
Ref image :
The package provided comprised 4 programs listed here: The modèle of submission depends on Sun Grid Engine supports. In order to paralelize the pipeline, a jobarray is used for the 4 scripts.
-
Script Dart GBS Rad seq
- DarT_GBS_Rad_seq_paired_end_fastq_to_Vcf_gvcf_jobarray_Total_GATK4.pl
- DarT_GBS_Rad_seq_single_end_fastq_to_Vcf_gvcf_jobarray_Total_GATK4.pl
-
Script RNA seq
- RNA_seq_paired_end_fastq_to_Vcf_gvcf_jobarray_Total_GATK4.pl
- RNA_seq_single_end_fastq_to_Vcf_gvcf_jobarray_Total_GATK4.pl
Usage : perl <Script-Name> -r reference_fasta -x extension_file_to_treat -cu cultivar"
Parameters :
-r reference_fasta
-x extension_file_to_treat
-cu cultivar
Furtur developpmnent, in order to adapt the script job submission to Slurm (Simple Linux Utility for Resource Management) model, new script will be added to the package.
- Quality reads: Control the quality of the raw Fastq file with FASTQC
Description: In order to verify the quality of data reads, FastQC allows to check the quality score of each base and the presence or absence of the adaptor used to build the library. Adaptor depends on the sequencing technology used. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Trim low-quality base with Cutadapt Description: According to the result of FastQC, Cutadapt [20] trims low quality ends and removes adapters (illumina). Website: http://cutadapt.readthedocs.io/en/stable/guide.html.
Fixed Parameters:
-b AGATCGGAAGAGC (universal sequence for Illumina). Sequence of an adapter that was ligated to the 5’ or 3’ end. The adapter itself is trimmed and anything that follows too if located at 3’ end.
-O 7: Minimum overlap length. If the overlap between the read and the adapter is shorter than LENGTH, the read is not modified. This reduces the no. of bases trimmed purely due to short random adapter matches.
-m 30: Discard trimmed reads that are shorter than 30.
--quality-base=64: Assume that quality values are encoded as ascii (quality + QUALITY_BASE). The default (33) is usually correct, except for reads produced by some versions of the Illumina pipeline, where this should be set to 64. (Default: 33)
-q 20,20: Trim low-quality bases from 5' and/or 3' ends of reads before adapter removal. If one value is given, only the 3' end is trimmed. If two comma-separated cutoffs are given, the 5' end is trimmed with the first cutoff, the 3' end with the second. The algorithm is the same as the one used by BWA (see documentation).
The tool generates a trimmed fastq file (*_cutadapt.fastq.gz) files for each accession.
- Check homogeneity of the paired fastq files
In the paired-end data generated, it is possible to verify if both reads are present in the two fastqc files. This step is optional because mapper software BWA or STAR do not accept if read 1 and 2 are not present. This is done with the perl script compare_fastq_paired_v5.pl.
Description: Align reads on a reference genome (e.g. Musa acuminata ‘DH Pahang’), with BWA [21] for DNA, and STAR [22] for RNA.
DNA Data: Map with BWA with default parameters with BWA-MEM
Different types of genomic data such as DArTSeq, GBS and RADseq can be used (Figure 1).
The tool generates a sam (*_.sam) files for each accession.
Website: http://bio-bwa.sourceforge.net/
RNA Data: Mapping with STAR in 2-pass mode
Description: In the 2-pass mapping job, STAR will map the reads twice. In the 1st pass, the novel junctions will be detected and inserted into the genome indices. In the 2nd pass, all reads will be re-mapped using annotated (from the GTF file) and novel (detected in the 1st pass) junctions. While this procedure doubles the run-time, it significantly increases sensitivity to novel splice junctions. In the absence of annotations, this option is strongly recommended.
Reference preparation :
Parameter for STAR Indexe Reference :
The basic options to generate genome indices using STAR are as follows:
--runThreadN: number of threads
--runMode: genomeGenerate mode
--genomeDir: /path/to/store/genome_indices
--genomeFastaFiles: /path/to/FASTA_file
--sjdbGTFfile: /path/to/GTF_file
--sjdbOverhang: readlength -1
STAR
--runMode genomeGenerate
--runThreadN 1
--genomeDir Genome_Index_STAR
--genomeFastaFiles Reference_assembly.fasta
--sjdbGTFfile Reference.gff3
--sjdbGTFtagExonParentTranscript Parent
--sjdbOverhang 100
$star_path/STAR
--genomeDir $genome_indexes_dir
--readFilesIn $forward_without_ext.cutadapt.fastq $reverse_without_ext.cutadapt.fastq
--outSAMunmapped Within
--outFileNamePrefix $filenm_root
--outSAMmapqUnique 255
--twopassMode Basic
--quantMode GeneCounts"
The tool generates a folder for each accession, (names filled in column 3 "genome_name") filled in the configuration file, which contained the SAM files of aligned reads and a .final.out file of mapping statistics for each library. In addition, a (--prefix) folder containing a mapping statistics file (--prefix + mapping.tab) for all accession is generated.
Website: https://github.com/alexdobin/STAR
- Add read group and (accession name from the fq.gz filename) sort BAM with Picard Tools
Description: This step replaces the reads Groups which describe the reads mapped on the reference, the sequencing technology, samples names, and library number are added.
Fixed Parameters:
ID = Read Group identifier (e.g. FLOWCELL1.LANE1)
PU = Platform Unit (e.g. FLOWCELL1.LANE1.UNIT1)
SM = Sample (e.g. DAD)
PL = Platform technology used to produce the read (e.g. ILLUMINA)
LB = DNA library identifier (e.g. LIB-DAD-1)
Website: https://broadinstitute.github.io/picard/
The tool generates a bam (*_rmdup.bam) files for each accession with the RG (Read Group) modified.
- Mark duplicate reads and index BAM with MarkDuplicates from PicardTools
Description: PCR duplicate removal, where PCR duplicates arise from multiple PCR products from the same template molecule binding on the flow cell. These are removed because they can lead to false positive variant calls. Sort the BAM file and mark the duplicated reads. This step is done on RNAseq, not on DarTseq, GBS or RADseq. Website: https://broadinstitute.github.io/picard/
The tool generate a bam (_rmdup.bam) files for each accession with duplicated reads removed. In addition, a file named (--prefix + rmdupstat.tab) file containing duplicate statistics for each accession was generated in the (--prefix) folder.
- Index BAM with Samtools
Description: This step reorder the bam file according to the genome index position. Website: http://samtools.sourceforge.net/ The tool generates a reordered (_reorder.bam) and bai (_reorder.bai) files for each accession,
- Split ‘N CIGAR’ reads with SplitNCigarReads from GATK
Description: Splits reads that contain Ns in their cigar string (e.g. spanning splicing events in RNAseq data). Identifies all N cigar elements and creates k+1 new reads (where k is the number of N cigar elements). The first read includes the bases that are to the left of the first N element, while the part of the read that is to the right of the N (including the Ns) is hard clipped and so on for the rest of the new reads. Used for post-processing RNA reads aligned against the full reference. Website : https://gatk.broadinstitute.org/hc/en-us/articles/360036727811-SplitNCigarReads
The tool generates a split and trimmed (on splicing sites) bam (_trim.bam) and bai (_trim.bai) files for each accession.
- Realign indels with IndelRealigners from GATK (2 steps)
Description: The mapper BWA has some difficulties to manage the alignment close to the Indel. The step is not necessary with HaplotypeCaller but is necessary with UnifierGenotyper. The tool generates a bam (_realigned.bam) and bai (_realigned.bai) files realigned around indel for each accession. This step is done with the GATK version 3.8. Carefull it is the GATK version 3.
The tool generates a bam (_realigned.bam) and bai (_realigned.bai) files realigned around indel for each accession.
- Create a VCF file with HaplotypeCaller from GATK
Description: The HaplotypeCaller is able to call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. Whenever the program encounters a region showing variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call.
Fixed Parameters:
--genotyping_mode DISCOVERY
--variant_index_type LINEAR
--variant_index_parameter 128000
Website : https://gatk.broadinstitute.org/hc/en-us/articles/360036712151-HaplotypeCaller
- Filter VCF file with VariantFiltration from GATK
Description: The VariantFiltration is able to filter the variant position acording parameters. Those filter depend of the type of data and the polyploidie.
Fixed Parameters:
--filterExpression 'MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)'
--filterExpression 'QD < 1.5'
--filterExpression 'DP < 15'
Website : https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration
Description: Get one FASTQ file ready for SNP calling per accession from raw sequence data (fastq.gz files).
USAGE: <Technique>_<Type_of_data>_fastq_to_vcf_job_array_Total_GATK4.pl -r ref.fasta -x fq.gz -cu accession
Fixed Parameters:
-r (string): reference FASTA filename
-x (string): file extension (fq.gz)
-cu (string): cultivar
SNP variants can be assigned to the ancestral genomes in order to plot the genome allele coverage ratio. The method has been developed within the VCFHunter software, https://github.com/SouthGreenPlatform/VcfHunter and can be used with the following procedure
The package provided comprised 8 programs listed here:
-
Script Work with gVcf and Vcf
- CombineVcf.pl
- Gvcf2Vcf.pl
-
Script VcfHunter
- vcfFilter.1.0.py
- vcf2allPropAndCov.py
- Merge datasets
Script name: CombineVcf.pl
Description: The script merges and prepares the final VCF file, this step combines multiple VCF and performs pre-filtering using GATK. The samples to analyze are combined to the reference samples to allow allele assignation. Reference samples are representative genotypes that are relevant to the identification of your samples (i.e. acuminata or balbisiana without admixture). Whenever necessary, such SNP datasets can be downloaded on MGIS (https://www.crop-diversity.org/mgis/gigwa with RADseq_ABB_AB datasets).
USAGE: perl CombineVcf.pl -r reference_fasta -p output_prefix -x extension_file_to_treat
Fixed Parameters:
-r (string): reference FASTA filename
-p (string): VCF output prefix
-x (string): file extension (VCF)
Script name: Gvcf2Vcf.pl
Description: The script merges and prepares the final VCF file, this step combines multiple gVcf and performs pre-filtering using GATK4. The samples to analyze are combined to the reference samples to allow allele assignation. Reference samples are representative genotypes that are relevant to the identification of your samples (i.e. acuminata or balbisiana without admixture). Whenever necessary, such SNP datasets can be downloaded on MGIS (https://www.crop-diversity.org/mgis/gigwa with RADseq_ABB_AB datasets).
USAGE : perl gVCF2VCF.pl -r <reference fasta file assembly> -p <file prefix> -v <version number eg. 4.0, 4.1> -x <extension gvcf> -d <path directory>
Fixed Parameters:
-r (string): reference FASTA filename
-p (string): gVcf output prefix
-v (string): version of GATK
-x (string): file extension (gVCF)
-d (string): path directory
- Filter SNP dataset Description: Filter VCF file based on most common parameters such as the coverage, missing data, MAF (minor allele frequency). The tool keeps bi-allelic sites and removes mono-allelic, tri-allelic, tetra-allelic sites.
USAGE: python3 vcfFilter.1.0.py \
--vcf file.prefiltered.vcf \
--prefix file.filtered \
--MinCov 8 \
--MaxCov 200 \
--MinAl 3 \
--MinFreq 0.05 \
--nMiss 50 \
--names All_names.tab \
--RmAlAlt 1:3:4:5:6:7:8:9:10 \
--RmType SnpCluster
Website: https://github.com/SouthGreenPlatform/VcfHunter/blob/master/tutorial_DnaSeqVariantCalling.md
Fixed Parameters:
--vcf: A standard vcf file
--names: A one column file containing accessions to treat.
--outgroup: (optional) A one column file containing accession names that will not be used for filtering but will remain in the
output file.
--RmType: (optional) Variant status to filter out (several values can be passed in this case they should be separated by ",").
Possible values:
*Values which can be found in the FILTER column: PASS, DP_FILTER, QD_FILTER, SnpCluster,
*Other values: INDELS, SNP, AUTAPO (accession specific variant site).
--RmAlAlt: (optional) Number of alleles at the site in filtered accessions to remove the variant site (several values can be
passed and should be separated by ":"). Values: 1,2,3,...,n
--MinCov: Minimal coverage by accession to keep genotype calling (integer). If the value is lower, genotype will be converted to
unknown for the concerned accession. [Default: 10]
--MaxCov: Maximal coverage by accession to keep genotype calling (integer). If the value is greater, genotype will be converted to
unknown for the concerned accession. [Default: 1000]
--MinFreq: Minimal allele frequency to keep genotype calling (float). If the value is lower, genotype will be converted to unknown
for the concerned accession. [Default: 0.05]
--MinAl: Minimal allele coverage by accession to keep genotype calling (integer). If the value is lower for at least one allele,
genotype will be converted to unknown for the concerned accession. [Default: 3]
--nMiss: Maximal number of missing genotype in a line to keep the line (integer). [Default: 0]
--prefix: The prefix for output files. [Default: WorkOnVcf]
- Split VCF by chromosome
Description: Generate a VCF file for each chromosome with VcfTools [23], in order to obtain a representation (chromosome Painting) of the SNP position along each chromosome.
USAGE: vcftools --vcf file_filtered.vcf --chr chr01 --recode --out batchall_filt_chr01
Parameters:
--chr (string): generate a VCF from a given chromosome.
--recode: generate a new VCF file.
--out (string): output file name
- Generate molecular karyotype
Description: The program allows to perform a chromosome painting for all chromosomes of a given accession. Script name: vcf2allPropAndCov.py
USAGE: python3 vcf2allPropAndCov.py --conf <chromosomes.conf> --origin <origin.conf> --acc <sample_name> --ploidy 2
Parameters:
--conf: Conf file containing vcf location (one per chromosome or a single vcf for all chromosomes),
--origin: A 2 column file containing accession name (col1), origin/group (Col2),
--acc: Accession to work with,
--ploidy: Accession ploidy (integer),
--NoMiss: No missing data are allowed in accessions used to attribute alleles to group,
--all: Allele should be present in all accessions of the group.
The tool generates the 4 following files:
_AlleleOriginAndRatio.tab is a file describing each grouped allele, its origin and the proportion of reads having this allele at the studied position in the accession.
_stats.tab is a file reporting statistics on SNP sites used, sites where an allele is attributed to each group and alleles number attributed to each group in the accession.
_Cov.png is a figure showing read coverage along chromosomes (see figure below for interpretation).
_Ratio.png: is a figure showing grouped allele read ratio along chromosomes (see figure below for interpretation).
This program perform two things based on a vcf. 1) It plots for an accession, the allele coverage along its chromosomes. 2) It identify, based on known ancestral accessions in the vcf, the alleles specific to each group and plot the alleles proportion at a site in the accession along chromosomes.
This work is a collaborative work between Catherine Breton, Yann Huber, Mathieu Rouard with the participation, and the use of scripts from Guillaume Martin (CIRAD) who develops and maintains VCFHunter.
Catherine Breton, Alliance of Bioversity International and CIAT Europe ([email protected])
The Alliance of Bioversity International and the International Center for Tropical Agriculture (CIAT) delivers research-based solutions that harness agricultural biodiversity and sustainably transform food systems to improve people’s lives in a climate crisis. The Alliance is part of CGIAR, a global research partnership for a food-secure future.
https://alliancebioversityciat.org/
https://www.bioversityinternational.org/
https://www.ciat.cgiar.org