Parallel merging, squaring off and ensemble calling for genomic variants. Provides a general framework for combining multiple variant calls, whether from single individuals, batched family calls, or multiple approaches on the same sample. Splits inputs at shared genomic regions without variants, so that the smaller regions containing variant calls can be processed independently. Handles:
- Merging multiple samples, called independently, into a single final VCF file.
- Squaring off multiple samples, called independently, by recalling at all identified genomic positions.
- Ensemble calling for samples called with multiple variant callers. Uses an intersection based approach, selecting variants present in at least n callers, where you choose n.
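The region splitting described above can be illustrated with a minimal sketch. It is not the tool's implementation: the function name `split_regions`, the `min_gap` threshold, and the use of plain position lists for a single chromosome are all assumptions made for illustration.

```python
def split_regions(positions_by_input, min_gap=100):
    """Group variant positions into blocks separated by variant-free gaps.

    positions_by_input: one list of variant positions per input file,
    all on the same chromosome (hypothetical representation).
    min_gap: assumed minimum variant-free distance needed to split.
    """
    # Pool positions from every input: a gap only counts as shared
    # if no input has a variant inside it.
    pooled = sorted({pos for positions in positions_by_input for pos in positions})
    blocks = []
    for pos in pooled:
        if blocks and pos - blocks[-1][1] <= min_gap:
            # Close enough to the previous variant: extend the current block.
            blocks[-1] = (blocks[-1][0], pos)
        else:
            # A large variant-free gap: start a new independent block.
            blocks.append((pos, pos))
    return blocks


if __name__ == "__main__":
    print(split_regions([[100, 150, 5000], [120, 5100]], min_gap=1000))
```

Each resulting block contains all the variants from every input within its span, so blocks can be handed to separate workers without any dependency between them.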
This is a work in progress.
The executable bcbio-variation-recall bash script contains a ready to run jar file. Pre-built distributions will be available. To create a development version, run make and the executable will be available in the bin directory. This requires Leiningen.
Merge multiple VCF files together, running in parallel over genomic regions.
Usage: bcbio-variation-recall merge [options] out-file ref-file vcf-files
  out-file:  VCF (or bgzipped VCF) file to write merged output to
  ref-file:  FASTA format genome reference file
  vcf-files: VCF files to merge. Can be specified on the command line
             or as a text file containing paths to files for processing
Options:
  -c, --cores CORES    1  Number of cores to use
  -r, --region REGION     Genomic region to subset, in samtools format (chr1:100-200) or BED file
  -h, --help
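As an illustration of the argument order in the usage text above, here is a small sketch that assembles a merge command line from Python. The helper name `merge_command` and all file names are hypothetical. Only the `merge` subcommand and the `--cores` and `--region` options come from the usage text.

```python
def merge_command(out_file, ref_file, vcf_files, cores=1, region=None):
    """Build the argument list for a merge run: options first, then positionals."""
    cmd = ["bcbio-variation-recall", "merge", "--cores", str(cores)]
    if region:
        # Region in samtools format (chr1:100-200) or a BED file path.
        cmd += ["--region", region]
    # Positional arguments follow the documented order: out-file ref-file vcf-files.
    return cmd + [out_file, ref_file, *vcf_files]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = merge_command("merged.vcf.gz", "ref.fa",
                        ["sample1.vcf.gz", "sample2.vcf.gz"],
                        cores=4, region="chr1:100-200")
    print(" ".join(cmd))
```

On a machine where the tool is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.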
Perform squaring off for a set of called VCF files, recalling at no-call positions in each sample.
Usage: bcbio-variation-recall square [options] out-file ref-file [<vcf, bam, cram, or list files>]
  out-file:    VCF (or bgzipped VCF) file to write merged output to
  ref-file:    FASTA format genome reference file
  <remaining>: VCF files to recall and BAM or CRAM files for each sample. Can be specified
               on the command line or as text files containing paths to files
               for processing. VCFs can be single or multi-sample and BAM/CRAMs can be in
               any order but each VCF sample must have an associated BAM/CRAM file to recall.
               The sample names in the VCF file must match read groups in the BAM files.
Options:
  -c, --cores CORES    1          Number of cores to use
  -m, --caller CALLER  freebayes  Calling method to use: samtools, freebayes, platypus
  -r, --region REGION             Genomic region to subset, in samtools format (chr1:100-200) or BED file
  -h, --help
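Because squaring off needs reads for every VCF sample, it can help to check the name matching before starting a run. The sketch below works on header text, such as the output of `bcftools view -h` and `samtools view -H`. It assumes that "match read groups" refers to the `SM` tag of `@RG` header lines, which the usage text does not state. All function names are hypothetical.

```python
def vcf_samples(vcf_header):
    """Return sample names from the #CHROM line of a VCF header."""
    for line in vcf_header.splitlines():
        if line.startswith("#CHROM"):
            # The first nine columns are fixed; samples follow.
            return line.split("\t")[9:]
    return []


def read_group_samples(sam_header):
    """Return the set of SM values found on @RG lines of a SAM/BAM header."""
    samples = set()
    for line in sam_header.splitlines():
        if line.startswith("@RG"):
            for field in line.split("\t")[1:]:
                if field.startswith("SM:"):
                    samples.add(field[3:])
    return samples


def samples_missing_reads(vcf_header, sam_headers):
    """List VCF samples with no matching read group sample in any alignment header."""
    available = set()
    for header in sam_headers:
        available |= read_group_samples(header)
    return [s for s in vcf_samples(vcf_header) if s not in available]


if __name__ == "__main__":
    vcf = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA12891"
    bam = "@HD\tVN:1.5\n@RG\tID:rg1\tSM:NA12878\tPL:illumina"
    print(samples_missing_reads(vcf, [bam]))
```

An empty result means every VCF sample has at least one alignment file carrying its name.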
Ensemble calling for samples: combine multiple VCF caller outputs into a single callset.
Usage: bcbio-variation-recall ensemble [options] out-file ref-file [<vcf-files or list-files>]
  out-file:    bgzipped VCF file to write merged output to
  ref-file:    FASTA format genome reference file
  <remaining>: VCF files to include for building a final ensemble callset.
               Specify on the command line or as text files containing paths to files.
               VCFs can be single or multi-sample.
               The input order of VCFs determines extraction preference in the final ensemble output.
Options:
  -c, --cores CORES      1  Number of cores to use
  -n, --numpass NUMPASS  2  Number of callers a variant should be present in to pass
      --names NAMES         Comma separated list of names corresponding to VCFs, for annotating output
      --nofiltered          Remove filtered variants before performing ensemble calls
  -h, --help
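The intersection approach behind `--numpass` can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: variants are keyed by a hypothetical `(chrom, pos, ref, alt)` tuple, and "extraction preference" is read as taking each passing record from the earliest input that contains it.

```python
def ensemble(callsets, numpass=2):
    """Keep variants found in at least `numpass` callsets.

    callsets: list of dicts mapping (chrom, pos, ref, alt) -> record,
    given in preference order (hypothetical representation).
    """
    support = {}
    preferred = {}
    for calls in callsets:
        for key, record in calls.items():
            support[key] = support.get(key, 0) + 1
            # Assumed meaning of input order: the earliest input supplies the record.
            preferred.setdefault(key, record)
    passing = sorted(key for key, count in support.items() if count >= numpass)
    return [preferred[key] for key in passing]


if __name__ == "__main__":
    callsets = [
        {("1", 100, "A", "G"): "freebayes:1:100", ("1", 200, "C", "T"): "freebayes:1:200"},
        {("1", 100, "A", "G"): "samtools:1:100", ("1", 300, "G", "A"): "samtools:1:300"},
        {("1", 300, "G", "A"): "platypus:1:300"},
    ]
    print(ensemble(callsets, numpass=2))
```

Raising `numpass` trades sensitivity for specificity: with three callers, `numpass=3` keeps only variants every caller agrees on.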
External software provides the underlying algorithms. This tool is a framework for pulling them together. The following command line programs need to be on your path:
- freebayes
- vcflib
- GATK (MIT licensed framework)
- vt
- bedtools
- bcftools 1.x
- samtools 1.x
- sambamba
- platypus
The bcbio-nextgen pipeline installs all this software automatically.
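For a manual setup, a quick check of what is visible on the path can save a failed run. The sketch below uses Python's standard `shutil.which`. The executable names are assumptions: vcflib and GATK are left out because they are not launched through a single command of that name, and the others may be installed under different names on some systems.

```python
import shutil

# Assumed executable names for the programs listed above.
ASSUMED_EXECUTABLES = [
    "freebayes", "vt", "bedtools", "bcftools",
    "samtools", "sambamba", "platypus",
]


def missing_programs(names=ASSUMED_EXECUTABLES):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed executables found.")
```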
The code is freely available under the MIT license.