Skyhawk: An Artificial Neural Network-based discriminator for validating clinically significant genomic variants
Contact: Ruibang Luo
Email: [email protected]
git clone https://github.com/aquaskyline/Skyhawk.git
cd Skyhawk
curl http://www.bio8.cs.hku.hk/skyhawkModels.tbz | tar -jxf -
cd skyhawk/testingData
python ../skyhawk/validateVar.py \
--chkpnt_fn ../trainedModels/illumina-novoalign-2500-tspcrfree-hg001+hg002+hg003+hg004+hg005-hg38/learningRate1e-3.epoch100.learningRate1e-4.epoch200 \
--ref_fn chr21.fa.gz \
--bam_fn chr21:14069696-14269696.bam \
--vcf_fn chr21:14069696-14269696.vcf.gz \
--val_fn validation.out
less validation.out
With the increasing throughput and reliability of sequencing technologies in the recent years, it is getting common that medical doctors rely on sequencing to make better diagnostics for cancers and rare diseases. Among the interpretable results that sequencing can provide, genetic variants that have been reported in renowned databases such as Clinvar and HGMD, with evidence showing that they associate with certain symptoms and therapies, are considered as actionable genetic variant candidates. However, even though the state-of-the-art variant callers achieve better precision and sensitivity, it is still not uncommon for the variant callers to produce false positive variants with somehow a pretty high probability, which makes them unable to tell apart from the true positives. The situation gets even worse when the callers are set to favor sensitivity over precision, which is often the case in clinical practices. The false positives variants, if not being sanitized, will lead to a spurious clinical diagnosis. Instead of relying only on what a variant caller tells, clinical doctors usually verify the correctness of actionable genetic variants by eyes, with the help of IGV
or SAMtools tview
. 'Skyhawk' was designed to expedite the process and save the doctors from eye checking the variant, using a deep neural network based "eye", trained with millions of samples on how a true variant "looks like" in different settings.
Skyhawk requires Python 2.7.
Make sure you have Tensorflow ≥ 1.0.0 installed, the following commands install the lastest CPU version of Tensorflow:
pip install tensorflow
pip install blosc
pip install intervaltree
pip install numpy
To check the version of Tensorflow you have installed:
python -c 'import tensorflow as tf; print(tf.__version__)'
Skyhawk runs on CPU. Although Skyhawk can benefit from using GPU, the speed up is insignificant. This is because the feed-forward network computation is light compare with other tasks including parsing reads from a BAM file and formatting the results into a VCF file. We suggest disabling GPU temperarily on a GPU-enabled system when running Skyhawk with command export CUDA_VISIBLE_DEVICES="-1"
.
Without a change to the code, using PyPy python interpreter on Skyhawk modules including dataPrepScripts/ExtractVariantCandidates.py
and dataPrepScripts/CreateTensorSites.py
gives a 5-10 times speed up. Pypy python interpreter can be installed by apt-get, yum, Homebrew, MacPorts, etc. If you have no root access to your system, the official website of Pypy provides a portable binary distribution for Linux. Following is a rundown extracted from Pypy's website (pypy-5.8 in this case, you might be working on a newer version) on how to install the binaries.
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.8-1-linux_x86_64-portable.tar.bz2
tar -jxf pypy-5.8-1-linux_x86_64-portable.tar.bz2
cd pypy-5.8-linux_x86_64-portable/bin
./pypy -m ensurepip
./pip install -U pip wheel intervaltree blosc
# Use pypy as an inplace substitution of python to run the scripts in dataPrepScripts/
If you can use apt-get or yum in your system, please install both pypy
and pypy-dev
packages. And then install the pip for pypy.
sudo apt-get install pypy pypy-dev
wget https://bootstrap.pypa.io/get-pip.py
sudo pypy get-pip.py
sudo pypy -m pip install blosc
sudo pypy -m pip install intervaltree
Skywawk will fallback to using python if pypy intepreter doesn't exist. It's OK to run Skyhawk without pypy, albeit it runs a few times slower (13 minutes vs. 3 minutes on 50k variants).
Pypy is an awesome Python JIT intepreter, you can donate to the project.
wget 'http://www.bio8.cs.hku.hk/testingData.tar'
tar -xf testingData.tar
python ./skyhawk/validateVar.py \
--chkpnt_fn ./trainedModels/illumina-novoalign-2500-tspcrfree-hg001+hg002+hg003+hg004+hg005-hg38/learningRate1e-3.epoch100.learningRate1e-4.epoch200 \
--ref_fn ../testingData/chr21/chr21.fa \
--bam_fn ../testingData/chr21/chr21.bam \
--vcf_fn ../testingData/chr21/chr21.vcf \
--thread 4 \
--val_fn validationOutput.txt
less -S validationOutput.txt
Skyhawk usually takes less than a minute to validate ten thousand variants. In clinical context, especially when using Whole Exome Sequencing, the number variants to be validated would seldom exceed ten thousand. But if you use Skyhawk on million of variants for general variant filtering, you will need to split the run into chromosomes and run them in parallel.
# Index the input VCF for random retrival
bgzip input.vcf
tabix -fp vcf input.vcf.gz
# Split the input VCF into chromosomes, please use appropriate chromosome names (w/ or w/o the "chr" prefix)
for i in {1..22} X Y; do tabix -pvcf input.vcf.gz chr$i > chr$i.vcf; done
# Run validateVar.py in parallel
for i in {1..22} X Y; do \
python ./skyhawk/validateVar.py \
--chkpnt_fn ./trainedModels/illumina-novoalign-2500-tspcrfree-hg001+hg002+hg003+hg004+hg005-hg38/learningRate1e-3.epoch100.learningRate1e-4.epoch200 \
--ref_fn ../hg38.fa \
--bam_fn ../aln.bam \
--vcf_fn chr$i.vcf \
--thread 4 \
--val_fn chr$i.out.txt &\true;
done
This will speed up Skyhawk to less than an hour for 3.5 million variants.
Please refer to the Clairvoyante project.
There are nine columns in the output:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
Validation result | Skyhawk Quality | Chromosome | Position | Reference Allele | Input Alternative Allele | Input Genotype | Skyhawk Alternative Allele | Skyhawk Genotype |
- There are four types of validation result:
- M: Validated variant
- X: Suspecious and unvalidated variant, manual review required
- B: Variant with multiple alternative allele, manual review required
- S: No read cover in the input BAM
Run the program to get the parameter details.
dataPrepScripts/ |
Data Preparation Scripts. Outputs are gzipped unless using standard output. Scripts in this folder are compatible with pypy . |
---|---|
GetTruth.py |
Extract the variant positions and details from a truth VCF. Input: VCF. |
CreateTensorSites.py |
Create tensorflow tensors for variants. |
skyhawk/ |
Script for variant validation. Scripts in this folder are NOT compatible with pypy . Please run with python . |
---|---|
validateVar.py |
Main program for validating variants. |
clairvoyante_test.py |
Code to utilize the Clairvoyante artificial neural network. |
The trained models are in the trainedModels/
folder.
Folder | Tech | LibPrep | Aligner | Ref | Sample |
---|---|---|---|---|---|
illumina-novoalign-2500-tspcrfree- hg001+hg002+hg003+hg004+hg005-hg38 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg38 | NA12878+NA24385+NA24149 +NA24143+NA24631 |
illumina-novoalign-2500-tspcrfree- hg001+hg002+hg003+hg004+hg005-hg19 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg19 | NA12878+NA24385+NA24149 +NA24143+NA24631 |
illumina-novoalign-2500-tspcrfree- hg001+hg002+hg003+hg004-hg38 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg38 | NA12878+NA24385+NA24149 +NA24143 |
illumina-novoalign-2500-tspcrfree- hg001+hg002-hg38 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg38 | NA12878+NA24385 |
illumina-novoalign-2500-tspcrfree- hg001-hg38 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg38 | NA12878 |
illumina-novoalign-2500-tspcrfree- hg002-hg38 |
Illumina HiSeq2500 | TruSeq PCR-free | Nonoalign 3.02.07 | hg38 | NA24385 |
illumina-isaac-2000-tspcrfree- hg001-hg19 |
Illumina HiSeq2000 | TruSeq PCR-free | Isaac | hg19 | NA12878 |
illumina-isaac-4000-tsnano550- hg001-hg19 |
Illumina HiSeq4000 | TruSeq Nano 550 | Isaac | hg19 | NA12878 |
illumina-isaac-x-tsnano2.5- hg001-hg19 |
Illumina HiSeq X | TruSeq Nano 2.5 | Isaac | hg19 | NA12878 |
illumina-isaac-ns500-tsnano350- hg001-hg19 |
Illumina NextSeq 500 | TruSeq Nano 350 | Isaac | hg19 | NA12878 |
illumina-isaac-s1xp-tsnano350- hg001-hg38 |
Illumina NovaSeq S1xp | TruSeq Nano 350 | Isaac | hg38 | NA12878 |
illumina-isaac-s2-nextraflex- hg001-hg38 |
Illumina NovaSeq S2 | TruSeq Nextera Flex | Isaac | hg38 | NA12878 |
illumina-isaac-s4-nextraflex- hg001-hg38 |
Illumina NovoSeq S4 | TruSeq Nextera Flex | Isaac | hg38 | NA12878 |
* Each folder contains one or more models. Each model contains three files suffixed data-00000-of-00001
, index
and meta
, respectively. Only the prefix is needed when using the model with Clairvoyante. Using the prefix learningRate1e-3.epoch999.learningRate1e-4.epoch1499
as an example, it means that the model has trained for 1000 epochs at learning rate 1e-3, then another 500 epochs at learning rate 1e-4. Lambda for L2 regularization was set the same as learning rate.
- Experiments in issue (#4) show that using the Novoalign models for BWA alignments will increase the accuracy of Skyhawk.
The testing dataset 'testingData.tar' includes:
- the Illumina alignments of chr21 and chr22 on GRCh38 from GIAB Github, downsampled to 50x.
- the truth variants v3.3.2 from GIAB.
Skyhawk doesn't support validating variants with more than one alternative allele. Skyhawk will mark the variants with multiple alleles as type 'B' in the results, suggesting these variants were not validated by Skyhawk and require manual validation. Although we will further extend the Skyhawk to support genome variants with multiple alternative alleles, as there are just few number of them and they are more error-prone, we suggest we always review these variants manually.