Support gnomADe AFs; Updated tests; Abandon Travis
Showing 11 changed files with 123 additions and 114 deletions.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,21 +1,19 @@ | ||
vcf<img src="https://i.giphy.com/R6X7GehJWQYms.gif" width="28">maf | ||
======= | ||
|
||
To convert a [VCF](http://samtools.github.io/hts-specs/) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild. | ||
|
||
[![Build Status](https://travis-ci.com/mskcc/vcf2maf.svg?branch=master)](https://travis-ci.com/mskcc/vcf2maf) | ||
To convert a [VCF](https://samtools.github.io/hts-specs//) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild. | ||
|
||
Quick start | ||
----------- | ||
|
||
Find the [latest stable release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`: | ||
Find the [latest release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`: | ||
|
||
export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4` | ||
curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-* | ||
perl vcf2maf.pl --man | ||
perl maf2maf.pl --man | ||
|
||
If you don't have [VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html) installed, then [follow this gist](https://gist.github.com/ckandoth/61c65ba96b011f286220fa4832ad2bc0). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this: | ||
If you don't have VEP installed, then [follow this gist](https://gist.github.com/ckandoth/4bccadcacd58aad055ed369a78bf2e7c). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this: | ||
|
||
perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf | ||
|
||
|
@@ -49,6 +47,37 @@ After tests on variant lists from many sources, `maf2vcf` and `maf2maf` are quit | |
|
||
See `data/minimalist_test_maf.tsv` for a sampler. Addition of `Tumor_Seq_Allele1` will be used to determine zygosity. Otherwise, it will try to determine zygosity from variant allele fractions, assuming that arguments `--tum-vad-col` and `--tum-depth-col` are set correctly to the names of columns containing those read counts. Specifying the `Matched_Norm_Sample_Barcode` with its respective columns containing read-counts, is also strongly recommended. Columns containing normal allele read counts can be specified using argument `--nrm-vad-col` and `--nrm-depth-col`. | ||
|
||
Docker | ||
------ | ||
|
||
Assuming you have a recent version of docker, clone the main branch and build an image as follows: | ||
|
||
git clone git@github.com:mskcc/vcf2maf.git | ||
cd vcf2maf | ||
docker build -t vcf2maf:main . | ||
docker builder prune -f | ||
|
||
Now you run the scripts in docker as follows: | ||
|
||
docker run --rm vcf2maf:main perl vcf2maf.pl --help | ||
docker run --rm vcf2maf:main perl maf2maf.pl --help | ||
|
||
Testing | ||
------- | ||
|
||
A small standalone test dataset was created by restricting VEP v112 cache/fasta to chr21 in GRCh38 and hosting that on a private server for download by CI services. We can manually fetch those as follows: | ||
|
||
wget -P tests https://data.cyri.ac/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz | ||
gzip -d tests/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz | ||
wget -P tests https://data.cyri.ac/homo_sapiens_vep_112_GRCh38_chr21.tar.gz | ||
tar -zxf tests/homo_sapiens_vep_112_GRCh38_chr21.tar.gz -C tests | ||
|
||
And the following scripts test the docker image on predefined inputs and compare outputs against expected outputs: | ||
|
||
perl tests/vcf2maf.t | ||
perl tests/vcf2vcf.t | ||
perl tests/maf2vcf.t | ||
|
||
License | ||
------- | ||
|
||
|
@@ -57,4 +86,4 @@ License | |
Citation | ||
-------- | ||
|
||
Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6.19. (2020). doi:10.5281/zenodo.593251 | ||
Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6. (2020). doi:10.5281/zenodo.593251 |
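The README text in this diff says zygosity can be inferred from variant allele fractions when the `--tum-vad-col` and `--tum-depth-col` columns are named correctly. As a rough illustration of what those two columns provide, the sketch below prints the tumor VAF for each row of a tab-delimited MAF. The function name `maf_tumor_vaf` is hypothetical and not part of vcf2maf. The sketch assumes the first non-comment line is the header, and it does not reproduce whatever cutoff the scripts themselves use to call zygosity.

```shell
# Hypothetical helper (not part of vcf2maf): print tumor VAF per MAF row.
# $1: name of the variant-allele depth column (default: t_alt_count)
# $2: name of the total depth column (default: t_depth)
maf_tumor_vaf() {
    awk -v vad="${1:-t_alt_count}" -v dep="${2:-t_depth}" '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                      # skip comment lines such as "#version 2.4"
        !header_seen {                     # first non-comment line is the header
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(vad in col) || !(dep in col)) {
                print "missing read-count column" > "/dev/stderr"
                exit 1
            }
            header_seen = 1
            next
        }
        {
            d = $(col[dep]); a = $(col[vad])
            # Report NA when depth is missing or zero, to avoid dividing by zero
            if (d + 0 > 0) printf "%.3f\n", a / d; else print "NA"
        }'
}
```

For example, `maf_tumor_vaf < data/minimalist_test_maf.tsv` would use the default column names, while `maf_tumor_vaf my_alt my_depth < input.maf` would read counts from custom columns.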
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,7 @@ | |
my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count ); | ||
my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count ); | ||
my ( $vep_path, $vep_data, $vep_forks, $buffer_size, $any_allele ) = ( "$ENV{HOME}/miniconda3/bin", "$ENV{HOME}/.vep", 4, 5000, 0 ); | ||
my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" ); | ||
my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" ); | ||
my ( $species, $ncbi_build, $cache_version, $maf_center, $max_subpop_af ) = ( "homo_sapiens", "GRCh37", "", ".", 0.0004 ); | ||
my $perl_bin = $Config{perlpath}; | ||
|
||
|
@@ -41,8 +41,9 @@ | |
MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH | ||
ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj | ||
ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE | ||
ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF | ||
gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF ); | ||
ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomADe_AF gnomADe_AFR_AF gnomADe_AMR_AF | ||
gnomADe_ASJ_AF gnomADe_EAS_AF gnomADe_FIN_AF gnomADe_NFE_AF gnomADe_OTH_AF gnomADe_SAS_AF | ||
); | ||
|
||
# Check for missing or crappy arguments | ||
unless( @ARGV and $ARGV[0]=~m/^-/ ) { | ||
|
@@ -382,7 +383,7 @@ =head1 OPTIONS | |
--species Ensembl-friendly name of species (e.g. mus_musculus for mouse) [homo_sapiens] | ||
--ncbi-build NCBI reference assembly of variants in MAF (e.g. GRCm38 for mouse) [GRCh37] | ||
--cache-version Version of offline cache to use with VEP (e.g. 75, 84, 91) [Default: Installed version] | ||
--ref-fasta Reference FASTA file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz] | ||
--ref-fasta Reference FASTA file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz] | ||
--help Print a brief help message and quit | ||
--man Print the detailed manual | ||
|
@@ -401,7 +402,6 @@ =head2 Relevant links: | |
=head1 AUTHORS | ||
Cyriac Kandoth ([email protected]) | ||
Qingguo Wang ([email protected]) | ||
=head1 LICENSE | ||
|
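The column-list hunk above replaces the `gnomAD_*` allele-frequency names with `gnomADe_*` ones. If older MAFs need to line up with the new names, a minimal sketch like the following could rewrite just the header. The function name `rename_gnomad_header` is hypothetical and not part of vcf2maf. It assumes a tab-delimited MAF whose first non-comment line is the header, and it only renames columns without re-annotating any values.

```shell
# Hypothetical helper (not part of vcf2maf): rename legacy gnomAD_* header
# columns to gnomADe_* and pass every other line through unchanged.
rename_gnomad_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         !seen && !/^#/ {
             # Only the first non-comment line (the header) is rewritten.
             # Names already starting with gnomADe_ do not match the pattern.
             for (i = 1; i <= NF; i++) sub(/^gnomAD_/, "gnomADe_", $i)
             seen = 1
         }
         { print }'
}
```

Usage would look like `rename_gnomad_header < old.maf > renamed.maf`, where both file names are placeholders.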
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,7 @@ | |
use Pod::Usage qw( pod2usage ); | ||
|
||
# Set any default paths and constants | ||
my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz"; | ||
my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz"; | ||
my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count ); | ||
my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count ); | ||
|
||
|
@@ -357,7 +357,7 @@ =head1 OPTIONS | |
--input-maf Path to input file in MAF format | ||
--output-dir Path to output directory where VCFs will be stored, one per TN-pair | ||
--output-vcf Path to output multi-sample VCF containing all TN-pairs [<output-dir>/<input-maf-name>.vcf] | ||
--ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz] | ||
--ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz] | ||
--per-tn-vcfs Specify this to generate VCFs per-TN pair, in addition to the multi-sample VCF | ||
--tum-depth-col Name of MAF column for read depth in tumor BAM [t_depth] | ||
--tum-rad-col Name of MAF column for reference allele depth in tumor BAM [t_ref_count] | ||
|
@@ -376,12 +376,11 @@ =head2 Relevant links: | |
Homepage: https://github.com/ckandoth/vcf2maf | ||
VCF format: http://samtools.github.io/hts-specs/ | ||
MAF format: https://wiki.nci.nih.gov/x/eJaPAQ | ||
MAF format: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format | ||
=head1 AUTHORS | ||
Cyriac Kandoth ([email protected]) | ||
Qingguo Wang ([email protected]) | ||
=head1 LICENSE | ||
|
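Both scripts now default `--ref-fasta` to a path under the `112_GRCh37` cache directory, so installs that still hold an older cache will not have that file. The sketch below builds the same path pattern and warns if the file is absent. The function name `default_ref_fasta` is hypothetical, and the version and build arguments are illustrative defaults taken from the hunks above.

```shell
# Hypothetical helper (not part of vcf2maf): build the default reference FASTA
# path for a VEP cache version ($1) and assembly ($2), following the pattern
# used by the script defaults.
default_ref_fasta() {
    version="${1:-112}"
    build="${2:-GRCh37}"
    printf '%s/.vep/homo_sapiens/%s_%s/Homo_sapiens.%s.dna.toplevel.fa.gz\n' \
        "$HOME" "$version" "$build" "$build"
}

# Warn early if the file is absent, so --ref-fasta can be passed explicitly.
ref=$(default_ref_fasta)
[ -f "$ref" ] || echo "Not found: $ref (pass --ref-fasta explicitly)" >&2
```

If the warning fires, passing the actual location through `--ref-fasta` avoids relying on the default.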