Merge pull request #519 from ga4gh/variation_annotation

Merging the long-awaited variation_annotation branch into master. Kudos to the entire team, with special acknowledgments to @sarahhunt, @pcingola, @andrewjesaitis, @david4096 for contributions. Also, thank you to @pcingola for pushing discussions regarding implementing variant impact. We will pick up that issue again shortly. In the meantime, protobuf and compliance work can continue apace.
ga4gh · Feb 25, 2016 · 8bb1865
2 parents 0cc27e0 + 501bf78

Showing 16 changed files with 2,112 additions and 28 deletions.
diff --git a/doc/source/_static/variant_annotation_schema.gliffy b/doc/source/_static/variant_annotation_schema.gliffy
diff --git a/doc/source/_static/variant_annotation_schema.svg b/doc/source/_static/variant_annotation_schema.svg
diff --git a/doc/source/api/alleleAnnotations.rst b/doc/source/api/alleleAnnotations.rst
@@ -0,0 +1,100 @@
+
+Allele Annotation API
+!!!!!!!!!!!!!!!!!!!!!!
+
+See `Allele Annotation schema <../schemas/alleleAnnotations.html>`_ for a detailed reference.
+
+Introduction
+@@@@@@@@@@@@
+
+Variant alleles can be annotated by comparing them to gene annotation data
+using a variety of algorithms. A standard form of annotation is to compare 
+alleles to a transcript set and calculate the expected functional consequence 
+of the change (e.g. a variant within a protein coding transcript may change the
+amino acid sequence of the resulting protein).
+
+This API supports the mining of variant annotations by region 
+and the filtering of the results by predicted functional effect.
+
+Allele Annotation Schema Entities
+@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+
+The ``VariantAnnotation`` data model is based on the results provided by variant
+annotation programs such as VEP, SnpEff and Annovar, as well as the
+VCF `ANN format <http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf>`_.
+
+
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| Record              | Description                                                                                                         |
++=====================+=====================================================================================================================+
+| VariantAnnotationSet| A VariantAnnotationSet record groups VariantAnnotation records. It represents the comparison of a VariantSet to     |
+|                     | specified gene annotation data using specified algorithms. It holds information describing the software and         |
+|                     | annotation data versions used.                                                                                      |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| VariantAnnotation   | A VariantAnnotation record represents the result of comparing a single variant to the set of annotation data. It    |
+|                     | contains structured sub-records and a flexible key-value pair ‘info’ field.                                         |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| TranscriptEffect    | A TranscriptEffect record describes the effect of an allele on a transcript.                                        |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| AlleleLocation      | An AlleleLocation record holds the location of an allele relative to a non-genomic coordinate system such as a CDS  |
+|                     | or protein. It holds the reference and alternate sequence where appropriate.                                        |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| HGVSAnnotation      | An HGVSAnnotation record holds Human Genome Variation Society (`HGVS <http://www.hgvs.org/mutnomen/recs.html>`_)    |
+|                     | descriptions of the sequence change at genomic, transcript and protein level where relevant.                        |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+| AnalysisResult      | An AnalysisResult record holds the output of a prediction package such as SIFT on a specific allele.                |
++---------------------+---------------------------------------------------------------------------------------------------------------------+
+
+The schema is shown in the diagram below.
+
+.. image:: /_static/variant_annotation_schema.svg
+
+
+TranscriptEffect attributes
+@@@@@@@@@@@@@@@@@@@@@@@@@@@
+
+A ``VariantAnnotation`` record may have many ``TranscriptEffect`` records as one is 
+reported for each possible combination of alternate alleles and overlapping 
+transcripts. The record includes:
+
+* The identifier of the transcript feature the variant was analysed against.
+* The alternate allele of the variant analysed. This is necessary as the current variant model supports multiple alternate alleles.
+* The predicted effects of the allele on the transcript, which should be described using `Sequence Ontology <http://www.sequenceontology.org>`_ terms.
+* An ``HGVSAnnotation`` record containing variant descriptions at all relevant levels.
+* ``AlleleLocation`` records describing the changes at cDNA, CDS and protein level.
+* A set of results from prediction packages analysing the allele impact.
+* A summary impact classification reflecting the highest impact consequence.
+
+Predicted Molecular Impact Classification
+@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+
+The predicted molecular impact is a simple prioritization, popular with users
+of annotation tools, based on the putative deleteriousness of the variant
+allele on the transcript. It is usually calculated using naive algorithms
+and may not accurately predict the true impact at protein level.
+
+Predicted Molecular Impact classification is summarized using the terms:
+
++----------+-----------------------------------------------+-------------------------------------------+
+| Impact   | Meaning                                       | Example SO terms                          |
++==========+===============================================+===========================================+
+| HIGH     | Highly likely to disrupt protein function     | splice_donor_variant, stop_gained         |
++----------+-----------------------------------------------+-------------------------------------------+
+| MODERATE | Moderately likely to disrupt protein function | missense_variant, inframe_insertion       |
++----------+-----------------------------------------------+-------------------------------------------+
+| LOW      | Not likely to disrupt protein function        | synonymous_variant, stop_retained_variant |
++----------+-----------------------------------------------+-------------------------------------------+
+| MODIFIER | No predicted effect                           | 3_prime_UTR_variant, intron_variant       |
++----------+-----------------------------------------------+-------------------------------------------+
+
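Illustrative sketch, not part of this commit: one way a summary impact could be picked as the highest-impact consequence among a transcript effect's SO terms. The mapping holds only the example terms from the table above; the names `SO_TERM_IMPACT` and `summary_impact`, and the fallback to MODIFIER for unlisted terms, are assumptions made for this example.

```python
# Impact levels ordered from most to least severe, as in the table above.
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]

# Only the example SO terms from the table; a real annotator covers many more.
SO_TERM_IMPACT = {
    "splice_donor_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_insertion": "MODERATE",
    "synonymous_variant": "LOW",
    "stop_retained_variant": "LOW",
    "3_prime_UTR_variant": "MODIFIER",
    "intron_variant": "MODIFIER",
}


def summary_impact(so_terms):
    """Return the highest impact among the given SO terms.

    Unlisted terms fall back to MODIFIER (an assumption of this sketch).
    """
    impacts = [SO_TERM_IMPACT.get(term, "MODIFIER") for term in so_terms]
    if not impacts:
        return "MODIFIER"
    # The earliest entry in IMPACT_ORDER is the most severe one.
    return min(impacts, key=IMPACT_ORDER.index)
```

For example, `summary_impact(["missense_variant", "stop_gained"])` picks the level of `stop_gained`, since HIGH outranks MODERATE.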
+Search Options
+@@@@@@@@@@@@@@
+
+VariantAnnotationSets can be extracted by Dataset or VariantSet, or retrieved by id.
+
+A VariantAnnotationSet can be searched for VariantAnnotations by region, and
+filters can be applied to the results.
+
+* A region to search must be specified. This can be done by providing a reference sequence (identified by name or id) with start and end coordinates.
+* Results can be filtered by the predicted effect of the variant using a Sequence Ontology OntologyTerm.
+
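Illustrative sketch, not part of this commit: the two search options above (a mandatory region, an optional effect filter) expressed as an in-memory filter. This is not the API's request or response format. The `Annotation` class, its field names, the half-open coordinate convention, and `search_annotations` are all hypothetical and chosen only to show the filtering logic.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    # Hypothetical, simplified stand-in for a VariantAnnotation record.
    reference_name: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive
    effects: list = field(default_factory=list)  # SO term names


def search_annotations(annotations, reference_name, start, end, effects=None):
    """Return annotations overlapping [start, end) on reference_name.

    If effects is given, keep only annotations that carry at least one of
    the requested SO terms.
    """
    wanted = set(effects) if effects else None
    hits = []
    for ann in annotations:
        if ann.reference_name != reference_name:
            continue
        if ann.end <= start or ann.start >= end:
            continue  # no overlap with the requested region
        if wanted is not None and wanted.isdisjoint(ann.effects):
            continue  # none of the requested effects is present
        hits.append(ann)
    return hits
```

Calling `search_annotations(records, "1", 0, 1000, ["missense_variant"])` would return only the records on reference "1" within the region that carry a `missense_variant` effect.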
diff --git a/doc/source/api/apidesign_intro.rst b/doc/source/api/apidesign_intro.rst
@@ -4,7 +4,8 @@
 API Design
 !!!!!!!!!!
 
-
+.. _apidesign_object_ids:
+
 Object Ids
 @@@@@@@@@@
 

diff --git a/doc/source/api/index.rst b/doc/source/api/index.rst
@@ -53,5 +53,15 @@ Metadata allows organizing all the primary data types.
    metadata
 
 
+Allele Annotations
+@@@@@@@@@@@@@@@@@@
+
+Allele annotations are additional pieces of data, often generated by
+algorithms, that help to describe, classify, and understand variants.
+
+.. toctree::
+   alleleAnnotations
+
+
 .. _SAM/BAM: https://samtools.github.io/hts-specs/SAMv1.pdf
 .. _VCF: https://samtools.github.io/hts-specs/VCFv4.2.pdf
diff --git a/doc/source/api/metadata.rst b/doc/source/api/metadata.rst
@@ -31,6 +31,82 @@ provider, users should not make semantic assumptions about that data.
 Subsets of the data in a dataset can be selected for analysis using
 other metadata or attributes.
 
+.. _metadata_date_time:
+
+Date and Time Format Specifications
+-----------------------------------
+
+Date and time formats are specified as ISO 8601 compatible strings, both for
+time points and for intervals and durations.
+A required granularity may optionally be specified as part of the respective
+attributes' documentation.
+
+Time points
+===========
+
+The specification of a time point is given through the concatenation of
+
+* a date in YYYY-MM-DD
+* the designator "T" indicating a following time description
+* the time of day in HH:MM:SS.SSS form, where "SSS" represents a decimal
+  fraction of a second
+* a time zone offset in relation to UTC
+
+**Examples**
+
+* year (YYYY)
+    2015
+
+* date (e.g. date of birth) in YYYY-MM-DD
+    2015-02-10
+
+* time stamp with millisecond precision and UTC designator in YYYY-MM-DDTHH:MM:SS.SSSZ
+    2015-02-10T00:03:42.123Z
+
+**Implementations**
+
+* created
+* updated
+* many proposed in metadata branch
+
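Illustrative sketch, not part of this commit: parsing and formatting a full time point with Python's standard `datetime` module. The helper names `parse_time_point` and `format_time_point` are assumptions for this example; shorter forms such as a bare year or date would need their own format strings.

```python
from datetime import datetime, timezone

# Date, "T", time of day with fractional seconds, then a UTC offset.
TIME_POINT_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_time_point(text):
    """Parse a full time point such as 2015-02-10T00:03:42.123Z."""
    # %z accepts "Z" as well as offsets like +01:00 (Python 3.7+).
    return datetime.strptime(text, TIME_POINT_FORMAT)


def format_time_point(moment):
    """Format an aware datetime as a UTC time point with millisecond precision."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```

Parsing yields a timezone-aware `datetime`, so time points written with different offsets compare correctly.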
+Durations
+=========
+
+Durations are a specific form of intervals, without reference to time points.
+They are indicated with a leading "P", followed by unit-delimited
+quantifiers. A leading "T" is required before the start of the time components.
+Durations do not have to be normalized; "PT50H" is as valid as "P2DT2H".
+
+**Examples**
+
+* age in years in PnY
+    P44Y
+
+* age in years and months in PnYnM
+    P44Y08M
+
+* short time interval (e.g. 30min in experimental time series) in PTnM
+    PT30M
+
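Illustrative sketch, not part of this commit: splitting a duration into its components with a regular expression. The names `DURATION_RE`, `parse_duration` and `fixed_length_part` are assumptions for this example, and the pattern covers only whole-number Y/M/D and H/M/S components (no weeks, no decimal fractions).

```python
import re
from datetime import timedelta

DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Split a duration such as P44Y08M or PT30M into integer components."""
    match = DURATION_RE.match(text)
    # Reject strings with no components at all ("P") or a dangling "T".
    if not match or text == "P" or text.endswith("T"):
        raise ValueError("not a supported duration: %r" % text)
    return {name: int(value or 0) for name, value in match.groupdict().items()}


def fixed_length_part(parts):
    """Days and time components as a timedelta.

    Years and months are calendar-dependent and are deliberately left out.
    """
    return timedelta(days=parts["days"], hours=parts["hours"],
                     minutes=parts["minutes"], seconds=parts["seconds"])
```

Because normalization is not required, a 50-hour duration and a duration of 2 days plus 2 hours parse to different components but produce the same `timedelta` from `fixed_length_part`.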
+Time intervals
+==============
+
+Time intervals consist of a combination of two time designators. These can be
+either two time points for start and end, or one time point and a leading
+(time point indicates end) or trailing (time point indicates start) duration.
+The time elements are separated by a forward slash "/".
+
+**Examples**
+
+* age with date of birth in YYYY-MM-DD/PnYnMnD
+    1967-11-21/P40Y10M05D
+
+* anchored 3-month interval, e.g. a therapy cycle in YYYY-MM-DD/YYYY-MM-DD
+    2015-04-18/2015-07-17
+
+* experimental intervention of 30min in YYYY-MM-DDTHH:MM/YYYY-MM-DDTHH:MM
+    2014-12-31T23:45/2015-01-01T00:15
+
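Illustrative sketch, not part of this commit: splitting an interval at the forward slash and labelling each half. The helper names `split_interval` and `interval_days` are assumptions for this example; a half is treated as a duration simply because it starts with "P", and `interval_days` handles only intervals given as two YYYY-MM-DD dates.

```python
from datetime import datetime


def split_interval(text):
    """Split an interval at "/" and label each half as duration or time point."""
    parts = text.split("/")
    if len(parts) != 2:
        raise ValueError("an interval needs exactly one '/': %r" % text)
    kinds = ["duration" if part.startswith("P") else "time point"
             for part in parts]
    if kinds == ["duration", "duration"]:
        raise ValueError("at least one half must be a time point: %r" % text)
    return list(zip(kinds, parts))


def interval_days(text):
    """Length in days of an interval given as two YYYY-MM-DD dates."""
    (_, start), (_, end) = split_interval(text)
    fmt = "%Y-%m-%d"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days
```

A trailing duration leaves the time point as the start of the interval, while a leading duration makes the time point its end.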
 
 **Use Cases**
 

diff --git a/doc/source/schemas/Makefile b/doc/source/schemas/Makefile
@@ -44,4 +44,4 @@ clean:
 cleaner: clean
 	/bin/rm -f *.avpr
 cleanest: cleaner
-	/bin/rm -f ${RST_BASENAMES}
+	/bin/rm -f ${RST_BASENAMES}