This repository was archived by the owner on Oct 28, 2022. It is now read-only.

Merge pull request #519 from ga4gh/variation_annotation
Merging the long-awaited variation_annotation branch into master.

Kudos to the entire team, with special acknowledgments to @sarahhunt, @pcingola, @andrewjesaitis, @david4096 for contributions.

Also, thank you to @pcingola for pushing discussions regarding implementing variant impact. We will pick up that issue again shortly. In the meantime, protobuf and compliance work can continue apace.
reece committed Feb 25, 2016
2 parents 0cc27e0 + 501bf78 commit 8bb1865
Showing 16 changed files with 2,112 additions and 28 deletions.
1 change: 1 addition & 0 deletions doc/source/_static/variant_annotation_schema.gliffy

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions doc/source/_static/variant_annotation_schema.svg
100 changes: 100 additions & 0 deletions doc/source/api/alleleAnnotations.rst
@@ -0,0 +1,100 @@

Allele Annotation API
!!!!!!!!!!!!!!!!!!!!!!

See `Allele Annotation schema <../schemas/alleleAnnotations.html>`_ for a detailed reference.

Introduction
@@@@@@@@@@@@

Variant alleles can be annotated by comparing them to gene annotation data
using a variety of algorithms. A standard form of annotation is to compare
alleles to a transcript set and calculate the expected functional consequence
of the change (e.g. a variant within a protein-coding transcript may change the
amino acid sequence of the resulting protein).

This API supports searching for variant annotations by region
and filtering the results by predicted functional effect.

Allele Annotation Schema Entities
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The ``VariantAnnotation`` data model is based on the results provided by variant
annotation programs such as VEP, SnpEff and Annovar, as well as the
VCF `ANN format <http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf>`_.


+----------------------+---------------------------------------------------------------------------------------------------------------------+
| Record               | Description                                                                                                         |
+======================+=====================================================================================================================+
| VariantAnnotationSet | A VariantAnnotationSet record groups VariantAnnotation records. It represents the comparison of a VariantSet to     |
|                      | specified gene annotation data using specified algorithms. It holds information describing the software and         |
|                      | annotation data versions used.                                                                                      |
+----------------------+---------------------------------------------------------------------------------------------------------------------+
| VariantAnnotation    | A VariantAnnotation record represents the result of comparing a single variant to the set of annotation data. It    |
|                      | contains structured sub-records and a flexible key-value pair ``info`` field.                                       |
+----------------------+---------------------------------------------------------------------------------------------------------------------+
| TranscriptEffect     | A TranscriptEffect record describes the effect of an allele on a transcript.                                        |
+----------------------+---------------------------------------------------------------------------------------------------------------------+
| AlleleLocation       | An AlleleLocation record holds the location of an allele relative to a non-genomic coordinate system such as a CDS  |
|                      | or protein. It holds the reference and alternate sequence where appropriate.                                        |
+----------------------+---------------------------------------------------------------------------------------------------------------------+
| HGVSAnnotation       | An HGVSAnnotation record holds Human Genome Variation Society (`HGVS <http://www.hgvs.org/mutnomen/recs.html>`_)    |
|                      | descriptions of the sequence change at genomic, transcript and protein level where relevant.                        |
+----------------------+---------------------------------------------------------------------------------------------------------------------+
| AnalysisResult       | An AnalysisResult record holds the output of a prediction package such as SIFT on a specific allele.                |
+----------------------+---------------------------------------------------------------------------------------------------------------------+

The schema is shown in the diagram below.

.. image:: /_static/variant_annotation_schema.svg


TranscriptEffect attributes
@@@@@@@@@@@@@@@@@@@@@@@@@@@

A ``VariantAnnotation`` record may have many ``TranscriptEffect`` records, because
one is reported for each possible combination of alternate allele and overlapping
transcript. Each ``TranscriptEffect`` record includes:

* The identifier of the transcript feature the variant was analysed against.
* The alternate allele of the variant analysed. This is necessary as the current variant model supports multiple alternate alleles.
* The predicted effects of the allele on the transcript, which should be described using `Sequence Ontology <http://www.sequenceontology.org>`_ terms.
* An ``HGVSAnnotation`` record containing variant descriptions at all relevant levels.
* ``AlleleLocation`` records describing the changes at cDNA, CDS and protein level.
* A set of results from prediction packages analyzing the allele impact.
* A summary impact classification reflecting the highest impact consequence.
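
The "one record per combination" rule above can be sketched in a few lines of Python. This is only an illustration: ``TranscriptEffectStub``, its field names and the transcript identifiers are hypothetical and are not taken from the schema.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class TranscriptEffectStub:
    """Illustrative stand-in; field names are hypothetical, not the schema's."""
    feature_id: str          # transcript the allele was analysed against
    alternate_bases: str     # alternate allele this record describes
    effects: list = field(default_factory=list)  # Sequence Ontology terms


def expected_effect_records(alternate_alleles, transcript_ids):
    """Create one empty stub per (alternate allele, transcript) combination."""
    return [
        TranscriptEffectStub(feature_id=transcript, alternate_bases=allele)
        for allele, transcript in product(alternate_alleles, transcript_ids)
    ]


# Two alternate alleles overlapping three transcripts (made-up identifiers).
records = expected_effect_records(["A", "T"], ["tx1", "tx2", "tx3"])
print(len(records))
```

A variant with two alternate alleles that overlaps three transcripts therefore yields one stub for each of the six allele/transcript pairs.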

Predicted Molecular Impact Classification
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The predicted molecular impact is a simple prioritization based on the putative
deleteriousness of the variant allele on the transcript. This kind of
classification is popular with users of annotation tools, but it is usually
calculated with naive algorithms and may not accurately predict the true impact
at protein level.

Predicted Molecular Impact classification is summarized using the terms:

+----------+-----------------------------------------------+-------------------------------------------+
| Impact   | Meaning                                       | Example SO terms                          |
+==========+===============================================+===========================================+
| HIGH     | Highly likely to disrupt protein function     | splice_donor_variant, stop_gained         |
+----------+-----------------------------------------------+-------------------------------------------+
| MODERATE | Moderately likely to disrupt protein function | missense_variant, inframe_insertion       |
+----------+-----------------------------------------------+-------------------------------------------+
| LOW      | Not likely to disrupt protein function        | synonymous_variant, stop_retained_variant |
+----------+-----------------------------------------------+-------------------------------------------+
| MODIFIER | No predicted effect                           | 3_prime_UTR_variant, intron_variant       |
+----------+-----------------------------------------------+-------------------------------------------+
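
As a rough sketch of how a summary impact could be derived, the Python below maps the example SO terms from the table to their impact class and picks the most severe one. The numeric ranking and the function name ``summary_impact`` are assumptions made for illustration, and only the eight example terms are covered; real annotation tools use far larger mappings.

```python
# Severity ranking assumed from the order of the table (HIGH is most severe).
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

# Only the example SO terms listed in the table; not an exhaustive mapping.
SO_TERM_IMPACT = {
    "splice_donor_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_insertion": "MODERATE",
    "synonymous_variant": "LOW",
    "stop_retained_variant": "LOW",
    "3_prime_UTR_variant": "MODIFIER",
    "intron_variant": "MODIFIER",
}


def summary_impact(so_terms):
    """Return the highest impact class among the given SO terms.

    Raises KeyError for a term outside the example mapping and
    ValueError when no terms are given.
    """
    if not so_terms:
        raise ValueError("at least one SO term is required")
    impacts = [SO_TERM_IMPACT[term] for term in so_terms]
    return max(impacts, key=IMPACT_RANK.__getitem__)


print(summary_impact(["intron_variant", "missense_variant"]))
```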

Search Options
@@@@@@@@@@@@@@

VariantAnnotationSets can be extracted by Dataset or VariantSet, or retrieved by id.

A VariantAnnotationSet can be searched for VariantAnnotations by region, and
the results can be filtered.

* A region to search must be specified by providing a reference sequence (identified by name or id) together with start and end coordinates.
* Results can be filtered by the predicted effect of the variant using a Sequence Ontology OntologyTerm.
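
The search options above might translate into a request body along the following lines. This is a hedged sketch: the function name, the dictionary keys, the "exactly one of name or id" rule and the ``0 <= start < end`` check are all assumptions for illustration; the Allele Annotation schema reference defines the real field names and semantics.

```python
def build_annotation_search(annotation_set_id, start, end,
                            reference_name=None, reference_id=None,
                            effect_terms=()):
    """Assemble a hypothetical search request as a plain dictionary."""
    # Assumption: the reference sequence is given by name or by id, not both.
    if (reference_name is None) == (reference_id is None):
        raise ValueError("give exactly one of reference_name or reference_id")
    # Assumption: coordinates form a non-empty, non-negative range.
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    request = {
        "variant_annotation_set_id": annotation_set_id,
        "start": start,
        "end": end,
        # Sequence Ontology terms used to filter by predicted effect.
        "effects": list(effect_terms),
    }
    if reference_name is not None:
        request["reference_name"] = reference_name
    else:
        request["reference_id"] = reference_id
    return request


request = build_annotation_search(
    "set-1", 1000, 2000,
    reference_name="1",
    effect_terms=["missense_variant"],
)
print(request)
```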

2 changes: 1 addition & 1 deletion doc/source/api/apidesign_intro.rst
@@ -4,7 +4,7 @@
API Design
!!!!!!!!!!


.. _apidesign_object_ids:
Object Ids
@@@@@@@@@@

10 changes: 10 additions & 0 deletions doc/source/api/index.rst
@@ -53,5 +53,15 @@ Metadata allows organizing all the primary data types.
metadata


Allele Annotations
@@@@@@@@@@@@@@@@@@

Allele annotations are additional pieces of data often generated by
algorithms which help to describe, classify, and understand variants.

.. toctree::

   alleleAnnotations


.. _SAM/BAM: https://samtools.github.io/hts-specs/SAMv1.pdf
.. _VCF: https://samtools.github.io/hts-specs/VCFv4.2.pdf
76 changes: 76 additions & 0 deletions doc/source/api/metadata.rst
@@ -31,6 +31,82 @@ provider, users should not make semantic assumptions about that data.
Subsets of the data in a dataset can be selected for analysis using
other metadata or attributes.

.. _metadata_date_time:

Date and Time Format Specifications
-----------------------------------

Date and time formats are specified as ISO 8601 compatible strings, for
time points as well as for intervals and durations.
Where a particular granularity is required, it is stated in the documentation
of the respective attribute.

Time points
===========

The specification of a time point is given through the concatenation of

* a date in YYYY-MM-DD
* the designator "T" indicating a following time description
* the time of day in HH:MM:SS.SSS form, where "SSS" represents a decimal
  fraction of a second
* a time zone offset in relation to UTC

**Examples**

* year (YYYY)
    2015

* date (e.g. date of birth) in YYYY-MM-DD
    2015-02-10

* time stamp in milliseconds in YYYY-MM-DDTHH:MM:SS.SSS
    2015-02-10T00:03:42.123Z
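
For illustration, a full time point such as the last example can be read and written with the Python standard library. This is a minimal sketch, not part of the specification: the helper names are made up, and parsing the "Z" designator with ``%z`` assumes Python 3.7 or later.

```python
from datetime import datetime, timezone


def parse_time_point(text):
    """Parse a full time point: date, 'T', time with fraction, UTC offset."""
    # %f accepts the three-digit fraction; %z accepts 'Z' or a numeric offset.
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def format_time_point(moment):
    """Render an aware datetime in UTC with millisecond precision."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


stamp = parse_time_point("2015-02-10T00:03:42.123Z")
print(format_time_point(stamp))
```

Reduced-precision values such as a bare year or date need their own format strings; this sketch handles only the full time stamp form.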

**Implementations**

* created
* updated
* many proposed in metadata branch

Durations
=========

Durations are a specific form of intervals, without reference to time points.
They are indicated with a leading "P", followed by unit-delimited
quantifiers. A leading "T" is required before the start of the time components.
Durations do not have to be normalized; "PT50H" is equally valid as "P2DT2H".

**Examples**

* age in years in PnY
    P44Y

* age in years and months in PnYnM
    P44Y08M

* short time interval (e.g. 30min in experimental time series) in PTnM
    PT30M
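
The duration syntax can be checked with a small regular expression. The sketch below is an assumption-laden illustration rather than a complete ISO 8601 parser: it accepts only whole-number components in the order years, months, days, hours, minutes, seconds, and the function names are hypothetical. It also shows that 50 hours written as "PT50H" and as "P2DT2H" describe the same length of time.

```python
import re

# Date components come before "T", time components after it.
_DURATION = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Return the components present in a duration string as integers."""
    match = _DURATION.match(text)
    if match is None or text.endswith("T"):
        raise ValueError(f"not a supported duration: {text!r}")
    parts = {name: int(value)
             for name, value in match.groupdict().items()
             if value is not None}
    if not parts:
        raise ValueError(f"duration has no components: {text!r}")
    return parts


def fixed_seconds(parts):
    """Total seconds for durations made of days, hours, minutes, seconds.

    Years and months have no fixed length, so they are rejected here.
    """
    if "years" in parts or "months" in parts:
        raise ValueError("years and months have no fixed length in seconds")
    return (parts.get("days", 0) * 86400 + parts.get("hours", 0) * 3600
            + parts.get("minutes", 0) * 60 + parts.get("seconds", 0))


print(parse_duration("P44Y08M"))
print(fixed_seconds(parse_duration("PT50H")) == fixed_seconds(parse_duration("P2DT2H")))
```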

Time intervals
==============

Time intervals consist of a combination of two time designators. These can be
either two time points for start and end, or one time point and a leading
(time point indicates end) or trailing (time point indicates start) duration.
The time elements are separated by a forward slash "/".

**Examples**

* age with date of birth in YYYY-MM-DD/PnYnMnD
    1967-11-21/P40Y10M05D

* anchored 3 month interval, e.g. a therapy cycle in YYYY-MM-DD/YYYY-MM-DD
    2015-04-18/2015-07-17

* experimental intervention of 30min in YYYY-MM-DDTHH:MM/YYYY-MM-DDTHH:MM
    2014-12-31T23:45/2015-01-01T00:15
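
A time interval can be taken apart by splitting on the slash and checking which side is a duration. The Python below is a small sketch under assumptions: the function name is hypothetical, a part is treated as a duration simply because it starts with "P", and the individual parts are not validated further.

```python
def split_interval(text):
    """Split an interval into two (kind, value) pairs.

    Each kind is either "time point" or "duration".
    """
    parts = text.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("an interval needs exactly two parts separated by '/'")
    # Durations start with the designator "P"; anything else is a time point.
    kinds = ["duration" if part.startswith("P") else "time point"
             for part in parts]
    if kinds == ["duration", "duration"]:
        raise ValueError("at least one part must be a time point")
    return list(zip(kinds, parts))


print(split_interval("1967-11-21/P40Y10M05D"))
print(split_interval("2015-04-18/2015-07-17"))
```

A leading duration means the time point marks the end of the interval; a trailing duration means it marks the start.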


**Use Cases**

2 changes: 1 addition & 1 deletion doc/source/schemas/Makefile
@@ -44,4 +44,4 @@ clean:
cleaner: clean
/bin/rm -f *.avpr
cleanest: cleaner
/bin/rm -f ${RST_BASENAMES}
/bin/rm -f ${RST_BASENAMES}