Commit 7738f40

Merge pull request #81 from ga4gh/dev
Add spec for list, filtered list, and attribute endpoints.
2 parents 353e232 + b994915 commit 7738f40

8 files changed: +592 -222 lines changed

README.md (+2 -2)
@@ -1,6 +1,6 @@
-# Seqcol Docs
+# Refget Docs

-This is the repository for the Seqcol specification. These docs are written using `mkdocs` and hosted on `readthedocs`.
+This is the repository for documentation of the GA4GH Refget specifications, which includes both Refget Sequences and Refget Sequence Collections. These docs are written using `mkdocs` using Material for Mkdocs and hosted using GitHub Pages.

 ## Building locally

_typos.toml (+4)
@@ -0,0 +1,4 @@
+[default.extend-words]
+# Don't correct the "fiw", which shows up in some of our digest examples
+fiw = "fiw"
+Ot = "Ot"

docs/README.md (+11 -11)
@@ -1,25 +1,25 @@
-# Refget
-
-Unique identifiers and lookup service for reference sequences and sequence collections.
-
-<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">
-
+# Refget specifications

 ## What is refget?

+Refget is a protocol for identifying and distributing reference biological sequences.
+It currently consists of 2 standards:

-Refget is a protocol for identifying and distributing biological sequence references. It currently consists of 2 standards:
+1. [Refget sequences](sequences.md): a GA4GH-approved standard for individual sequences
+2. [Refget sequence collections](seqcol.md): a standard for collections of sequences, under review
+
+<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">

-1. Refget sequences: a GA4GH-approved standard for individual sequences
-2. Refget sequence collections: a standard for collections of sequences, under review

 ## What is the refget sequences standard?

-The original refget handled sequences only. Refget enables access to reference sequences using an identifier derived from the sequence itself.
+The original refget standard, now called *Refget sequences*, handles sequences only.
+Refget sequences enables access to reference sequences using an identifier derived from the sequence itself.
+

 ## What is the refget sequence collections standard?

-*Sequence Collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
+*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:

 - implementations of an algorithm for computing sequence identifiers;
 - a lookup service to retrieve sequences given a seqcol identifier

docs/contributing.md (+1 -1)
@@ -4,7 +4,7 @@ We welcome more participants! If you are interested in contributing, one of the

 ## Maintainers

-- <a href="http://databio.org">Nathan Sheffield</a>, Center for Public Health Genomics, University of Virginia
+- <a href="http://databio.org">Nathan Sheffield</a>, Department of Genome Sciences, University of Virginia
 - Andy Yates, EMBL-EBI
 - Timothee Cezard, EMBL-EBI

docs/decision_record.md (+178 -35)
Large diffs are not rendered by default.

docs/seqcol.md (+358 -173)
Large diffs are not rendered by default.

docs/seqcol_rationale.md (+37)
@@ -82,3 +82,40 @@ One final important point. Sometimes people seeing the standard for the first ti
 For reasons outlined in the specification, for the actual computing of the identifier, it's important to use the array-based structure -- this is what enables us to use the "level 1" digests for certain comparison questions, and also provides critical performance benefits for extremely large sequence collections. But don't let this dissuade you! My critical point is this: *the way to compute the interoperable identifier does not force you to structure your data in a certain way in your service* -- it's simply the way you structure the data when you compute its identifier. You are, of course, free to store a collection however you want, in whatever structure makes sense for you. You'd just need to structure it according to the standard if you wanted to implement the algorithm for computing the digest. In fact, my implementation provides a way to retrieve the collection information in either structure.


+
+
+
+
+
+### Sequence collections without sequences
+
+Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant.
+Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this.
+First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the *length* of the sequence, and not the actual sequence of characters.
+For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence.
+On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence.
+However, this is not *strictly* true; in fact, really, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates.
+In other words, even if the underlying sequences aren't *exactly* the same, as long as they represent something equivalent, then the coordinates can be compared.
+A prerequisite for this is that the *lengths* of the sequences must match; it wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000 because those positions don't clearly represent the same thing; but if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions.
+
+We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system.
+Thus, actual sequence content is optional for sequence collections.
+We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names, and their lengths, we just don't care about the actual characters in the sequence in this case.
+Thus, we can think of these as a sequence collection without sequence characters.
+
+An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be:
+
+```
+{
+  "lengths": [
+    "1216",
+    "970",
+    "1788"
+  ],
+  "names": [
+    "A",
+    "B",
+    "C"
+  ]
+}
+```
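
Reviewer note (not part of the commit): the name-and-length comparison described in the added section can be illustrated with a minimal Python sketch. This is not the comparison function defined by the seqcol specification. The function names `coordinate_system` and `compare_coordinate_systems` are hypothetical, the second collection is made up for illustration, and the sketch assumes lengths arrive as strings (as in the example above) and converts them to integers.

```python
def coordinate_system(seqcol):
    """Return the set of (name, length) pairs for a sequence collection.

    Assumes `seqcol` is a dict with parallel "names" and "lengths" arrays,
    as in the level 2 example above. No sequence content is needed.
    """
    names = seqcol["names"]
    lengths = seqcol["lengths"]
    if len(names) != len(lengths):
        raise ValueError("names and lengths must have the same number of elements")
    # Lengths are strings in the example, so normalize them to integers.
    return {(name, int(length)) for name, length in zip(names, lengths)}


def compare_coordinate_systems(a, b):
    """Report which (name, length) pairs two collections share or lack."""
    pairs_a = coordinate_system(a)
    pairs_b = coordinate_system(b)
    return {
        "shared": sorted(pairs_a & pairs_b),
        "only_in_a": sorted(pairs_a - pairs_b),
        "only_in_b": sorted(pairs_b - pairs_a),
    }


# The collection from the example above, plus a hypothetical second one
# whose sequence "C" has a different length.
collection_a = {"lengths": ["1216", "970", "1788"], "names": ["A", "B", "C"]}
collection_b = {"lengths": ["1216", "970", "2000"], "names": ["A", "B", "C"]}

result = compare_coordinate_systems(collection_a, collection_b)
print(result)
```

Under these assumptions, coordinates on "A" and "B" could be compared across the two collections, while coordinates on "C" could not, because its lengths differ.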

mkdocs.yml (+1)
@@ -54,6 +54,7 @@ extra_css:
   - stylesheets/extra.css

 markdown_extensions:
+  - admonition
   - pymdownx.highlight:
       use_pygments: true
   - pymdownx.superfences:
