README.md
+2 -2
@@ -1,6 +1,6 @@
-# Seqcol Docs
+# Refget Docs
 
-This is the repository for the Seqcol specification. These docs are written using `mkdocs` and hosted on `readthedocs`.
+This is the repository for documentation of the GA4GH Refget specifications, which include both Refget Sequences and Refget Sequence Collections. These docs are written using `mkdocs` with Material for MkDocs and hosted on GitHub Pages.
 1. Refget sequences: a GA4GH-approved standard for individual sequences
-2. Refget sequence collections: a standard for collections of sequences, under review
 
 ## What is the refget sequences standard?
 
-The original refget handled sequences only. Refget enables access to reference sequences using an identifier derived from the sequence itself.
+The original refget standard, now called *Refget sequences*, handles sequences only.
+Refget sequences enables access to reference sequences using an identifier derived from the sequence itself.
+
 
 ## What is the refget sequence collections standard?
 
-*Sequence Collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
+*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
 
 - implementations of an algorithm for computing sequence identifiers;
 - a lookup service to retrieve sequences given a seqcol identifier
docs/seqcol_rationale.md
+37
@@ -82,3 +82,40 @@ One final important point. Sometimes people seeing the standard for the first ti
 For reasons outlined in the specification, for the actual computing of the identifier, it's important to use the array-based structure -- this is what enables us to use the "level 1" digests for certain comparison questions, and also provides critical performance benefits for extremely large sequence collections. But don't let this dissuade you! My critical point is this: *the way to compute the interoperable identifier does not force you to structure your data in a certain way in your service* -- it's simply the way you structure the data when you compute its identifier. You are, of course, free to store a collection however you want, in whatever structure makes sense for you. You'd just need to structure it according to the standard if you wanted to implement the algorithm for computing the digest. In fact, my implementation provides a way to retrieve the collection information in either structure.
 
 
+
+
+
+
+
+### Sequence collections without sequences
+
+Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant.
+Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this.
+First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the *length* of the sequence, and not the actual sequence of characters.
+For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence.
+On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence.
+However, this is not *strictly* true; in fact, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates.
+In other words, even if the underlying sequences aren't *exactly* the same, as long as they represent something equivalent, then the coordinates can be compared.
+A prerequisite for this is that the *lengths* of the sequences must match; it wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000 because those positions don't clearly represent the same thing; but if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions.
+
+We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system.
+Thus, actual sequence content is optional for sequence collections.
+We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names and their lengths; we just don't care about the actual characters in the sequence in this case.
+Thus, we can think of these as a sequence collection without sequence characters.
+
+An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be:
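
Separately from the diff, the name-and-length comparison described in the added text can be sketched in Python. This is a hypothetical illustration, not the specification's comparison function and not the example the added text introduces: the `names` and `lengths` keys are assumed from the text's "name and length vectors", the helper names are invented, and the sequence names and lengths are toy values (8,000 and 9,000 echo the lengths mentioned in the text).

```python
def name_length_pairs(collection):
    """Return the set of (name, length) pairs from an array-oriented collection.

    `collection` is assumed to be a dict with parallel `names` and `lengths`
    lists; no sequence content is required.
    """
    names = collection["names"]
    lengths = collection["lengths"]
    if len(names) != len(lengths):
        raise ValueError("names and lengths must have the same number of entries")
    return set(zip(names, lengths))


def same_coordinate_system(a, b):
    """True if both collections contain the same (name, length) pairs.

    Order is ignored here; that is a simplification chosen for this sketch.
    """
    return name_length_pairs(a) == name_length_pairs(b)


# Toy collections with no sequence content at all.
regions_a = {"names": ["chrA", "chrB"], "lengths": [8000, 5000]}
regions_b = {"names": ["chrB", "chrA"], "lengths": [5000, 8000]}
regions_c = {"names": ["chrA", "chrB"], "lengths": [9000, 5000]}

print(same_coordinate_system(regions_a, regions_b))  # True: same pairs, different order
print(same_coordinate_system(regions_a, regions_c))  # False: chrA is 8000 vs 9000
```

Under these assumptions, coordinates from a BED-like file defined on `regions_a` could be compared with coordinates defined on `regions_b`, but not with those defined on `regions_c`, because position 5,673 on `chrA` does not clearly refer to the same place when the lengths differ.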