Commit b994915

Merge pull request #87 from ga4gh/nov-updates
Nov updates
2 parents 19a0668 + f61021c commit b994915

6 files changed

Lines changed: 467 additions & 221 deletions

docs/README.md

Lines changed: 11 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -1,25 +1,25 @@
1-
# Refget
2-
3-
Unique identifiers and lookup service for reference sequences and sequence collections.
4-
5-
<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">
6-
1+
# Refget specifications
72

83
## What is refget?
94

5+
Refget is a protocol for identifying and distributing reference biological sequences.
6+
It currently consists of 2 standards:
107

11-
Refget is a protocol for identifying and distributing biological sequence references. It currently consists of 2 standards:
8+
1. [Refget sequences](sequences.md): a GA4GH-approved standard for individual sequences
9+
2. [Refget sequence collections](seqcol.md): a standard for collections of sequences, under review
10+
11+
<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">
1212

13-
1. Refget sequences: a GA4GH-approved standard for individual sequences
14-
2. Refget sequence collections: a standard for collections of sequences, under review
1513

1614
## What is the refget sequences standard?
1715

18-
The original refget handled sequences only. Refget enables access to reference sequences using an identifier derived from the sequence itself.
16+
The original refget standard, now called *Refget sequences*, handles sequences only.
17+
Refget sequences enables access to reference sequences using an identifier derived from the sequence itself.
18+
1919

2020
## What is the refget sequence collections standard?
2121

22-
*Sequence Collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
22+
*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
2323

2424
- implementations of an algorithm for computing sequence identifiers;
2525
- a lookup service to retrieve sequences given a seqcol identifier

docs/contributing.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ We welcome more participants! If you are interested in contributing, one of the
44

55
## Maintainers
66

7-
- <a href="http://databio.org">Nathan Sheffield</a>, Center for Public Health Genomics, University of Virginia
7+
- <a href="http://databio.org">Nathan Sheffield</a>, Department of Genome Sciences, University of Virginia
88
- Andy Yates, EMBL-EBI
99
- Timothee Cezard, EMBL-EBI
1010

docs/decision_record.md

Lines changed: 107 additions & 31 deletions
@@ -8,6 +8,82 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S
88

99
[TOC]
1010

11+
## 2024-11-20 Level 2 return values should not return transient attributes
12+
13+
### Decision
14+
15+
Level 2 return values should not return transient attributes
16+
17+
### Rationale
18+
19+
We debated what the `/collection?level=2` endpoint should do with transient attributes, because their level 2 representations are not stored. One line of thought was that it could return the level 1 representation; the other was that it should include nothing. We decided that the purer approach is to include neither.
20+
21+
Another option was something like `?level=highest`, which would return level 2 representations for everything that has one, but level 1 representations for transient attributes.
22+
23+
We decided that even if the level 2 response omits that information, you can still get it from the `?level=1` endpoint. Alternatively, implementations could specify their own way.
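
The behavior described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `build_level2`, the in-memory `attribute_store` keyed by digest, and the sample digests are all hypothetical, and passthru attributes are ignored for brevity.

```python
def build_level2(level1: dict, attribute_store: dict, transient: set) -> dict:
    """Assemble a level 2 response from a level 1 representation.

    Transient attributes have no stored level 2 content, so they are
    omitted entirely rather than replaced by their level 1 digest.
    """
    level2 = {}
    for name, digest in level1.items():
        if name in transient:
            continue  # omitted; clients can still read the digest at ?level=1
        level2[name] = attribute_store[digest]
    return level2


# Hypothetical data: digests are placeholders, not real sha512t24u values.
level1 = {"names": "digest-of-names", "sorted_name_length_pairs": "digest-of-snlp"}
attribute_store = {"digest-of-names": ["chr1", "chr2"]}

response = build_level2(level1, attribute_store, {"sorted_name_length_pairs"})
print(response)
```

A client that needs the transient attribute's digest would request the same collection with `?level=1`.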
24+
25+
26+
## 2024-11-20 Custom modifiers should live in the schema under the `ga4gh` key
27+
28+
### Decision
29+
30+
Any global custom modifiers should live under a `ga4gh` key in the schema. Right now, this includes `inherent`, `transient`, and `passthru`.
31+
Local modifiers (currently just `collated`) will continue to live, raw, under the attribute they describe.
32+
33+
34+
### Rationale
35+
36+
We want to follow the standard used in the other specs (VRS), and it also seems fine to have a place to lump together our custom modifiers.
37+
We considered doing the same for `collated`, a local modifier, but opted not to for now: there is only one such modifier, it is a boolean, and it is not currently used for anything in the spec. It exists only because it could be useful for visualizing the elements in a collection. Adding another layer just for this seems unnecessary at this point.
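
As an illustration of this layout, the sketch below writes a schema as a Python dict, with the global modifiers grouped under `ga4gh` and `collated` left directly on each attribute. The attribute choices (`sorted_name_length_pairs` as transient, `author` as passthru) and the helper names are assumptions for the example; a real schema may differ.

```python
schema = {
    "type": "object",
    "properties": {
        # Local modifier: `collated` sits directly under the attribute.
        "names": {"type": "array", "collated": True, "items": {"type": "string"}},
        "lengths": {"type": "array", "collated": True, "items": {"type": "integer"}},
        "sequences": {"type": "array", "collated": True, "items": {"type": "string"}},
        "sorted_name_length_pairs": {"type": "array", "collated": False},
        "author": {"type": "string"},
    },
    "required": ["names", "lengths", "sequences"],
    # Global custom modifiers are grouped under the `ga4gh` key.
    "ga4gh": {
        "inherent": ["names", "sequences"],
        "transient": ["sorted_name_length_pairs"],
        "passthru": ["author"],
    },
}


def global_modifier(schema: dict, modifier: str) -> set:
    """Return the attribute names listed under ga4gh.<modifier>."""
    return set(schema.get("ga4gh", {}).get(modifier, []))


def is_collated(schema: dict, attribute: str) -> bool:
    """Read the local `collated` flag from the attribute definition."""
    return bool(schema["properties"][attribute].get("collated", False))


print(global_modifier(schema, "transient"))
print(is_collated(schema, "names"))
```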
38+
39+
### Linked issues
40+
41+
- <https://github.com/ga4gh/refget/issues/84>
42+
43+
## 2024-11-13 Attributes can be designated as `passthru` or `transient`.
44+
45+
### Decision
46+
47+
We add two new attribute qualifiers: transient and passthru.
48+
49+
- Passthru attributes are not digested in transition from level 2 to level 1. Most attributes of the canonical (level 2) seqcol representation are digested to create the level 1 representation. But sometimes, we have an attribute for which digesting makes little sense. These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation. Thus, we refer to them as passthru attributes.
50+
51+
52+
- Transient attributes are not retrievable from the attribute endpoint. Most attributes of the sequence collection can be retrieved through the `/attribute` endpoint. However, some attributes may not be retrievable. For example, this could happen for an attribute that we intend to be used primarily as an identifier. In this case, we don't necessarily want to store the original content that went into the digest in the database, because it might be redundant. We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable.
53+
54+
Also, a few other related decisions we finalized:
55+
- For the `collection` endpoint, the level 2 collection representation should exclude transient attributes.
56+
- The `attribute` endpoint does not provide anything for either transient or passthru attributes.
57+
- Can passthru or transient attributes be inherent? They could, but it probably doesn't really make sense. Nevertheless, there's no reason to state that they cannot be.
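
The passthru behavior can be sketched as below. This is a hedged illustration under stated assumptions: `to_level1` and `canonical_json` are hypothetical helper names, and `json.dumps` with sorted keys and compact separators only approximates the canonical JSON serialization the spec requires. The `sha512t24u` helper follows the GA4GH digest described elsewhere in this record (SHA-512, truncated to 24 bytes, base64url encoded).

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Assumption: a compact, key-sorted dump stands in for the spec's
    # canonical serialization; it is adequate for simple arrays of strings
    # and integers but is not a full canonicalization implementation.
    text = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


def to_level1(level2: dict, passthru: set) -> dict:
    """Digest every attribute except passthru ones, which are copied as-is."""
    level1 = {}
    for name, value in level2.items():
        if name in passthru:
            level1[name] = value  # same form at level 1 and level 2
        else:
            level1[name] = sha512t24u(canonical_json(value))
    return level1


level2 = {"names": ["chr1", "chr2"], "lengths": [100, 200], "author": "example-lab"}
level1 = to_level1(level2, passthru={"author"})
print(level1)
```

Transient attributes are digested in the same way; the difference is on the storage side, where the server keeps only the resulting digest and not the content that produced it.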
58+
59+
### Rationale
60+
61+
As we worked on more advanced attributes, and with the addition of the `/attribute` endpoint, we realized these changes necessitate a bit more power for the schema to specify behavior of the attributes. For the basic seqcol attributes (names, lengths, sequences) and original endpoint, the general algorithm and basic qualifiers (required, inherent, collated) suffice to describe the representation. But some more nuanced attributes require additional qualifiers to describe their intention and how the server should behave for the `/attribute` endpoint. For example, sorted_name_length_pairs and sorted_sequences are intended to provide alternative tailored identifiers and comparisons, and are not necessarily useful for independent attribute lookup. Similarly, custom extra attributes, like author or alias, may be simple appendages that don't need the complex digesting procedure we use for the basic attributes. In order to flag such attributes in a way that can govern slightly different server expectations, we need a couple of additional advanced attribute qualifiers. For this purpose, we added the passthru and transient qualifiers.
62+
63+
### Linked issues
64+
65+
- <https://github.com/ga4gh/refget/issues/86>
66+
67+
68+
## 2024-10-02 Minimal schema should now require sequences, and lengths should not be inherent.
69+
70+
### Decision
71+
72+
We will update the minimal schema with these changes: 1. Move sequences into 'required', and 2. remove lengths from 'inherent'. So the final qualifiers would be:
73+
- required: names, lengths, and sequences
74+
- inherent: names, sequences
75+
76+
77+
### Rationale
78+
79+
Originally, there was a good rationale for making sequences not required, to allow for coordinate systems to be represented as a seqcol.
80+
But with the new `/attribute` endpoint, there's a better way to handle it, using `name_length_pairs` and `sorted_name_length_pairs` attributes.
81+
Then, with sequences required, it does not make sense for lengths to be inherent because they are computable from sequences.
82+
So essentially, the attribute endpoint allows us to move away from handling coordinate systems as top-level entities, and instead moves toward using the attribute endpoint for coordinate systems.
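
A small sketch of what these qualifiers mean in practice, assuming (as described elsewhere in this record) that only inherent attributes feed the collection identifier. The constants and function names are illustrative, and the sequence values are placeholders rather than real refget digests.

```python
REQUIRED = ("names", "lengths", "sequences")  # must all be present
INHERENT = ("names", "sequences")             # assumed to feed the identifier


def check_required(collection: dict) -> None:
    """Raise ValueError if a required attribute is missing or arrays disagree in size."""
    missing = [attr for attr in REQUIRED if attr not in collection]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    sizes = {len(collection[attr]) for attr in REQUIRED}
    if len(sizes) != 1:
        raise ValueError("names, lengths, and sequences must have the same number of elements")


def inherent_subset(collection: dict) -> dict:
    """Keep only the attributes that contribute to the identifier."""
    return {attr: collection[attr] for attr in INHERENT}


collection = {
    "names": ["chr1", "chr2"],
    "lengths": [100, 200],
    "sequences": ["SQ.placeholder-1", "SQ.placeholder-2"],  # placeholder digests
}
check_required(collection)
print(inherent_subset(collection))
```

Because `lengths` is required but not inherent, it must be supplied yet does not change the identifier, which is consistent with lengths being derivable from the sequences themselves.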
83+
84+
### Linked issues
85+
86+
- <https://github.com/ga4gh/refget/issues/72>
1187

1288
## 2024-10-02 The `/collection` and `/attribute` endpoints will both be `REQUIRED`
1389

@@ -96,7 +172,7 @@ In the future if the number of proposed ancillary attributes grows, it could mov
96172

97173
### Linked issues
98174

99-
- <https://github.com/ga4gh/seqcol-spec/issues/71>
175+
- <https://github.com/ga4gh/refget/issues/71>
100176

101177

102178
## 2024-02-21 We will specify core sequence collection attributes and a process for adding new ones
@@ -120,9 +196,9 @@ Choosing to host this list as a list of issues allows the list to always be up t
120196

121197
### Linked issues
122198

123-
- <https://github.com/ga4gh/seqcol-spec/issues/50>
124-
- <https://github.com/ga4gh/seqcol-spec/issues/46>
125-
- <https://github.com/ga4gh/seqcol-spec/issues?q=is%3Aissue+is%3Aopen+label%3Aschema-term>
199+
- <https://github.com/ga4gh/refget/issues/50>
200+
- <https://github.com/ga4gh/refget/issues/46>
201+
- <https://github.com/ga4gh/refget/issues?q=is%3Aissue+is%3Aopen+label%3Aschema-term>
126202

127203
## 2024-01-10 Clarifications on the purpose and form of the JSON schema in service-info
128204

@@ -148,8 +224,8 @@ Another issue is that we wanted the schema to be a place where a user could see
148224

149225
### Linked issues
150226

151-
- <https://github.com/ga4gh/seqcol-spec/issues/50>
152-
- <https://github.com/ga4gh/seqcol-spec/issues/39>
227+
- <https://github.com/ga4gh/refget/issues/50>
228+
- <https://github.com/ga4gh/refget/issues/39>
153229

154230
## 2024-01-06 The comparison function uses more descriptive attribute names
155231

@@ -171,7 +247,7 @@ The comparison function is designed to compare two sequence collections by inter
171247

172248
### Linked issues
173249

174-
- <https://github.com/ga4gh/seqcol-spec/issues/57>
250+
- <https://github.com/ga4gh/refget/issues/57>
175251

176252

177253
## 2023-08-25 The user-facing API will neither expect nor provide prefixes
@@ -236,7 +312,7 @@ properties:
236312

237313

238314
### Linked issues
239-
- https://github.com/ga4gh/seqcol-spec/issues/40
315+
- https://github.com/ga4gh/refget/issues/40
240316

241317

242318
## 2023-07-26 There will be no metadata endpoint
@@ -256,9 +332,9 @@ We distinguished between two types of metadata:
256332

257333
### Linked issues
258334

259-
- <https://github.com/ga4gh/seqcol-spec/issues/3>
260-
- <https://github.com/ga4gh/seqcol-spec/issues/39>
261-
- <https://github.com/ga4gh/seqcol-spec/issues/40>
335+
- <https://github.com/ga4gh/refget/issues/3>
336+
- <https://github.com/ga4gh/refget/issues/39>
337+
- <https://github.com/ga4gh/refget/issues/40>
262338

263339
## 2023-07-12 - Required attributes are: lengths and names
264340

@@ -302,7 +378,7 @@ This leads us to the conclusion that *sequences* should be optional, and *names*
302378

303379
### Linked issues
304380

305-
- <https://github.com/ga4gh/seqcol-spec/issues/40>
381+
- <https://github.com/ga4gh/refget/issues/40>
306382

307383

308384
## 2023-06-14 - Internal digests SHOULD NOT be prefixed
@@ -335,7 +411,7 @@ Adding prefixes will complicate things and does not add benefits. Prefixes may b
335411

336412
### Linked issues
337413

338-
- <https://github.com/ga4gh/seqcol-spec/issues/37>
414+
- <https://github.com/ga4gh/refget/issues/37>
339415

340416

341417
## 2023-06-28 - SeqCol JSON schema defines reserved attributes without additional namespacing
@@ -400,7 +476,7 @@ Thus, we introduce the idea of *inherent* vs *non-inherent attributes*. Inherent
400476

401477
### Linked issues
402478

403-
- <https://github.com/ga4gh/seqcol-spec/issues/40>
479+
- <https://github.com/ga4gh/refget/issues/40>
404480

405481
### Alternatives considered
406482

@@ -420,7 +496,7 @@ While non-ASCII array names would be compatible with our current specification,
420496

421497
### Linked issues
422498

423-
- <https://github.com/ga4gh/seqcol-spec/issues/33>
499+
- <https://github.com/ga4gh/refget/issues/33>
424500

425501

426502
## 2023-01-25 - The digest algorithm will be the GA4GH digest
@@ -449,7 +525,7 @@ Under this scheme the string `ACGT` will result in the `sha512t24u` digest `aKF4
449525

450526
### Linked issues
451527

452-
- [https://github.com/ga4gh/seqcol-spec/issues/30](https://github.com/ga4gh/seqcol-spec/issues/30)
528+
- [https://github.com/ga4gh/refget/issues/30](https://github.com/ga4gh/refget/issues/30)
453529

454530

455531
## 2023-01-12 - How sequence collections are serialized prior to digestion
@@ -536,9 +612,9 @@ It also future-proofs the serialisation method if we ever allow complex object t
536612

537613
### Linked issues
538614

539-
- [https://github.com/ga4gh/seqcol-spec/issues/1](https://github.com/ga4gh/seqcol-spec/issues/1)
540-
- [https://github.com/ga4gh/seqcol-spec/issues/25](https://github.com/ga4gh/seqcol-spec/issues/25)
541-
- [https://github.com/ga4gh/seqcol-spec/issues/33](https://github.com/ga4gh/seqcol-spec/issues/33)
615+
- [https://github.com/ga4gh/refget/issues/1](https://github.com/ga4gh/refget/issues/1)
616+
- [https://github.com/ga4gh/refget/issues/25](https://github.com/ga4gh/refget/issues/25)
617+
- [https://github.com/ga4gh/refget/issues/33](https://github.com/ga4gh/refget/issues/33)
542618

543619

544620
### Known limitations
@@ -636,7 +712,7 @@ We should be consistent by using these terms to refer to the above representatio
636712

637713

638714
### Linked issues
639-
- <https://github.com/ga4gh/seqcol-spec/issues/25>
715+
- <https://github.com/ga4gh/refget/issues/25>
640716

641717

642718
## 2022-06-15 - Structure for the return value of the comparison API endpoint
@@ -704,8 +780,8 @@ The primary purpose of the compare function is to provide a high-level view of h
704780

705781
### Linked issues
706782

707-
- <https://github.com/ga4gh/seqcol-spec/issues/21>
708-
- <https://github.com/ga4gh/seqcol-spec/issues/7>
783+
- <https://github.com/ga4gh/refget/issues/21>
784+
- <https://github.com/ga4gh/refget/issues/7>
709785

710786
### Alternatives considered
711787

@@ -778,8 +854,8 @@ We need a formal definition of a sequence collection. The schema provides a mach
778854

779855
### Linked issues
780856

781-
- <https://github.com/ga4gh/seqcol-spec/issues/8>
782-
- <https://github.com/ga4gh/seqcol-spec/issues/6>
857+
- <https://github.com/ga4gh/refget/issues/8>
858+
- <https://github.com/ga4gh/refget/issues/6>
783859

784860

785861
## 2021-12-01 - Endpoint names and structure
@@ -825,8 +901,8 @@ For the `POST comparison` endpoint, we made 2 limitations to simplify the implem
825901

826902
### Linked issues
827903

828-
- [https://github.com/ga4gh/seqcol-spec/issues/21](https://github.com/ga4gh/seqcol-spec/issues/21)
829-
- [https://github.com/ga4gh/seqcol-spec/issues/23](https://github.com/ga4gh/seqcol-spec/issues/23)
904+
- [https://github.com/ga4gh/refget/issues/21](https://github.com/ga4gh/refget/issues/21)
905+
- [https://github.com/ga4gh/refget/issues/23](https://github.com/ga4gh/refget/issues/23)
830906

831907
## 2021-09-21 - Order will be recognized by digesting arrays in the given order, and unordered digests will be handled as extensions through additional attributes
832908

@@ -854,7 +930,7 @@ To conclude, option A seems simple and straightforward, satisfies for a basic im
854930

855931
### Linked issues
856932

857-
- https://github.com/ga4gh/seqcol-spec/issues/5
933+
- https://github.com/ga4gh/refget/issues/5
858934

859935
### Known limitations
860936

@@ -877,7 +953,7 @@ However, there are also scenarios for which the order of sequences in a collecti
877953

878954
### Linked issues
879955

880-
- [https://github.com/ga4gh/seqcol-spec/issues/5](https://github.com/ga4gh/seqcol-spec/issues/5)
956+
- [https://github.com/ga4gh/refget/issues/5](https://github.com/ga4gh/refget/issues/5)
881957

882958
### Known limitations
883959

@@ -917,8 +993,8 @@ This will allow retrieving individual attributes, and testing for identity of in
917993

918994
### Linked issues
919995

920-
- [https://github.com/ga4gh/seqcol-spec/issues/8#issuecomment-773489450](https://github.com/ga4gh/seqcol-spec/issues/8#issuecomment-773489450)
921-
- [https://github.com/ga4gh/seqcol-spec/issues/10](https://github.com/ga4gh/seqcol-spec/issues/10)
996+
- [https://github.com/ga4gh/refget/issues/8#issuecomment-773489450](https://github.com/ga4gh/refget/issues/8#issuecomment-773489450)
997+
- [https://github.com/ga4gh/refget/issues/10](https://github.com/ga4gh/refget/issues/10)
922998

923999
### Known limitations
9241000

@@ -937,7 +1013,7 @@ Should a wider GA4GH standard appear from [TASC issue 5](https://github.com/ga4g
9371013

9381014
### Linked issues
9391015

940-
- [https://github.com/ga4gh/seqcol-spec/issues/2](https://github.com/ga4gh/seqcol-spec/issues/2)
1016+
- [https://github.com/ga4gh/refget/issues/2](https://github.com/ga4gh/refget/issues/2)
9411017

9421018
### Known limitations
9431019
