|
1 |
| -# Refget specifications |
2 | 1 |
|
3 |
| -## What is refget? |
| 2 | +{ width="300" align=right } |
4 | 3 |
|
5 |
| -Refget is a protocol for identifying and distributing reference biological sequences. |
6 |
| -It currently consists of 2 standards: |
7 | 4 |
|
8 |
| -1. [Refget sequences](sequences.md): a GA4GH-approved standard for individual sequences |
9 |
| -2. [Refget sequence collections](seqcol.md): a standard for collections of sequences, under review |
| 5 | +# Refget specifications |
10 | 6 |
|
11 |
| -<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive"> |
| 7 | +## What is refget? |
12 | 8 |
|
| 9 | +Refget is a set of GA4GH standards for identifying and distributing reference biological sequences. |
| 10 | +It consists of these standards: |
13 | 11 |
|
14 |
| -## What is the refget sequences standard? |
15 | 12 |
|
16 |
| -The original refget standard, now called *Refget sequences*, handles sequences only. |
17 |
| -Refget sequences enables access to reference sequences using an identifier derived from the sequence itself. |
| 13 | +| Standard | Description | Status | |
| 14 | +| ----------- | ------------------------------------ | ------ |
| 15 | +| [Refget sequences](sequences.md) | For individual sequences | :white_check_mark: v1.0 Approved in 2021 <br>:white_check_mark: v2.0 Approved in 2024 | |
| 16 | +| [Refget sequence collections](seqcol.md) | For collections of sequences | :white_check_mark: v1.0 Approved in 2025 | |
| 17 | +| Refget pangenomes | For collections of sequence collections | :fontawesome-solid-gears: Currently in process | |
18 | 18 |
|
| 19 | +## What is the main purpose of the refget project? |
19 | 20 |
|
20 |
| -## What is the refget sequence collections standard? |
| 21 | +Refget standards help to **identify**, **retrieve**, and **compare** reference sequences, such as those in a reference genome. Key principles include:
21 | 22 |
|
22 |
| -*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides: |
| 23 | +- Reference data, including sequences and collections of sequences, are identified using cryptographic digest-based identifiers that are **derived from the data itself**. This allows reference data to be identified without requiring a centralized accessioning authority. |
| 24 | +- Refget standards can be used for any type of sequence (DNA, RNA, protein, and so on): anything that can be represented as a string of characters.
| 25 | +- Refget standards also specify **retrieval APIs**, providing a mechanism for retrieving a sequence or collection if you have its identifier. |
| 26 | +- Refget sequence collections also provides a programmatic approach to assessing compatibility among sequence collections. |
23 | 27 |
|
24 |
| -- implementations of an algorithm for computing sequence identifiers; |
25 |
| -- a lookup service to retrieve sequences given a seqcol identifier |
26 |
| -- programmatic approach to assessing compatibility among sequence collections. |
| 28 | +This image shows how the Refget Sequences standard is used by the Sequence Collections standard. First, sequences are digested to yield a deterministic identifier. These sequence identifiers are then used, together with their names, to create an identifier for a collection. |
27 | 29 |
|
| 30 | +<figure> |
| 31 | +<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive"> |
| 32 | +</figure> |
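
The two-step idea described in the added text (digest each sequence, then combine the sequence identifiers with their names) can be sketched in Python. This is a minimal illustration, not the reference implementation. It assumes the `sha512t24u` scheme (SHA-512 truncated to 24 bytes, then base64url-encoded) and the `SQ.` prefix used by Refget sequences v2. The function `toy_collection_identifier` is hypothetical and deliberately simplified; it does not follow the Sequence Collections specification, which defines its own canonicalization and digest procedure.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to the first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Derive an identifier from the sequence content itself.

    Assumption: the sequence is uppercased before digesting.
    """
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def toy_collection_identifier(named_sequences: list[tuple[str, str]]) -> str:
    """Hypothetical, simplified stand-in for a collection digest.

    NOT the seqcol algorithm: it only shows that a collection identifier
    can be built from sequence identifiers together with their names.
    """
    lines = [f"{name}\t{sequence_identifier(seq)}" for name, seq in named_sequences]
    return sha512t24u("\n".join(lines).encode("utf-8"))


if __name__ == "__main__":
    genome = [("chr1", "ACGTACGT"), ("chr2", "ttgacc")]
    for name, seq in genome:
        print(name, sequence_identifier(seq))
    print("collection", toy_collection_identifier(genome))
```

Because the identifiers depend only on the data, two parties holding the same sequences compute the same identifiers independently, which is why no central accessioning authority is needed.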