docs/sequences/README.md
11 additions & 9 deletions
@@ -1,20 +1,22 @@
 ---
 layout: default
-title: refget protocol
+title: refget sequences protocol
 suppress_footer: true
 ---

 # Refget Sequences v2.0.0

+> **Note on naming:** This specification was originally published as "refget", but in 2025 was renamed to "refget sequences", after the *refget sequence collections* specification was approved. The term "refget" is now used as an umbrella term covering both the sequences and sequence collections specification. This document has been updated to use "refget sequences," but "refget" may be used as shorthand to refer to "refget sequences" for historical reasons (e.g., `vnd.ga4gh.refget.v2.0.0+plain`).
+
 ## Introduction

 Reference sequences are fundamental to genomic analysis and interpretation however naming is a serious issue. For example the reference genomic sequence GRCh38/1 is also known as hg38/chr1, CM000663.2 and NC_000001.11. In addition there is no standardised way to access reference sequence from providers such as INSDC (ENA, Genbank, DDBJ), Ensembl or UCSC.

-Refget enables access to reference sequences using an identifier derived from the sequence itself.
+Refget Sequences enables access to reference sequences using an identifier derived from the sequence itself.

-Refget uses a hash algorithm (by default `MD5`) to generate a checksum identifier, which is a digest of the underlying sequence. This removes the need for a single accessioning authority to identify a reference sequence and improves the provenance of sequence used in analysis. In addition refget defines a simple scheme to retrieve reference sequence via this checksum identifier.
+Refget Sequences uses a hash algorithm (by default `MD5`) to generate a checksum identifier, which is a digest of the underlying sequence. This removes the need for a single accessioning authority to identify a reference sequence and improves the provenance of sequence used in analysis. In addition refget defines a simple scheme to retrieve reference sequence via this checksum identifier.

-Refget is intended to be used in any scenario where full or partial access to reference sequence is required e.g. the CRAM file format or a genome browser.
+Refget Sequences is intended to be used in any scenario where full or partial access to reference sequence is required e.g. the CRAM file format or a genome browser.

 ## Design principles

@@ -31,7 +33,7 @@ An OpenAPI description of this specification is available and [describes the 2.0

 ## Compliance

-Implementers can check if their refget implementations conform to the specification by using our [compliance suite](https://github.com/ga4gh/refget-compliance-suite). A summary of all known public implementations is available from our [compliance report website](https://andrewyatz.github.io/refget-compliance/).
+Implementers can check if their refget sequences implementations conform to the specification by using our [compliance suite](https://github.com/ga4gh/refget-compliance-suite). A summary of all known public implementations is available from our [compliance report website](https://andrewyatz.github.io/refget-compliance/).
The policies and processes used to perform user authentication and authorization, and the means through which access tokens are issued, are beyond the scope of this API specification. GA4GH recommends the use of the OAuth 2.0 framework ([RFC 6749](https://tools.ietf.org/html/rfc6749)) for authentication and authorization.

 ## Checksum calculation
-The recommended checksum algorithms are `MD5` (a 32 character HEX string) and a SHA-512-based system called `ga4gh` (a base64 URL-safe string, see later for details). Servers MUST support sequence retrieval by one or more of these algorithms, and are encouraged to support all to maximize interoperability. An older algorithm called `TRUNC512` existed in version 1.0.0 of refget but is now deprecated in favour of the GA4GH sequence checksum string. It is possible to translate between the `ga4gh` and `TRUNC512` systems however `TRUNC512` usage SHOULD be discouraged.
+The recommended checksum algorithms are `MD5` (a 32 character HEX string) and a SHA-512-based system called `ga4gh` (a base64 URL-safe string, see later for details). Servers MUST support sequence retrieval by one or more of these algorithms, and are encouraged to support all to maximize interoperability. An older algorithm called `TRUNC512` existed in version 1.0.0 of refget sequences but is now deprecated in favour of the GA4GH sequence checksum string. It is possible to translate between the `ga4gh` and `TRUNC512` systems however `TRUNC512` usage SHOULD be discouraged.

 When calculating the checksum for a sequence, all non-base symbols (\n, spaces, etc) must be removed and then the rest uppercased. The allowed alphabet for checksum calculation is uppercase ASCII letters (`0x41`-`0x5A` or `A-Z`).

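For readers of this diff, the checksum rules quoted in the hunk above can be sketched in Python as follows. This is a minimal illustration, not the specification's reference code. The function names are hypothetical, and "non-base symbols" is read here as "anything that is not an ASCII letter", which is an interpretation. The 24-byte truncation of the SHA-512 digest is not stated in the lines shown in this diff; it is assumed from the GA4GH truncated-digest convention that the `ga4gh` and `TRUNC512` identifiers share.

```python
import base64
import hashlib
import re


def normalise_sequence(raw: str) -> str:
    """Drop every character that is not an ASCII letter, then uppercase.

    Assumption: "non-base symbols" means anything outside A-Z / a-z.
    """
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def md5_checksum(sequence: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a normalised sequence."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def ga4gh_checksum(sequence: str) -> str:
    """Return a ga4gh-style identifier: 'SQ.' + base64url of a truncated SHA-512.

    Assumption: the digest is truncated to its first 24 bytes.
    """
    truncated = hashlib.sha512(sequence.encode("ascii")).digest()[:24]
    return "SQ." + base64.urlsafe_b64encode(truncated).decode("ascii")


def trunc512_from_ga4gh(identifier: str) -> str:
    """Translate a ga4gh identifier into the deprecated 48-character TRUNC512 hex form."""
    return base64.urlsafe_b64decode(identifier[len("SQ."):]).hex()
```

Calling `ga4gh_checksum(normalise_sequence("ac gt\n"))` hashes the string `ACGT`. Because both SHA-512-based forms encode the same truncated bytes, translating between them only changes the text encoding, which is why the hunk above describes the two systems as interconvertible.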
@@ -518,7 +520,7 @@ The algorithm performs a SHA-512 digest of a sequence and creates a `base64url`

 ### Checksum Identifier Identification

-When a checksum identifier is given to an implementation, it is the server's responsibility to compute what kind of identifier (`MD5`, `ga4gh` or `TRUNC512`) has been given. If provided, the namespace prefix should be used to figure it out. Otherwise `MD5` and `TRUNC512` can be deduced based on length; 32 and 48 characters long respectively. `ga4gh` identifiers can be detected by searching for the string `SQ.`. Should refget officially support alternative checksum based identifiers we will describe the mechanisms to resolve their identification in future versions.
+When a checksum identifier is given to an implementation, it is the server's responsibility to compute what kind of identifier (`MD5`, `ga4gh` or `TRUNC512`) has been given. If provided, the namespace prefix should be used to figure it out. Otherwise `MD5` and `TRUNC512` can be deduced based on length; 32 and 48 characters long respectively. `ga4gh` identifiers can be detected by searching for the string `SQ.`. Should refget sequences officially support alternative checksum based identifiers we will describe the mechanisms to resolve their identification in future versions.

 ## Possible Future API Enhancements

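The identifier-detection rules in the "Checksum Identifier Identification" paragraph of the hunk above could be implemented along these lines. This is a hedged sketch: `identify_checksum` is a hypothetical helper, and the `namespace:value` form with a colon separator is an assumption, since the lines shown in this diff mention a namespace prefix without giving its syntax. The hexadecimal check on unprefixed identifiers is also an addition for safety, not something the paragraph requires.

```python
import string

# Namespaces named in the paragraph above; any other prefix is rejected.
KNOWN_NAMESPACES = {"md5", "ga4gh", "trunc512"}
HEX_DIGITS = set(string.hexdigits)


def identify_checksum(identifier: str) -> tuple[str, str]:
    """Return (algorithm, bare identifier) for a checksum identifier."""
    # 1. An explicit namespace prefix takes priority (assumed "namespace:value").
    if ":" in identifier:
        namespace, value = identifier.split(":", 1)
        namespace = namespace.lower()
        if namespace not in KNOWN_NAMESPACES:
            raise ValueError(f"unknown namespace: {namespace!r}")
        return namespace, value

    # 2. ga4gh identifiers carry the literal "SQ." marker.
    if identifier.startswith("SQ."):
        return "ga4gh", identifier

    # 3. Otherwise fall back on length: 32 characters for MD5, 48 for TRUNC512.
    if all(ch in HEX_DIGITS for ch in identifier):
        if len(identifier) == 32:
            return "md5", identifier
        if len(identifier) == 48:
            return "trunc512", identifier

    raise ValueError(f"cannot determine checksum type of {identifier!r}")
```

Checking the prefix first mirrors the order given in the paragraph: the length heuristic is only a fallback for bare identifiers.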
@@ -560,7 +562,7 @@ The specification makes no attempt to enforce a strict naming authority across i
 |`ensembl`| Ensembl | Used for an identifier assigned by the Ensembl project | Active |
 |`md5`| MD5 | Prefix used to describe digests which have gone through the MD5 algorithm | Active |
 |`refseq`| RefSeq | Used for an identifier assigned by the RefSeq group | Active |
-|`trunc512`| Refget | The old checksum algorithm based on SHA-512 used in v1.0.0 of refget | Deprecated |
+|`trunc512`| Refget | The old checksum algorithm based on SHA-512 used in v1.0.0 of refget sequences | Deprecated |
 |`ga4gh`| Refget | ga4gh identifier, which are prefixed by the term `SQ.`. This is the preferred naming | Active |
 |`md5`| Refget | md5 checksum of the sequence. | Active |
 |`vmc`| VMC | Used for when an identifier is a VMC compatible digest | Deprecated |
@@ -569,7 +571,7 @@ The specification makes no attempt to enforce a strict naming authority across i

 ### v2.0.0

-- Replace refget's v1 service-info implementation with GA4GH discovery's definition of service-info
+- Replace refget sequences v1 service-info implementation with GA4GH discovery's definition of service-info
 - Move code examples out into a Python notebook and a Perl script
 - Replace TRUNC512 with ga4gh identifier as the default SHA-512-based hash identifier (support still available for TRUNC512)
 - All checksums can be requested namespaced with their algorithm