Commit a64742a

Merge pull request #105 from jmarshall/sequences/core-mechanic
Rewrite Refget Sequences spec's checksum section for clarity
2 parents 5ba657c + 341a1b2 commit a64742a

2 files changed

+14
-6
lines changed

docs/sequences/README.md

Lines changed: 12 additions & 4 deletions
@@ -104,11 +104,19 @@ When calculating the checksum for a sequence, all non-base symbols (\n, spaces,
 Resulting hexadecimal checksum strings shall be considered case insensitive. 0xa is equivalent to 0xA.
 
 ## refget Checksum Algorithm
-The refget checksum algorithm is called `ga4gh`. It is based on and derived from work carried out by the GA4GH VRS group. It is defined as follows:
+The refget checksum algorithm is called `ga4gh`. It is based on and derived from work carried out by the GA4GH VRS group. The checksum of a reference sequence string is computed as follows:
 
-- SHA-512 digest of a sanitised sequence
-- A base64 url encoding of the first 24 bytes of that digest
-- The addition of `SQ.` to the string
+1. Canonicalize the sequence string by removing all non-alphabetic characters, including line terminators and other whitespace, and converting any lowercase letters to uppercase.
+
+(The canonicalised string then contains only uppercase ASCII letters `A-Z`.)
+
+1. Compute the SHA-512 digest of that canonical sequence string.
+
+1. Take the first 24 bytes of that digest and `base64url`-encode them.
+
+(This uses the URL-safe Base 64 variant described in [RFC 4648 §5](https://datatracker.ietf.org/doc/html/rfc4648#section-5), which uses the characters `A-Za-z0-9-_`. Because the length of the digest prefix taken is a multiple of three, the `=` pad character is never necessary.)
+
+1. Prepend `SQ.` to the start of the resulting 32-character text string.
 
 Services may also implement the older `TRUNC512` representation of a truncated SHA-512 digest, which uses similar ideas to the above `ga4gh` string. See later in this specification for implementation details of the TRUNC512 algorithm and conversion between `ga4gh` and `TRUNC512`.
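
For readers who want to try the four steps added in this hunk, the following is a minimal Python sketch of them. It is not part of the commit: the function name `ga4gh_checksum` is hypothetical, and the sketch assumes that "non-alphabetic" means anything other than the ASCII letters `A-Z` and `a-z`, as the parenthetical note in step 1 suggests.

```python
import base64
import hashlib


def ga4gh_checksum(sequence: str) -> str:
    """Return the `SQ.`-prefixed checksum of a sequence string.

    Hypothetical helper name; follows the four steps of the rewritten section.
    """
    # Step 1: keep only ASCII letters (assumption: non-ASCII letters are
    # dropped too) and convert them to uppercase.
    canonical = "".join(
        ch for ch in sequence if ch.isascii() and ch.isalpha()
    ).upper()

    # Step 2: SHA-512 digest of the canonical string (64 raw bytes).
    digest = hashlib.sha512(canonical.encode("ascii")).digest()

    # Step 3: base64url-encode the first 24 bytes. 24 is a multiple of 3,
    # so the result is exactly 32 characters with no '=' padding.
    encoded = base64.urlsafe_b64encode(digest[:24]).decode("ascii")

    # Step 4: prepend the `SQ.` prefix.
    return "SQ." + encoded


if __name__ == "__main__":
    # Whitespace and letter case do not affect the result.
    print(ga4gh_checksum("acgt\nACGT  acgt"))
    print(ga4gh_checksum("ACGTACGTACGT"))
```

Both calls print the same 35-character string, since the first input canonicalises to the second.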

docs/sequences/pub/ga4gh_and_TRUNC512_identifiers.pl

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ sub trunc512_digest {
 sub _ga4gh_bytes {
   my ($bytes, $digest_size) = @_;
   my $base64 = encode_base64url($bytes);
-  my $substr_offset = int($digest_size/3)*4;
-  my $ga4gh = substr($base64, 0, $substr_offset);
+  my $base64_size = int($digest_size/3)*4;
+  my $ga4gh = substr($base64, 0, $base64_size);
   return "ga4gh:SQ.${ga4gh}";
 }
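
The renamed variable holds the number of base64 characters that correspond to `$digest_size` bytes: every 3 input bytes encode to 4 output characters. The Python sketch below mirrors that arithmetic to show why cutting the encoded string at that length gives the same result as encoding only the first `digest_size` bytes. It is an illustration, not a port of the Perl file: the function name `ga4gh_from_bytes` is hypothetical, and it assumes the helper may be handed a digest longer than `digest_size` bytes and that `digest_size` is a multiple of 3.

```python
import base64
import hashlib


def ga4gh_from_bytes(digest: bytes, digest_size: int = 24) -> str:
    """Encode first, then keep the characters covering `digest_size` bytes."""
    # 3 bytes -> 4 base64 characters, so 24 bytes -> 32 characters.
    base64_size = (digest_size // 3) * 4
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return "ga4gh:SQ." + encoded[:base64_size]


# Full 64-byte SHA-512 digest of an example sequence.
full_digest = hashlib.sha512(b"ACGT").digest()

# Encode the whole digest, then truncate the text.
encode_then_cut = ga4gh_from_bytes(full_digest)

# Truncate the bytes, then encode.
cut_then_encode = "ga4gh:SQ." + base64.urlsafe_b64encode(
    full_digest[:24]
).decode("ascii")

# Both orders agree because 24 bytes end on a 3-byte group boundary.
assert encode_then_cut == cut_then_encode
```

If `digest_size` were not a multiple of 3, the two orders would no longer agree, because the last kept character would also encode bits from bytes beyond the prefix.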
