Commit 2e96056

Merge pull request #44 from ga4gh/spec_rewrite
shot at bringing draft spec up to date with adrs
2 parents f04dd1c + 8ac2c4d commit 2e96056

9 files changed: +283 -68 lines changed


docs/compare_collections.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
2+
# How to: Compare two collections
3+
4+
## Use case
5+
6+
- You have a local sequence collection and an identifier for a collection on a server. You want to compare the two to see if they share a coordinate system.
7+
- You have two identifiers for collections you know are stored by a server. You want to compare them.
8+
9+
## How to do it
10+
11+
You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections. The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples:
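As a starting point for the examples that follow, here is a minimal sketch of how a comparison might be retrieved. It assumes the server returns JSON; the function names `comparison_url` and `fetch_comparison`, and the base URL in the usage note, are hypothetical and not part of the specification.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def comparison_url(base_url: str, digest_a: str, digest_b: str) -> str:
    """Build the URL for the /comparison/:digest1/:digest2 endpoint."""
    # Percent-encode the digests so they are safe to use as path segments.
    a = quote(digest_a, safe="")
    b = quote(digest_b, safe="")
    return f"{base_url.rstrip('/')}/comparison/{a}/{b}"


def fetch_comparison(base_url: str, digest_a: str, digest_b: str) -> dict:
    """Request the comparison and parse the response body as JSON (assumed)."""
    with urlopen(comparison_url(base_url, digest_a, digest_b)) as response:
        return json.load(response)
```

For example, `fetch_comparison("https://seqcol.example.org", digest1, digest2)` would query a server at that (made-up) address.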
12+
13+
### Strict identity
14+
15+
Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order. For example, a bowtie2 index built from a sequence collection that differs in any respect (sequence names, sequence order, etc.) will not necessarily produce the same output. Therefore, we must be able to determine whether two sequence collections are identical in sequence content, sequence names, and sequence order.
16+
17+
This comparison can be done by simply comparing the seqcol digests; you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.** Therefore, if the digests differ, you know the collections differ in at least one inherent attribute. If you have a local sequence collection and an identifier, you can compare them for strict identity by computing the digest of the local collection and checking whether it matches the identifier.
18+
19+
### Order-relaxed identity
20+
21+
A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question:
22+
23+
Two collections meet the criteria for order-relaxed identity if:
24+
25+
1. the values of `elements.total.a` and `elements.total.b` match (the collections have the same number of elements).
26+
2. this value is the same as `elements.a-and-b.<attribute>` for all attributes (the content is the same).
27+
3. all entries in `elements.a-and-b-same-order.<attribute>` are false (the order differs for all attributes).
28+
29+
Then, we know the sequence collection content is identical, but in a different order.
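The three criteria above can be sketched as a small check on the parsed `/comparison` result. This is only an illustration: it assumes the result is a dict laid out as the paths in the list suggest (`elements.total`, `elements.a-and-b`, `elements.a-and-b-same-order`), and the function name is hypothetical.

```python
def is_order_relaxed_identical(comparison: dict) -> bool:
    """Check the three order-relaxed identity criteria on a comparison result."""
    elements = comparison["elements"]  # assumed layout, taken from the text
    total = elements["total"]
    shared = elements["a-and-b"]
    same_order = elements["a-and-b-same-order"]

    # 1. Both collections have the same number of elements.
    if total["a"] != total["b"]:
        return False
    # 2. Every attribute shares all of its elements between a and b.
    if any(count != total["a"] for count in shared.values()):
        return False
    # 3. No attribute is in the same order in both collections.
    return not any(same_order.values())


# Made-up comparison result for two reordered three-sequence collections.
example = {
    "elements": {
        "total": {"a": 3, "b": 3},
        "a-and-b": {"lengths": 3, "names": 3, "sequences": 3},
        "a-and-b-same-order": {"lengths": False, "names": False, "sequences": False},
    }
}
print(is_order_relaxed_identical(example))
```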
30+
31+
### Name-relaxed identity
32+
33+
Some analyses (for example, a `salmon` alignment) will produce identical results regardless of the chromosome names, because they consider only the digest of each sequence. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ."
34+
35+
How to assess: If the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical.
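A minimal sketch of this assessment, under the same assumption about the result layout (`elements.total` and `elements.a-and-b`); the function name is hypothetical:

```python
def has_identical_sequence_content(comparison: dict) -> bool:
    """True if every sequence in each collection also appears in the other."""
    elements = comparison["elements"]  # assumed layout, taken from the text
    total = elements["total"]
    shared_sequences = elements["a-and-b"].get("sequences")
    # Names and order are deliberately ignored; only sequence digests matter.
    return shared_sequences == total["a"] == total["b"]
```

Because only the `sequences` entry is inspected, differing `names` counts or order flags do not affect the answer.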
36+
37+
### Length-only compatible (shared coordinate system)
38+
39+
A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks annotated on one genome could be used in a separate process whose data had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.
40+
41+
How to assess: We ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If, however, their coordinate systems match but are not in the same order, then we need to look at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system.
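The decision procedure just described could be sketched as follows. As before, the dict layout is assumed from the attribute paths used in the text, and the function name and its three return labels are hypothetical.

```python
def coordinate_system_compatibility(comparison: dict) -> str:
    """Classify two collections as 'ordered', 'unordered', or 'incompatible'."""
    elements = comparison["elements"]  # assumed layout, taken from the text
    total = elements["total"]
    shared = elements["a-and-b"]
    same_order = elements["a-and-b-same-order"]
    n = total["a"]

    # The `sequences` attribute is ignored; names and lengths must be fully shared.
    if total["b"] != n or shared.get("names") != n or shared.get("lengths") != n:
        return "incompatible"
    # Same order for both names and lengths: ordered coordinate system.
    if same_order.get("names") and same_order.get("lengths"):
        return "ordered"
    # Otherwise the name/length pairs themselves must all match.
    if shared.get("sorted_name_length_pairs") == n:
        return "unordered"
    return "incompatible"
```

The final fallback covers the case where the same names and the same lengths are present but paired differently, which is why `sorted_name_length_pairs` is checked rather than `names` and `lengths` alone.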
42+
43+
### Others...
44+
45+
There are probably other types of compatibility you can assess using the result of the `/comparison` function. Once you understand what the comparison results mean, you should be able to work out whether a particular type of compatibility can be assessed for your use case.

docs/digest_from_collection.md

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
2+
# How to: Compute a seqcol digest given a sequence collection
3+
4+
## Use case
5+
6+
7+
One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Use an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard).
8+
9+
10+
## 1. Using existing implementations
11+
12+
### Reference implementation in Python
13+
14+
If working from within Python, you can use the reference implementation like this:
15+
16+
1. Install the seqcol package with some variant of `pip install seqcol`.
17+
2. Build up your canonical seqcol object.
18+
3. Compute its digest:
19+
20+
```
import seqcol
seqcol.digest(seqcol_obj)
```
24+
25+
If you have a FASTA file, you could get a canonical seqcol object like this:
26+
27+
```
seqcol_obj = seqcol.csc_from_fasta(fa_file)
```
30+
31+
## 2. Implement the seqcol digest algorithm yourself
32+
33+
Follow the procedure under the section for [Encoding](/specification/#1-encoding-computing-sequence-digests-from-sequence-collections). Briefly, the steps are:
34+
35+
- **Step 1**. Organize the sequence collection data into *canonical seqcol object representation*.
36+
- **Step 2**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) (JCS) to canonicalize the value associated with each attribute individually.
37+
- **Step 3**. Digest each canonicalized attribute value using the GA4GH digest algorithm.
38+
- **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of the new seqcol object representation.
39+
- **Step 5**. Digest the final canonical representation again.
40+
41+
Details on each step can be found in the specification.
42+
43+
44+
### Example Python code for computing a seqcol encoding
45+
46+
```python
# Demo for encoding a sequence collection

import binascii
import hashlib
import json


def canonical_str(item) -> str:
    """Convert a JSON-serializable value (dict or list) into a canonical string.

    json.dumps with sorted keys and compact separators approximates RFC-8785;
    it is sufficient for the integers and plain strings used in this demo.
    """
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    )


def trunc512_digest(seq, offset=24):
    """GA4GH digest function: hex-encode the first `offset` bytes of the SHA-512 digest"""
    digest = hashlib.sha512(seq.encode()).digest()
    hex_digest = binascii.hexlify(digest[:offset])
    return hex_digest.decode()


# Step 1: Get data as canonical seqcol object representation

seqcol_obj = {
    "lengths": [
        248956422,
        133797422,
        135086622
    ],
    "names": [
        "chr1",
        "chr2",
        "chr3"
    ],
    "sequences": [
        "2648ae1bacce4ec4b6cf337dcae37816",
        "907112d17fcb73bcab1ed1c72b97ce68",
        "1511375dc2dd1b633af8cf439ae90cec"
    ]
}

# Step 1a: We would here need to remove any non-inherent attributes,
# so that only the inherent attributes contribute to the digest.
# In this example, all attributes are inherent.

# Step 2: Apply RFC-8785 to canonicalize the value
# associated with each attribute individually.

seqcol_obj2 = {}
for attribute in seqcol_obj:
    seqcol_obj2[attribute] = canonical_str(seqcol_obj[attribute])
print(seqcol_obj2)  # visualize the result

# Step 3: Digest each canonicalized attribute value
# using the GA4GH digest algorithm.

seqcol_obj3 = {}
for attribute in seqcol_obj2:
    seqcol_obj3[attribute] = trunc512_digest(seqcol_obj2[attribute])
print(json.dumps(seqcol_obj3, indent=2))  # visualize the result

# Step 4: Apply RFC-8785 again to canonicalize the JSON
# of new seqcol object representation.

seqcol_obj4 = canonical_str(seqcol_obj3)
print(seqcol_obj4)  # visualize the result

# Step 5: Digest the final canonical representation again.

seqcol_digest = trunc512_digest(seqcol_obj4)
print(seqcol_digest)  # the seqcol digest for this collection
```
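The demo leaves Step 1a as a comment because all of its attributes are inherent. As a rough sketch of how that filtering step might look, the function below wraps the five steps and drops any attribute not named as inherent. The function name `seqcol_digest`, its `inherent` parameter, and the `masks` attribute in the usage example are hypothetical; in practice the list of inherent attributes would come from the schema in use. The helpers repeat the demo's hex-encoded truncated SHA-512 and its `json.dumps` approximation of RFC-8785.

```python
import hashlib
import json


def canonical_str(item) -> str:
    """Approximate RFC-8785 canonicalization, as in the demo above."""
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    )


def trunc512_digest(text: str, offset: int = 24) -> str:
    """Hex-encode the first `offset` bytes of the SHA-512 digest, as in the demo."""
    return hashlib.sha512(text.encode()).digest()[:offset].hex()


def seqcol_digest(seqcol_obj: dict, inherent) -> str:
    """Compute the top-level digest using only the inherent attributes."""
    # Step 1a: keep only attributes that should contribute to the digest.
    inherent_only = {k: v for k, v in seqcol_obj.items() if k in inherent}
    # Steps 2-3: canonicalize and digest each attribute value individually.
    level1 = {k: trunc512_digest(canonical_str(v)) for k, v in inherent_only.items()}
    # Steps 4-5: canonicalize the object of digests and digest it again.
    return trunc512_digest(canonical_str(level1))


# Usage: a hypothetical non-inherent `masks` attribute does not change the digest.
collection = {"lengths": [4, 2], "names": ["chrA", "chrB"], "sequences": ["x", "y"]}
annotated = dict(collection, masks=["m1", "m2"])
inherent_attrs = ["lengths", "names", "sequences"]
print(seqcol_digest(collection, inherent_attrs) == seqcol_digest(annotated, inherent_attrs))
```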

docs/digest_from_fasta.md

Lines changed: 0 additions & 32 deletions
This file was deleted.

docs/fasta_from_digest.md

Lines changed: 0 additions & 4 deletions
This file was deleted.

docs/img/favicon.ico

6.14 KB
Binary file not shown.
