diff --git a/docs/compare_collections.md b/docs/compare_collections.md new file mode 100644 index 0000000..838769e --- /dev/null +++ b/docs/compare_collections.md @@ -0,0 +1,45 @@ + +# How to: Compare two collections + +## Use case + +- You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system. +- You have two identifiers for collections you know are stored by a server. You want to compare them. + +## How to do it + +You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections. The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples + +### Strict identity + +Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order. + + This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identicial in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match. + +### Order-relaxed identity + +A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question: + +Two collections meet the criteria for order-relaxed identity if: + +1. the value of the `elements.total.a` and `elements.total.b` match, (the collections have the same number of elements). +2. this value is the same as `elements.a-and-b.` for all attributes (the content is the same) +3. all entries in `elements.a-and-b-same-order.` are false (the order differs for all attributes) + +Then, we know the sequence collection content is identical, but in a different order. + +###### Name-relaxed identity + +Some analysis (for example, a `salmon` alignment) will be identical regardless of the chromosome names, as it considers the digest of the sequence only. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ." + +How to assess: As long as the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical + +###### Length-only compatible (shared coordinate system) + +A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses. + +How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If the `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If however, their coordinate system matches but is not in the same order, then we require looking at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system. + +### Others... + +There are also probably other types of compatibility you can assess using the result of the `/comparison` function. Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case. diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md new file mode 100644 index 0000000..86b43ea --- /dev/null +++ b/docs/digest_from_collection.md @@ -0,0 +1,114 @@ + +# How to: Compute a seqcol digest given a sequence collection + +## Use case + + +One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). + + +## 1. Using existing implementations + +### Reference implementation in Python + +If working from within Python, you can use the reference implementation like this: + +1. Install the seqcol package with some variant of `pip install seqcol`. +2. Build up your canonical seqcol object +3. Compute its digest: + +``` +import seqcol +seqcol.digest(seqcol_obj) +``` + +If you have a FASTA file, you could get a canonical seqcol object like this: + +``` +seqcol_obj = seqcol.csc_from_fasta(fa_file) +``` + +## 2. Implement the seqcol digest algorithm yourself + +Follow the procedure under the section for [Encoding](/specification/#1-encoding-computing-sequence-digests-from-sequence-collections). Briefly, the steps are: + +- **Step 1**. Organize the sequence collection data into *canonical seqcol object representation*. +- **Step 2**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) (JCS) to canonicalize the value associated with each attribute individually. +- **Step 3**. Digest each canonicalized attribute value using the GA4GH digest algorithm. +- **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of new seqcol object representation. +- **Step 5**. Digest the final canonical representation again. + +Details on each step can be found in the specification. + + +### Example Python code for computing a seqcol encoding + +```python +# Demo for encoding a sequence collection + +import binascii +import hashlib +import json + +def canonical_str(item: dict) -> str: + """Convert a dict into a canonical string representation""" + return json.dumps( + item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True + ) + +def trunc512_digest(seq, offset=24): + """ GA4GH digest function """ + digest = hashlib.sha512(seq.encode()).digest() + hex_digest = binascii.hexlify(digest[:offset]) + return hex_digest.decode() + +# 1. Get data as canonical seqcol object representation + +seqcol_obj = { + "lengths": [ + 248956422, + 133797422, + 135086622 + ], + "names": [ + "chr1", + "chr2", + "chr3" + ], + "sequences": [ + "2648ae1bacce4ec4b6cf337dcae37816", + "907112d17fcb73bcab1ed1c72b97ce68", + "1511375dc2dd1b633af8cf439ae90cec" + ] +} + +# Step 1a: We would here need to remove any non-inherent attributes, +# so that only the inherent attributes contribute to the digest. +# In this example, all attributes are inherent. + +# Step 2: Apply RFC-8785 to canonicalize the value +# associated with each attribute individually. + +seqcol_obj2 = {} +for attribute in seqcol_obj: + seqcol_obj2[attribute] = canonical_str(seqcol_obj[attribute]) +seqcol_obj2 # visualize the result + +# Step 3: Digest each canonicalized attribute value +# using the GA4GH digest algorithm. + +seqcol_obj3 = {} +for attribute in seqcol_obj2: + seqcol_obj3[attribute] = trunc512_digest(seqcol_obj2[attribute]) +print(json.dumps(seqcol_obj3, indent=2)) # visualize the result + +# Step 4: Apply RFC-8785 again to canonicalize the JSON +# of new seqcol object representation. + +seqcol_obj4 = canonical_str(seqcol_obj3) +seqcol_obj4 # visualize the result + +# Step 5: Digest the final canonical representation again. + +seqcol_digest = trunc512_digest(seqcol_obj4) +``` \ No newline at end of file diff --git a/docs/digest_from_fasta.md b/docs/digest_from_fasta.md deleted file mode 100644 index 0559ba0..0000000 --- a/docs/digest_from_fasta.md +++ /dev/null @@ -1,32 +0,0 @@ - -# Compute a seqcol digest given a sequence collection - -One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). - -## 1. Using existing implementations - -### Reference implementation in Python - -If working from within Python, you can use the reference implementation like this: - -1. Install the seqcol package with some variant of `pip install seqcol`. -2. Build up your canonical seqcol object -3. Compute its digest: - -``` -seqcol.digest(seqcol_obj) -``` - - - -#### From a Canonical Sequence Collection - -If you have a sequence collection in canonical structure, you can get its digest like this: - - - -``` -import seqcol - -seqcol.digest() - diff --git a/docs/fasta_from_digest.md b/docs/fasta_from_digest.md deleted file mode 100644 index 8832e4a..0000000 --- a/docs/fasta_from_digest.md +++ /dev/null @@ -1,4 +0,0 @@ - -# Fasta from digest - -To retrieve a fasta file digest, you need to access the API endpoint of a seqcol server. diff --git a/docs/img/favicon.ico b/docs/img/favicon.ico new file mode 100644 index 0000000..350c120 Binary files /dev/null and b/docs/img/favicon.ico differ diff --git a/docs/img/seqcol_logo.svg b/docs/img/seqcol_logo.svg new file mode 100644 index 0000000..5b3e7fc --- /dev/null +++ b/docs/img/seqcol_logo.svg @@ -0,0 +1,91 @@ + + + + diff --git a/docs/sequences_from_digest.md b/docs/sequences_from_digest.md new file mode 100644 index 0000000..af7297a --- /dev/null +++ b/docs/sequences_from_digest.md @@ -0,0 +1,19 @@ + +# How to: Retrieve a collection given a digest + +## Use case + +You have a seqcol digest, and you'd like to retrieve the underlying sequence identifiers, or sequences themselves. + +## How to do it + +To look up the contents of a digest will require a seqcol service that stores the collection in a database. + +### 1. Retriving the sequence identifiers + +You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence identifiers, then you're done. + + +### 2. Retrieving underlying sequences + +If you need sequences, then you'll also need a refget server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsource to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence identifiers, and then use these identifiers to query a refget service. diff --git a/docs/specification.md b/docs/specification.md index d8be6af..5d60ea7 100644 --- a/docs/specification.md +++ b/docs/specification.md @@ -20,8 +20,8 @@ This specification is in **DRAFT** form. This is **NOT YET AN APPROVED GA4GH spe Reference sequences are fundamental to genomic analysis. To make their analysis reproducible and efficient, we require tools that can identify, store, retrieve, and compare reference sequences. The primary goal of the *Sequence Collections* (seqcol) project is **to standardize identifiers for collections of sequences**. Seqcol can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. In brief, the project specifies 3 procedures: -1. **An algorithm for encoding sequence identifiers from collections.** The GA4GH standard [refget](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences themselves. Seqcol uses refget identifiers and adds functionality to wrap them into collections. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol identifiers are defined by a hash algorithm, rather than an accession authority, and are thus de-centralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. -2. **A lookup API to retrieve a collection given an identifier.** Seqcol also specifies a RESTful API to retrieve the sequence collections given an identifier, to reproduce the exact reference genome used for analysis. +1. **An algorithm for encoding sequence identifiers.** The GA4GH standard [refget](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol identifiers are defined by a hash algorithm, rather than an accession authority, and are thus de-centralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. +2. **A lookup API to retrieve a collection given an identifier.** Seqcol specifies a RESTful API to retrieve the sequence collections given an identifier, to reproduce the exact reference genome used for analysis, instead of guessing based on a human-readable identifier. 3. **A comparison API to assess compatibility of two collections.** Finally, seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can be used to determine if analysis results based on different references genomes are compatible. @@ -157,7 +157,13 @@ This will turn the values into canonicalized string representations of the list #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. -The GA4GH digest algorithm is `TRUNC-512`. This converts the value of each attribute in the seqcol into a digest string. You will end up with a structure that looks like this: +The GA4GH digest algorithm, `sha512t24u`, was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This procedure is described as ([Hart _et al_. 2020](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883)): + +- performing a SHA-512 digest on a binary blob of data +- truncate the resulting digest to 24 bytes +- encodes the 24 bytes using `base64url` ([RFC 4648](https://datatracker.ietf.org/doc/html/rfc4648#section-5)) resulting in a 32 character string + +This converts the value of each attribute in the seqcol into a digest string. Applying this to each value will produce a structure that looks like this: ```json { @@ -267,32 +273,7 @@ An *unbalanced duplicate* is used in contrast with a *balanced duplicate*. Balan ##### Interpreting the result of the compare function -The output of the comparison function provides rich details about the two collections. These details can be used to make a variety of inferences comparing two collections. For example, here are several practical interpretations: - -###### Strict identity - -Description: Some analyses may require that the collections be *strictly identical*. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order. - -How to assess: This comparison can easily be done by simply comparing the seqcol digest; since two collections that are identical in all aspects will have the same digest, any difference in digest means they are not strictly identical. - -###### Order-relaxed identity - -Description: A downstream process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn't match. Therefore, we’d like to be able to say "these two sequence collections have identical content and sequence names, but differ in order". - -How to assess: If the `elements.total` is the same for `a` and `b`, and this number is also the same for all entries in `a-and-b`, but `a-and-b-same-order` is `false` for one or more attributes, then we know the sequence collection content is identical, but in a different order. - -###### Name-relaxed identity - -Description: Some analysis (for example, a `salmon` alignment) will be identical regardless of the chromosome names, as it considers the digest of the sequence only. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ." - -How to assess: As long as the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical - -###### Length-only compatible (shared coordinate system) - -Description: A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses. - -How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If the `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If however, their coordinate system matches but is not in the same order, then we require looking at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system. - +The output of the comparison function provides rich details about the two collections. The comparison function gives information-rich feedback about the two collections. These details can be used to make a variety of inferences comparing two collections, but it can take some thought to interpret. For more details about how to interpret the results of the comparison function to determinine different types of compatibility, please see the [howto guide on comparing sequencing collections](compare_collections.md). ### 3. Ancillary attribute management: recommended non-inherent attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to digest. As long as separate implementations keep such information in non-inherent attributes, the identifiers will remain compatibile. Furthermore, the structure for how such non-inherent metadata is retrieved will be standardized. Here, we specify standardized, useful non-inherent attributes that we recommend. diff --git a/mkdocs.yml b/mkdocs.yml index 9703116..080f2cd 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,13 +1,14 @@ site_name: Seqcol Protocol Specification -site_logo: img/collection.svg +site_logo: img/seqcol_logo.svg site_url: http://seqcol.databio.org/ repo_url: http://github.com/ga4gh/seqcol-spec nav: - Getting Started: - Seqcol specification: specification.md - Detailed how-to guides: - - Compute a digest given a FASTA file: digest_from_fasta.md - - Retrieve a fasta file given a digest: fasta_from_digest.md + - Compute a digest given a SeqCol: digest_from_collection.md + - Retrieve a SeqCol given a digest: sequences_from_digest.md + - Compare two sequence collections: compare_collections.md - Implementations: - Python: implementation_python.md - API: implementation_api.md