From f975c55ca68c4624eb9e39254cca253fde8240c0 Mon Sep 17 00:00:00 2001 From: nsheff Date: Fri, 28 Jul 2023 16:34:58 -0400 Subject: [PATCH 1/6] update logo --- docs/img/seqcol_logo.svg | 91 ++++++++++++++++++++++++++++++++++++++++ mkdocs.yml | 2 +- 2 files changed, 92 insertions(+), 1 deletion(-) create mode 100644 docs/img/seqcol_logo.svg diff --git a/docs/img/seqcol_logo.svg b/docs/img/seqcol_logo.svg new file mode 100644 index 0000000..5b3e7fc --- /dev/null +++ b/docs/img/seqcol_logo.svg @@ -0,0 +1,91 @@ + + + + diff --git a/mkdocs.yml b/mkdocs.yml index 9703116..a332a22 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,5 +1,5 @@ site_name: Seqcol Protocol Specification -site_logo: img/collection.svg +site_logo: img/seqcol_logo.svg site_url: http://seqcol.databio.org/ repo_url: http://github.com/ga4gh/seqcol-spec nav: From dd4d4aa46356626355272ac9f0873a4b28d36104 Mon Sep 17 00:00:00 2001 From: nsheff Date: Fri, 28 Jul 2023 16:50:42 -0400 Subject: [PATCH 2/6] add sha512t24u description --- docs/specification.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/docs/specification.md b/docs/specification.md index 8376045..13087ce 100644 --- a/docs/specification.md +++ b/docs/specification.md @@ -20,8 +20,8 @@ This specification is in **DRAFT** form. This is **NOT YET AN APPROVED GA4GH spe Reference sequences are fundamental to genomic analysis. To make their analysis reproducible and efficient, we require tools that can identify, store, retrieve, and compare reference sequences. The primary goal of the *Sequence Collections* (seqcol) project is **to standardize identifiers for collections of sequences**. Seqcol can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. In brief, the project specifies 3 procedures: -1. 
**An algorithm for encoding sequence identifiers from collections.** The GA4GH standard [refget](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences themselves. Seqcol uses refget identifiers and adds functionality to wrap them into collections. Secol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol identifiers are defined by a hash algorithm, rather than an accession authority, and are thus de-centralized and usable for many purposes, including private or new sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. -2. **A lookup API to retrieve a collection given an identifier.** Seqcol also specifies a RESTful API to enable retrieving the sequence collections given an identifier. This allows one to retrieve the exact reference genome used for an analysis. +1. **An algorithm for encoding sequence identifiers.** The GA4GH standard [refget](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol identifiers are defined by a hash algorithm, rather than an accession authority, and are thus de-centralized and usable for many purposes, including private or new sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. +2. **A lookup API to retrieve a collection given an identifier.** Seqcol specifies a RESTful API to enable retrieving the sequence collections given an identifier. This allows one to retrieve the exact reference genome used for an analysis, instead of guessing based on a human-readable identifier. 3. 
**A comparison API to assess compatibility of two collections.** Finally, seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can be used to determine if analysis results that used different references genomes may still be compatible. @@ -154,7 +154,13 @@ This will turn the values into canonicalized string representations of the list #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. -The GA4GH digest algorithm is `TRUNC-512`. This converts the value of each attribute in the seqcol into a digest string. You will end up with a structure that looks like this: +The GA4GH digest algorithm, `sha512t24u`, was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This procedure is described as ([Hart _et al_. 2020](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883)): + +- performing a SHA-512 digest on a binary blob of data +- truncating the resulting digest to 24 bytes +- encoding the 24 bytes using `base64url` ([RFC 4648](https://datatracker.ietf.org/doc/html/rfc4648#section-5)), resulting in a 32-character string + +This converts the value of each attribute in the seqcol into a digest string. 
Applying this to each value will produce a structure that looks like this: ```json { From 1d661d372704d628298374743d0f555a7b1db865 Mon Sep 17 00:00:00 2001 From: nsheff Date: Fri, 28 Jul 2023 17:12:55 -0400 Subject: [PATCH 3/6] add use cases --- docs/compare_collections.md | 32 ++++++++++++ docs/digest_from_collection.md | 93 ++++++++++++++++++++++++++++++++++ docs/digest_from_fasta.md | 7 --- docs/fasta_from_digest.md | 4 -- docs/sequences_from_digest.md | 19 +++++++ docs/specification.md | 74 --------------------------- mkdocs.yml | 5 +- 7 files changed, 147 insertions(+), 87 deletions(-) create mode 100644 docs/compare_collections.md create mode 100644 docs/digest_from_collection.md delete mode 100644 docs/digest_from_fasta.md delete mode 100644 docs/fasta_from_digest.md create mode 100644 docs/sequences_from_digest.md diff --git a/docs/compare_collections.md b/docs/compare_collections.md new file mode 100644 index 0000000..3b7f754 --- /dev/null +++ b/docs/compare_collections.md @@ -0,0 +1,32 @@ + +# How to: Compare two collections + +## Use case + +- You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system. +- You have two identifiers for collections you know are stored by a server. You want to compare them. + + +## How to do it + +You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections. The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples + +### Strict identity + +If you're looking to ensure that the two sequence collections are *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order... 
then you actually don't need the `/comparison` endpoint; **two collections will have the same digest if they are identical in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. + +If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match. + +### Order-relaxed identity + +A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question: + +Two collections meet the criteria for order-relaxed identity if: + +1. the values of `elements.total.a` and `elements.total.b` match (the collections have the same number of elements) +2. this value is the same as `elements.a-and-b.` for all attributes (the content is the same) +3. all entries in `elements.a-and-b-same-order.` are false (the order differs for all attributes) + +### Others... + +There are many other types of compatibility you can assess using the result of the `/comparison` function, which will be documented later. diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md new file mode 100644 index 0000000..1427d93 --- /dev/null +++ b/docs/digest_from_collection.md @@ -0,0 +1,93 @@ + +# How to: Digest from collection + +## Use case + +You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. + +## How to do it + +Follow the procedure under the section for [Encoding](/specification/#1-encoding-computing-sequence-digests-from-sequence-collections). 
Briefly, the steps are: + +- **Step 1**. Organize the sequence collection data into *canonical seqcol object representation*. +- **Step 2**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) (JCS) to canonicalize the value associated with each attribute individually. +- **Step 3**. Digest each canonicalized attribute value using the GA4GH digest algorithm. +- **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of the new seqcol object representation. +- **Step 5**. Digest the final canonical representation again. + +Details on each step can be found in the specification. + + +## Example Python code for computing a seqcol encoding + +```python +# Demo for encoding a sequence collection + +import base64 +import hashlib +import json + +def canonical_str(item: dict) -> str: + """Convert a dict into a canonical string representation""" + return json.dumps( + item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True + ) + +def trunc512_digest(seq, offset=24): + """ GA4GH digest function (sha512t24u): SHA-512, truncated to `offset` bytes, base64url-encoded """ + digest = hashlib.sha512(seq.encode()).digest() + b64_digest = base64.urlsafe_b64encode(digest[:offset]) + return b64_digest.decode("ascii") + +# Step 1: Get data as canonical seqcol object representation + +seqcol_obj = { + "lengths": [ + 248956422, + 133797422, + 135086622 + ], + "names": [ + "chr1", + "chr2", + "chr3" + ], + "sequences": [ + "2648ae1bacce4ec4b6cf337dcae37816", + "907112d17fcb73bcab1ed1c72b97ce68", + "1511375dc2dd1b633af8cf439ae90cec" + ] +} + +# Step 1a: We would here need to remove any non-inherent attributes, +# so that only the inherent attributes contribute to the digest. +# In this example, all attributes are inherent. + +# Step 2: Apply RFC-8785 to canonicalize the value +# associated with each attribute individually. 
+ +seqcol_obj2 = {} +for attribute in seqcol_obj: + seqcol_obj2[attribute] = canonical_str(seqcol_obj[attribute]) +seqcol_obj2 # visualize the result + +# Step 3: Digest each canonicalized attribute value +# using the GA4GH digest algorithm. + +seqcol_obj3 = {} +for attribute in seqcol_obj2: + seqcol_obj3[attribute] = trunc512_digest(seqcol_obj2[attribute]) +print(json.dumps(seqcol_obj3, indent=2)) # visualize the result + +# Step 4: Apply RFC-8785 again to canonicalize the JSON +# of new seqcol object representation. + +seqcol_obj4 = canonical_str(seqcol_obj3) +seqcol_obj4 # visualize the result + +# Step 5: Digest the final canonical representation again. + +seqcol_digest = trunc512_digest(seqcol_obj4) + + +``` \ No newline at end of file diff --git a/docs/digest_from_fasta.md b/docs/digest_from_fasta.md deleted file mode 100644 index 420b3e2..0000000 --- a/docs/digest_from_fasta.md +++ /dev/null @@ -1,7 +0,0 @@ - -# Digest from fasta - -One of the most common uses of the seqcol specification is to compute a standard, universal identifier from a FASTA file. - -We are working on defining the final algorithm. This page is a placeholder for once the algorithm is defined. - diff --git a/docs/fasta_from_digest.md b/docs/fasta_from_digest.md deleted file mode 100644 index 8832e4a..0000000 --- a/docs/fasta_from_digest.md +++ /dev/null @@ -1,4 +0,0 @@ - -# Fasta from digest - -To retrieve a fasta file digest, you need to access the API endpoint of a seqcol server. diff --git a/docs/sequences_from_digest.md b/docs/sequences_from_digest.md new file mode 100644 index 0000000..8500b6c --- /dev/null +++ b/docs/sequences_from_digest.md @@ -0,0 +1,19 @@ + +# How to: Collection from digest + +## Use case + +You have a seqcol digest, and you'd like to retrieve the underlying sequence identifiers, or sequences themselves. + +## How to do it + +To look up the contents of a digest will require a seqcol service that stores the collection in a database. + +### 1. 
Retrieving the sequence identifiers + +You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence identifiers, then you're done. + + +### 2. Retrieving underlying sequences + +If you need sequences, then you'll also need a refget server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsourced to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence identifiers, and then use these identifiers to query a refget service. diff --git a/docs/specification.md b/docs/specification.md index 13087ce..1c3191b 100644 --- a/docs/specification.md +++ b/docs/specification.md @@ -308,77 +308,3 @@ In the canonical seqcol object structure, we first organize the sequence collect ### F2. Details of inherent and non-inherent attributes The specification in section 1, *Encoding*, described how to structure a sequence collection and then apply an algorithm to compute a digest for it. What if you have ancillary information that goes with a collection, but shouldn't contribute to the digest? We have found a lot of useful use cases for information that should go along with a seqcol, but should not contribute to the *identity* of that seqcol. This is a useful construct as it allows us to include information in a collection that does not affect the identifier that is computed for that collection. One simple example is the "author" or "uploader" of a reference sequence; this is useful information to store alongside this collection, but we wouldn't want the same collection with two different authors to have a different identifier! Seqcol refers to these as *non-inherent attributes*, meaning they are not part of the core identity of the sequence collection. 
Non-inherent attributes are defined in the seqcol schema, but excluded from the `inherent` list. - -### F3. Example Python code for computing a seqcol encoding - -```python -# Demo for encoding a sequence collection - -import binascii -import hashlib -import json - -def canonical_str(item: dict) -> str: - """Convert a dict into a canonical string representation""" - return json.dumps( - item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True - ) - -def trunc512_digest(seq, offset=24): - """ GA4GH digest function """ - digest = hashlib.sha512(seq.encode()).digest() - hex_digest = binascii.hexlify(digest[:offset]) - return hex_digest.decode() - -# 1. Get data as canonical seqcol object representation - -seqcol_obj = { - "lengths": [ - 248956422, - 133797422, - 135086622 - ], - "names": [ - "chr1", - "chr2", - "chr3" - ], - "sequences": [ - "2648ae1bacce4ec4b6cf337dcae37816", - "907112d17fcb73bcab1ed1c72b97ce68", - "1511375dc2dd1b633af8cf439ae90cec" - ] -} - -# Step 1a: We would here need to remove any non-inherent attributes, -# so that only the inherent attributes contribute to the digest. -# In this example, all attributes are inherent. - -# Step 2: Apply RFC-8785 to canonicalize the value -# associated with each attribute individually. - -seqcol_obj2 = {} -for attribute in seqcol_obj: - seqcol_obj2[attribute] = canonical_str(seqcol_obj[attribute]) -seqcol_obj2 # visualize the result - -# Step 3: Digest each canonicalized attribute value -# using the GA4GH digest algorithm. - -seqcol_obj3 = {} -for attribute in seqcol_obj2: - seqcol_obj3[attribute] = trunc512_digest(seqcol_obj2[attribute]) -print(json.dumps(seqcol_obj3, indent=2)) # visualize the result - -# Step 4: Apply RFC-8785 again to canonicalize the JSON -# of new seqcol object representation. - -seqcol_obj4 = canonical_str(seqcol_obj3) -seqcol_obj4 # visualize the result - -# Step 5: Digest the final canonical representation again. 
- -seqcol_digest = trunc512_digest(seqcol_obj4) - - -``` \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index a332a22..080f2cd 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -6,8 +6,9 @@ nav: - Getting Started: - Seqcol specification: specification.md - Detailed how-to guides: - - Compute a digest given a FASTA file: digest_from_fasta.md - - Retrieve a fasta file given a digest: fasta_from_digest.md + - Compute a digest given a SeqCol: digest_from_collection.md + - Retrieve a SeqCol given a digest: sequences_from_digest.md + - Compare two sequence collections: compare_collections.md - Implementations: - Python: implementation_python.md - API: implementation_api.md From 02ee0b6395175618d3ee6e870bede3567e04bf5c Mon Sep 17 00:00:00 2001 From: nsheff Date: Tue, 22 Aug 2023 10:15:22 -0400 Subject: [PATCH 4/6] cleanup --- docs/digest_from_collection.md | 33 +++++++++++++++++++++++++++------ docs/digest_from_fasta.md | 32 -------------------------------- 2 files changed, 27 insertions(+), 38 deletions(-) delete mode 100644 docs/digest_from_fasta.md diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md index 1427d93..86b43ea 100644 --- a/docs/digest_from_collection.md +++ b/docs/digest_from_collection.md @@ -1,11 +1,34 @@ -# How to: Digest from collection +# How to: Compute a seqcol digest given a sequence collection ## Use case -You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. -## How to do it +One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). + + +## 1. 
Using existing implementations + +### Reference implementation in Python + +If working from within Python, you can use the reference implementation like this: + +1. Install the seqcol package with some variant of `pip install seqcol`. +2. Build up your canonical seqcol object +3. Compute its digest: + +``` +import seqcol +seqcol.digest(seqcol_obj) +``` + +If you have a FASTA file, you could get a canonical seqcol object like this: + +``` +seqcol_obj = seqcol.csc_from_fasta(fa_file) +``` + +## 2. Implement the seqcol digest algorithm yourself Follow the procedure under the section for [Encoding](/specification/#1-encoding-computing-sequence-digests-from-sequence-collections). Briefly, the steps are: @@ -18,7 +41,7 @@ Follow the procedure under the section for [Encoding](/specification/#1-encoding Details on each step can be found in the specification. -## Example Python code for computing a seqcol encoding +### Example Python code for computing a seqcol encoding ```python # Demo for encoding a sequence collection @@ -88,6 +111,4 @@ seqcol_obj4 # visualize the result # Step 5: Digest the final canonical representation again. seqcol_digest = trunc512_digest(seqcol_obj4) - - ``` \ No newline at end of file diff --git a/docs/digest_from_fasta.md b/docs/digest_from_fasta.md deleted file mode 100644 index 0559ba0..0000000 --- a/docs/digest_from_fasta.md +++ /dev/null @@ -1,32 +0,0 @@ - -# Compute a seqcol digest given a sequence collection - -One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). - -## 1. Using existing implementations - -### Reference implementation in Python - -If working from within Python, you can use the reference implementation like this: - -1. Install the seqcol package with some variant of `pip install seqcol`. -2. 
Build up your canonical seqcol object -3. Compute its digest: - -``` -seqcol.digest(seqcol_obj) -``` - - - -#### From a Canonical Sequence Collection - -If you have a sequence collection in canonical structure, you can get its digest like this: - - - -``` -import seqcol - -seqcol.digest() - From 793773498a9c3d0e6e19ebf27d70746ffb51277b Mon Sep 17 00:00:00 2001 From: nsheff Date: Tue, 22 Aug 2023 10:36:06 -0400 Subject: [PATCH 5/6] howto updates --- docs/compare_collections.md | 21 +++++++++++++++++---- docs/sequences_from_digest.md | 2 +- docs/specification.md | 27 +-------------------------- 3 files changed, 19 insertions(+), 31 deletions(-) diff --git a/docs/compare_collections.md b/docs/compare_collections.md index 3b7f754..838769e 100644 --- a/docs/compare_collections.md +++ b/docs/compare_collections.md @@ -6,16 +6,15 @@ - You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system. - You have two identifiers for collections you know are stored by a server. You want to compare them. - ## How to do it You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections. The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples ### Strict identity -If you're looking to ensure that the two sequence collections are *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order... then you actually don't need the `/comparison` endpoint; **two collections will have the same digest if they are identicial in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. 
+Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order. For example, a bowtie2 index built from a sequence collection that differs in any aspect (sequence name, order, etc.) will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order. -If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match. +This comparison can be done by simply comparing the seqcol digests; you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match. ### Order-relaxed identity @@ -27,6 +26,20 @@ Two collections meet the criteria for order-relaxed identity if: 2. this value is the same as `elements.a-and-b.` for all attributes (the content is the same) 3. all entries in `elements.a-and-b-same-order.` are false (the order differs for all attributes) +Then, we know the sequence collection content is identical, but in a different order. + +### Name-relaxed identity + +Some analyses (for example, a `salmon` alignment) will be identical regardless of the chromosome names, as they consider the digest of the sequence only. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ." 
+ +How to assess: As long as the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical. + +### Length-only compatible (shared coordinate system) + +A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses. + +How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If the `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If, however, their coordinate systems match but are not in the same order, then we need to look at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system. + ### Others... -There are many other types of compatibilty you can assess using the result of the `/comparison` function, which will be documented later. +There are probably other types of compatibility you can assess using the result of the `/comparison` function. Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case. 
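The checks described in this guide can be sketched in Python. This is a minimal sketch under stated assumptions, not part of the specification: it assumes the `/comparison` response has already been parsed into a dict whose keys follow the field paths named above (`elements.total`, `elements.a-and-b`, `elements.a-and-b-same-order`), and the function names `content_matches` and `interpret_comparison` are hypothetical. It reports order-relaxed identity whenever all content matches but at least one attribute is out of order, which is a looser reading than requiring every attribute to be out of order.

```python
def content_matches(result, attributes):
    """Return True if both collections have the same number of elements
    and every listed attribute shares all of those elements."""
    elements = result["elements"]
    total_a = elements["total"]["a"]
    total_b = elements["total"]["b"]
    if total_a != total_b:
        return False
    shared = elements["a-and-b"]
    return all(shared.get(attr) == total_a for attr in attributes)


def interpret_comparison(result):
    """Map a parsed comparison result (assumed structure) to a label."""
    elements = result["elements"]
    same_order = elements["a-and-b-same-order"]
    all_attrs = list(elements["a-and-b"])

    # Every shared attribute has identical content.
    if all_attrs and content_matches(result, all_attrs):
        if all(same_order.get(attr) is True for attr in all_attrs):
            return "identical content and order"
        return "order-relaxed identity"

    # Only the sequences need to match.
    if content_matches(result, ["sequences"]):
        return "name-relaxed identity"

    # Ignore sequences; look at names and lengths only.
    if content_matches(result, ["names", "lengths"]):
        if same_order.get("names") is True and same_order.get("lengths") is True:
            return "shared ordered coordinate system"
        if content_matches(result, ["sorted_name_length_pairs"]):
            return "shared unordered coordinate system"

    return "no compatibility detected by these checks"


# Illustrative input only; the numbers are made up for the example.
example = {
    "elements": {
        "total": {"a": 3, "b": 3},
        "a-and-b": {"lengths": 3, "names": 3, "sequences": 3},
        "a-and-b-same-order": {"lengths": False, "names": False, "sequences": False},
    }
}
print(interpret_comparison(example))
```

Strict identity is deliberately not handled here, since comparing the two digests directly already answers that question.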
diff --git a/docs/sequences_from_digest.md b/docs/sequences_from_digest.md index 8500b6c..af7297a 100644 --- a/docs/sequences_from_digest.md +++ b/docs/sequences_from_digest.md @@ -1,5 +1,5 @@ -# How to: Collection from digest +# How to: Retrieve a collection given a digest ## Use case diff --git a/docs/specification.md b/docs/specification.md index 804b61d..5d60ea7 100644 --- a/docs/specification.md +++ b/docs/specification.md @@ -273,32 +273,7 @@ An *unbalanced duplicate* is used in contrast with a *balanced duplicate*. Balan ##### Interpreting the result of the compare function -The output of the comparison function provides rich details about the two collections. These details can be used to make a variety of inferences comparing two collections. For example, here are several practical interpretations: - -###### Strict identity - -Description: Some analyses may require that the collections be *strictly identical*. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order. - -How to assess: This comparison can easily be done by simply comparing the seqcol digest; since two collections that are identical in all aspects will have the same digest, any difference in digest means they are not strictly identical. - -###### Order-relaxed identity - -Description: A downstream process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn't match. Therefore, we’d like to be able to say "these two sequence collections have identical content and sequence names, but differ in order". 
- -How to assess: If the `elements.total` is the same for `a` and `b`, and this number is also the same for all entries in `a-and-b`, but `a-and-b-same-order` is `false` for one or more attributes, then we know the sequence collection content is identical, but in a different order. - -###### Name-relaxed identity - -Description: Some analysis (for example, a `salmon` alignment) will be identical regardless of the chromosome names, as it considers the digest of the sequence only. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ." - -How to assess: As long as the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical - -###### Length-only compatible (shared coordinate system) - -Description: A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses. - -How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If the `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If however, their coordinate system matches but is not in the same order, then we require looking at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system. 
- +The output of the comparison function provides information-rich feedback about the two collections. These details can be used to make a variety of inferences comparing the two collections, but they can take some thought to interpret. For more details about how to interpret the results of the comparison function to determine different types of compatibility, please see the [how-to guide on comparing sequence collections](compare_collections.md). ### 3. Ancillary attribute management: recommended non-inherent attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to digest. As long as separate implementations keep such information in non-inherent attributes, the identifiers will remain compatibile. Furthermore, the structure for how such non-inherent metadata is retrieved will be standardized. Here, we specify standardized, useful non-inherent attributes that we recommend. 
From 8ac2c4d0378802667b7af648e9f3ef4e2345f4d8 Mon Sep 17 00:00:00 2001 From: nsheff Date: Tue, 22 Aug 2023 10:47:47 -0400 Subject: [PATCH 6/6] favicon --- docs/img/favicon.ico | Bin 0 -> 6289 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/img/favicon.ico diff --git a/docs/img/favicon.ico b/docs/img/favicon.ico new file mode 100644 index 0000000000000000000000000000000000000000..350c120fde6e15c98f1a57a5a7c6166af9e70dd5 GIT binary patch literal 6289 zcmV;C7;fi@P)pF8FWQhbW?9;ba!ELWdL_~cP?peYja~^aAhuUa%Y?FJQ@H17#>MP zK~#90?Ol18Q`Oo3z2|0zMF(VY!KEs=6+tA)03vNct+lpV1s$ymA~MMcqP3N;F0DAy zYPD7^3NuNjer~lYE)}g?t5O$)ncT3bXqBRERj@82Vqs>I+~4=dy*D`t$xLQ$^7Q%o z^!GgT%vtU|@6DTY&+@*%^F~Obub4QobyzxWtORf{fX@MZ24EwASCq0IZ*8qz+vmms zOq#TMP%>FHT`6liDMtX<1AyVkdRK(6T9$giH2teQ%GooO@|{V_w?udZfZ@*PI658= zZ0&2i;zhZEK;l~>e7b1k!sh_|psg+VdT-jDGUb(;9Xkdzim(B|fD&$bnUs)e`rqu0 zoN?pU@8a{N7Kv~!fPVS)N?B7|TWkN^n>KUq@9T|@aQ*%lMED|rPuv1hjwNMvZEgIt zQrip$&Eq>d`oAf{OOaJbDTw2QKEd^y$t`1UmILR*OuUdq6O_}maO|g3dfy69Q?gUWf6fKU&gGZaD|Cx9^ zkSui1>eUn81@N*~RIQYC|HO%{!-~nNt&PtB@E~%BfoIy0tMZussMTE@&^w)3`iM~t-BP>f@SE$18Uo(!ByJ3HU+X0kSnfa)+ zauJUQZw0Up07Q6UFlY|c>nBfsbtr&kuzQlVa|^w~IshPLP9Tu@s$MaE{My}=vK|32 z5P(I>82~yw%Ig*tSFY3hZgjFkNy!;|lgX-^0PJO7E7yCICGK0Iv^NF-;Pa(!Kz596 zG)?~#USySmg31*UB_d~u$g?7{(yNe>BR4%j%5qZLQYl}nTOJIWQ%IR*%jDCBk-puF zoT{qiN&vrivWMr%nF9c%{Mj@EKl36707PV^qo1=xq@W&s1#U865BSraqW8Q$$b%1_ zo(Ax}0#(<+9%nl(OUF*}*La4#0m0XtY{WGE8z)U#y{Ff96N&L31GvS_ z`~8VJ0OOtPE%A8ZEw3^h=z5tB;7S0C^b2yjiHPhA;HOU30&p&XF=cv~9Ek8oPBIDL zSCx=aSGUX%;k8covvk_Hp^VLs0WcHUN%(3J*+;)1{~bIB+1|XA-z@e1Xjwl+Qn zz;RCYzIZ(F_j1bsfK8i5P6u#^lf7}xnj=3iSKA~llJ@{u>J<3^oULDzuUx&dxm*vj z^MlD{u*X_jmb$U0>pjR>;3Ply`BF_45);wVXXzCg<%&oRfYDC&Es}4b0+=wNZ9n%R zNO?Sw7{8&1TlxgQf6cKXtaY-@t5!|?q7rgQz7F7RCp%U|c6ICYo;y~{uc^d?0Dxtw z%i(Ns4I^F7F)#<(I=2hJ;!4SJ;Oa}k)?+5);s7mOt{F846DGt*0+{1uSG2a)dR{{p zmeW=%LU&BRKOPTku7n&7`rCJ)o6wc38#_L(#1mMS{0u;~omN*?LWa}!6>bv1O_h?P z;S()A#7*ey+@V_fvvNzXt&M*ZknaNkDa#Xy@vj0vAdomlDeIh6%6EOge(*dl1_Fs= 
zL~yE;y+0lgyyn(To3^6=?z?}SbT&2RM7x&G=uz*vL$tJhv1g1Mw|*B=+}V1&VOSZP zsR(~f%K3(ozM~AH28eKplkOyCp<7W~8y~Z6Tg^W=Z5p}m;Da~rdMt^ zt=mvW_nZM0bO4Od#*l%8hw!v z44R_?fy6=}Gg|||77FZ?Eq#Vca?HhZEgHSk)4OGKp=5YFlb(G80oh~ zc(POcF@R%D)BjR2IZp4bgUx#oIG>-CeI`t3+pE;iSpYpMw^K{^dNvi|9{_9=;We;H z6(Hp)rK}01@D}3nz+0qD1Ms0!G=Y>q1On#vV9$@8$~$NH2o*G zw$}cm)F(7e|N5@3>M4I(_xGy})zD}$a8j|Wy;mh!{i7VCD| zVaOf#3woUkprCSng$hxka(%w!D_vdH50SFJ2tN|x`gGcOX3d)MFZb2jzkdDLb^zgE z(7Z~7CzCQ+ga?qa8|!{9YQ#uJ^v;K8FF15gV0SMQ8gw}mLN=cdy++}WaU34*3QnYE9>i94@8N$@WSy70nv`=CUG z!{*>n$ZS!Hev^XY4x3*A z&?bTlV83!*BOM=tv}ZQ7qMp)`^&r?}>HYKW@jCJ9UwlT7x^f3=={_H_D1dr6J|>b^(^w6FNo{>MeG| zz*Z8sIoZ*3<|L-*^_w@F^8kJW07(4Ni!2eiq~{^k)#>?qT{vupMQk#l2%c$g-&Bx@ z-ACx1G4rMB${netKPaPMSXS5VuzPRPFeLNE)z`1ymxODy-lUW<&x;(Ts{a>`y%s=b zVhV@N5zdMN06s{iy3Q)Ilk$BnozbJN+|gRPtSp$8mZLtas!AW`^I4;so5!zn%MHVK zH5_Z22*%B2q(go7;2GM%MW7T)Bl8ejw{daK7AP2SmyP^>f&7Xc|!E9B1?4 z>a9P3Isi~2at>^(zKgpn_y7Qai!N#%2Jl=vCo5yI;;tDl2!~C-2u=h55!}<U z#sIjvvJZg^;OxXicw;%tl7)~;$pScIx?%8_m5`B#TaZ0Lv>m_`isY#PCL`;pn<@^`n|ZZ0K#E&ctb<{d*`3OqAY$p+nh*ur{7|p2>CFYInR88Kl$c=i#u5 zL{QG1ny#n6>J|CQEl5B2fL%q@a@gN&(z2v37RzNiHZ=ub0c5$IlOCgd^nm`bkNeMz zL~;+A*mC8!06^mSaM(P>qbwK0wRDa_s5|8jfp-jmy;~uGONwoC9U6b14A&&`jpA2E zqlJ(AD`i{_+g2?A?m!<1hXqjdEJ@jw8xEU(5&^Bq0(fC05-56tap}_FCX)5g=$9fK zH*emCU5nil3bh>`4#ywXUU34rqPaOxbUWSDH0}ct&UdaIf^5haFVOXrquiX(S0XY4 zz(Y=U5y`8&Yc&*#e_KSJv~zqKko}Q0T>`+TB*W2Y@ZnOPG88iBir{u+Nf28BJ_pF# z0KN~~_w|wUA#^$0NY0B!gZGpu=Q0u8fNbYVl2{{xH(}d6P6q&`IIXEExT3qZMC50% zd3FFWo#dl#U4G9kNa8I9<*+iGTTejt1Czu*_|CILw1}-qD};pU@O!`5_TA@M*!RxjRyW&u%R9h zjryOdt6O&1(4nKgM_S(nWHM~GzPqgYXDa0@#Y9>3xI_N1a60TLApp=w@|vDno;h>< zE;TisHHI;yy)U$-QV8=_8pF|>o?LFNi zBG)1tqvAA@**7(hXL=^GJkzWH-MrWT4}Soz)^v8JUY9@S>fEo2gkSj&MA5S(`D1Qj z`gsvVqvk+B?gn@OfRqTfkl0Ad%C4^DpXSf^9^)VuYpb;^bt-_dBHRlQA+c3N-XpD* zBJx-$6!@gahPpvx;|ky2d+$1x#PNU{18@(38#_C9J8%B{9*4T3q;XHm72#wuYY}pX zZxQ~eUq9dWNW_e#Qt3tW=1ustSR2Kfjzr?KMPvb>4$aa>#c3Y^2oaeL;PzO|ytAsx zxMJ3^LLeR=lmqx-^@ 
z9oyA4sGv#ljyu-uO{&LWZ+a|v56lU4!^rPWxn+qb*3mI|0RT5{*bno2?`^kj><2sk zV$qnly?`T!KaNIone5A!ZKzJAxD1Z{RRljo*6ZyB|32l;pMOxNNXA{i|6Y6Tn&Sni zuHJejoTz&we(Ob+NG4vx01@{fjE(m89X|#**}e{U*VhMr?M0S`q6gCRm0R%OgjbM= z0eA<%=RTjxjGxh{d6)>?5LIi^Y3quD4U{a0kCOK38!{7d56&d#y7b@t^2@G*%=12? zPd&G6WQ|ZLuzkpoO<(Qo9K6>B7i5_)B3K05P(r0-ZgsWg$syX=nIfPr0OSq;8Fvz? zehqsjC#f1Pz4WMZ?mmn5-2BweHbewrJI7ZmFa`nu zT3QlE0CdONdz_tPC7^HXb37OM%6%VM4{)t>R-&x%`MT!AIb=ejV!f8+<#5JyR4B5v zhztVY8l^rXsh{e}RW5DQ`eROZ0PF*LWy6vst;0yT;k4h_vZXS_=V;U%McU^QMDSE7 zRJJ(f`Iv{SC2@IvWuXJqdQ8P%h)CJK_3CQlN;oGBMdYUo7nWxixHDt{oHZv|nJA== zUR>g!h1>|x019}2Nxmc^cLJCNd(t@t$n>^8BN{cw!lCPnptZiMn4v;kN0}x23tB+DDHNV@2dO%H0Kh@QhS_=C^QzWzH#nPp1`}uroxd=9X z_+j9TGQ2g_u+5FvIe$B$U{@iTblmDg0i5LVA<*7_;I$6_03uit3T1<-`&24B)=r->8B23 zvBaqwqx@~d=(tL+C;1E5W0sKgNFmMqHqSpgI#M|{UFR1Ew*i1<(KGp(LM{&=7nbpnGIsoJV;IK!;`Vw0zHVC=}Q#Vh6qu!K7$39`-1! zq=%^_s$m;|x=5Z=Ute4B=MgNc`f}I=yI!HO5sJ(?ZBJ6_8yX6q-l?iG76I5|=j2vY zIDT4LmD||3!WWChXNX{`R=fki^l&)%SdrV%&~U^*VY3+Zv1?jOOX0vnx81g}Uo4jR z?uZd)t4L-o-bM24aM*uup?d&e)~q8x7m-IRn|FpM#NA>~?^^95A42bg7DUKL)rCkf)N#^n-n1+2o+5B{9jetTRZNB!UBxrKx^IVy#kqG?^Uu z=-j!ZJ;kYMZjO&P3^~cNFj)knV4GS819+peGd0BvCF}nHI#uMbL3!;I00000NkvXX Hu0mjfpS2qn literal 0 HcmV?d00001