Commit 3b80075

some cleanup on interpretations
1 parent 4d4c702 commit 3b80075

1 file changed

+18
-12
lines changed

docs/compare_collections.md

Lines changed: 18 additions & 12 deletions
@@ -8,41 +8,46 @@
88
## How to do it
99

1010
You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections.
11-
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples
11+
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples.
1212

1313
### Strict identity
1414

1515
Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order.
16-
For example, aligning against sequence collections that differ in any aspect (sequence name, order difference, etc) with bowtie2 will not necessarily produce the same output.
16+
For example, aligning with bowtie2 against sequence collections that differ in content, name, or order will not necessarily produce the same output.
1717
Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order.
1818

19-
This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint.
20-
**Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.**
19+
For this simple comparison, you don't need the `/comparison` endpoint -- you can simply compare the top-level digests.
20+
**Two collections will have the same digest if they are identical in content, names, and order for all `inherent` attributes.**
2121
Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
2222
If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and checking whether the two digests match.
2323

2424
### Order-relaxed identity
2525

26-
A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question:
26+
A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical sequence names and content, but differ in order".
27+
Relying on top-level digests will not work for this comparison, but you can answer this question using the `/comparison` return value:
2728

2829
Two collections meet the criteria for order-relaxed identity if:
2930

3031
1. the values of `elements.total.a` and `elements.total.b` match (the collections have the same number of elements).
3132
2. this value is the same as `elements.a_and_b.<attribute>` for all attributes (the content is the same)
3233
3. any entries in `elements.a_and_b-same-order.<attribute>` may be true (indicating the order matches) or false (indicating the order differs)
3334

34-
Then, we know the sequence collection content is identical, without controlling for order.
35+
Then, we know the sequence content and names are identical, though not necessarily in the same order.
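
As a rough illustration, the criteria listed above could be checked in Python as follows. This is only a sketch: it assumes the `/comparison` response has already been parsed into a dict whose nesting mirrors the key paths used in this section (`elements.total`, `elements.a_and_b`, `elements.a_and_b-same-order`). The function name and the sample response are hypothetical.

```python
def is_order_relaxed_identical(comparison: dict) -> bool:
    """Return True if two collections share all content and names, in any order."""
    elements = comparison["elements"]
    total_a = elements["total"]["a"]
    total_b = elements["total"]["b"]
    # Criterion 1: both collections have the same number of elements.
    if total_a != total_b:
        return False
    # Criterion 2: every attribute shares all of its elements.
    # Criterion 3 needs no check: the same-order flags may be true or false.
    return all(count == total_a for count in elements["a_and_b"].values())


# Hypothetical parsed response, for illustration only.
example = {
    "elements": {
        "total": {"a": 3, "b": 3},
        "a_and_b": {"lengths": 3, "names": 3, "sequences": 3},
        "a_and_b-same-order": {"lengths": False, "names": False, "sequences": False},
    }
}
print(is_order_relaxed_identical(example))  # True
```

The sketch does not verify that both collections carry the same set of attributes; a stricter check might do that first.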
3536

3637
### Name-relaxed identity
3738

3839
The results of some analyses (for example, a [`salmon` RNA pseudo-alignment](https://salmon.readthedocs.io/en/latest/salmon.html)) will be identical regardless of the chromosome names, as they consider only the digest of each sequence.
3940
Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ."
4041

41-
How to assess: As long as the `a_and_b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical
42+
There are two convenient ways to answer this question.
43+
First, you can use the attribute (level1) digest of the `sorted_sequences` attribute.
44+
If this digest matches, then you know you have identical sequence content, without controlling for names or sequence order.
45+
46+
Second, you can also answer this question using the `/comparison` function. As long as the `a_and_b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical.
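
The `/comparison`-based check might look like the following Python sketch. It assumes the response has been parsed into a dict nested as `elements.total` and `elements.a_and_b`, matching the key paths used in this document; the function name and sample data are hypothetical.

```python
def has_identical_sequence_content(comparison: dict) -> bool:
    """Return True if every sequence is shared, ignoring names and order."""
    elements = comparison["elements"]
    shared_sequences = elements["a_and_b"]["sequences"]
    # The shared count must equal the total for both collections.
    return shared_sequences == elements["total"]["a"] == elements["total"]["b"]


# Hypothetical response: all sequences are shared, but no names are.
example = {
    "elements": {
        "total": {"a": 3, "b": 3},
        "a_and_b": {"lengths": 3, "names": 0, "sequences": 3},
    }
}
print(has_identical_sequence_content(example))  # True
```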
4247

4348
### Length-only compatible (shared coordinate system)
4449

45-
A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ.
50+
A much looser type of compatibility holds between two sequence collections that have the same set of sequence lengths, though the sequences themselves may differ.
4651
In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.
4752

4853
How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a_and_b` match what we expect from `elements.total`.
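
A minimal Python sketch of this assessment, assuming the parsed `/comparison` response is a dict nested as `elements.total` and `elements.a_and_b`. The function name, the `require_names` parameter, and the sample data are hypothetical; the parameter reflects that name identity may or may not be required.

```python
def is_length_compatible(comparison: dict, require_names: bool = True) -> bool:
    """Return True if the collections share lengths (and optionally names)."""
    elements = comparison["elements"]
    total = elements["total"]
    if total["a"] != total["b"]:
        return False
    # The `sequences` attribute is deliberately ignored here.
    attributes = ["lengths", "names"] if require_names else ["lengths"]
    return all(elements["a_and_b"][attr] == total["a"] for attr in attributes)


# Hypothetical response: same names and lengths, entirely different sequences.
example = {
    "elements": {
        "total": {"a": 3, "b": 3},
        "a_and_b": {"lengths": 3, "names": 3, "sequences": 0},
    }
}
print(is_length_compatible(example))  # True
```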
@@ -54,13 +59,14 @@ If however, their coordinate system matches but is not in the same order, then w
5459
There are also probably other types of compatibility you can assess using the result of the `/comparison` function.
5560
Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case.
5661

57-
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays
62+
## Complex cases: distinguishing out-of-order from mismatched arrays
5863

59-
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
64+
One challenge is to identify issues where the *set* of sequences and names both match, but some of the pairs have been swapped.
65+
For example, what would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
6066

6167
Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
62-
Similarly, the names arrays contain the same names and so all will be counted as a match.
63-
However, the same_order will *not* be true; it will yield false for all attributes.
68+
Similarly, the names arrays contain the same names, and so all will be counted as a match.
69+
However, the same_order will *not* be true; it will yield false for some of the attributes.
6470

6571
This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing.
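
To make this concrete, here is a small Python sketch of an array-level comparison. It is a simplified stand-in written for illustration, not the actual implementation behind `/comparison`: it assumes arrays without duplicate elements, and all function and variable names are hypothetical.

```python
def compare_arrays(a: list, b: list) -> dict:
    """Simplified array-level comparison of a single attribute."""
    shared = set(a) & set(b)
    a_shared = [x for x in a if x in shared]
    b_shared = [x for x in b if x in shared]
    return {"a_and_b": len(shared), "same_order": a_shared == b_shared}


def compare_collections(a: dict, b: dict) -> dict:
    """Compare each attribute array independently of the others."""
    return {attr: compare_arrays(a[attr], b[attr]) for attr in a if attr in b}


reference = {"names": ["chr1", "chr2", "chr3"], "sequences": ["seqA", "seqB", "seqC"]}
# Same name-sequence pairs as the reference, in a different order.
reordered = {"names": ["chr3", "chr1", "chr2"], "sequences": ["seqC", "seqA", "seqB"]}
# Different order AND chr1/chr2 now label each other's sequences.
reordered_swapped = {"names": ["chr3", "chr2", "chr1"], "sequences": ["seqC", "seqA", "seqB"]}

plain = compare_collections(reference, reordered)
swapped = compare_collections(reference, reordered_swapped)
print(plain == swapped)  # True
```

Because each attribute array is compared on its own, the pairing between a name and its sequence never enters the result, so the two cases cannot be told apart.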
6672
