docs/compare_collections.md
Lines changed: 18 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -8,41 +8,46 @@
8
8
## How to do it
9
9
10
10
You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections.
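
As a rough illustration of calling this endpoint, here is a minimal Python sketch. The base URL is a placeholder, and the helper names `comparison_url` and `fetch_comparison` are made up for this example; they are not part of any seqcol client library.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical server address; substitute the seqcol API you actually use.
BASE_URL = "https://seqcolapi.example.org"


def comparison_url(digest1: str, digest2: str, base_url: str = BASE_URL) -> str:
    """Build the /comparison/:digest1/:digest2 URL for two collection digests."""
    d1 = quote(digest1, safe="")
    d2 = quote(digest2, safe="")
    return f"{base_url.rstrip('/')}/comparison/{d1}/{d2}"


def fetch_comparison(digest1: str, digest2: str, base_url: str = BASE_URL) -> dict:
    """Request the comparison and decode the JSON body (assumes a JSON response)."""
    with urlopen(comparison_url(digest1, digest2, base_url)) as response:
        return json.load(response)
```

Only `comparison_url` can be exercised without a running server; `fetch_comparison` assumes the server answers a plain GET request with JSON.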
11
-
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples
11
+
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples.
12
12
13
13
### Strict identity
14
14
15
15
Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order.
16
-
For example, aligning against sequence collections that differ in any aspect (sequence name, order difference, etc) with bowtie2 will not necessarily produce the same output.
16
+
For example, aligning with bowtie2 against sequence collections that differ in content, name, or order will not necessarily produce the same output.
17
17
Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order.
18
18
19
-
This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint.
20
-
**Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.**
19
+
For this simple comparison, you don't need the `/comparison` endpoint -- you can just compare the top-level digests.
20
+
**Two collections will have the same digest if they are identical in content, names, and order for all `inherent` attributes.**
21
21
Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
22
22
If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
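
The local check can be sketched as follows. This is an approximation under stated assumptions, not a reference implementation: it assumes digests are the base64url-encoded first 24 bytes of a SHA-512 hash, that each inherent attribute is digested separately before the top-level digest is computed, and that `json.dumps` with sorted keys and no whitespace is close enough to the canonical JSON form. The function names and the default list of inherent attributes are chosen for this example, so the output may not match a real server's digests in every case.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Base64url-encode the first 24 bytes of the SHA-512 hash of `data`."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    """Approximate canonical JSON: sorted keys, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def top_level_digest(collection: dict, inherent=("lengths", "names", "sequences")) -> str:
    """Digest each inherent attribute array, then digest the object of those digests."""
    level1 = {attr: sha512t24u(canonical(collection[attr])) for attr in inherent}
    return sha512t24u(canonical(level1))


def strictly_identical(local_collection: dict, remote_digest: str) -> bool:
    """Strict identity: the locally computed digest equals the given digest."""
    return top_level_digest(local_collection) == remote_digest
```

Because the arrays are digested in order, reordering the sequences changes the result, which is what makes digest equality a test of strict identity.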
23
23
24
24
### Order-relaxed identity
25
25
26
-
A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question:
26
+
A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical sequence names and content, but differ in order".
27
+
Relying on top-level digests will not work for this comparison, but you can answer this question using the `/comparison` return value:
27
28
28
29
Two collections meet the criteria for order-relaxed identity if:
29
30
30
31
1. the values of `elements.total.a` and `elements.total.b` match (the collections have the same number of elements).
31
32
2. this value is the same as `elements.a_and_b.<attribute>` for all attributes (the content is the same).
32
33
3. any entries in `elements.a_and_b-same-order.<attribute>` may be true (indicating the order matches) or false (indicating the order differs)
33
34
34
-
Then, we know the sequence collection content is identical, without controlling for order.
35
+
Then, we know the sequence content and names are identical, whether or not they are in the same order.
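
The three criteria above can be expressed as a small check on the parsed `/comparison` result. This is a sketch that assumes the result is a dictionary shaped like the keys named in the list (`elements.total`, `elements.a_and_b`, `elements.a_and_b-same-order`); the function names are hypothetical.

```python
def is_order_relaxed_identical(comparison: dict) -> bool:
    """True if both collections hold the same elements for every attribute."""
    elements = comparison["elements"]
    total = elements["total"]
    # Criterion 1: same number of elements in both collections.
    if total["a"] != total["b"]:
        return False
    # Criterion 2: every attribute shares all of its elements.
    # Criterion 3 places no constraint on order, so it is not checked here.
    return all(count == total["a"] for count in elements["a_and_b"].values())


def attributes_out_of_order(comparison: dict) -> list:
    """List the attributes whose shared elements are reported in a different order."""
    same_order = comparison["elements"]["a_and_b-same-order"]
    return sorted(attr for attr, flag in same_order.items() if flag is False)
```

If `is_order_relaxed_identical` returns true and `attributes_out_of_order` is non-empty, the collections match in content and names but differ in order.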
35
36
36
37
### Name-relaxed identity
37
38
38
39
Some analyses (for example, a [`salmon` RNA pseudo-alignment](https://salmon.readthedocs.io/en/latest/salmon.html)) will produce identical results regardless of the chromosome names, as they consider only the digests of the sequences.
39
40
Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ."
40
41
41
-
How to assess: As long as the `a_and_b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical
42
+
There are two convenient ways to answer this question.
43
+
First, you can use the attribute-level (level 1) digest of the `sorted_sequences` attribute.
44
+
If this digest matches, then you know you have identical sequence content, without controlling for names or sequence order.
45
+
46
+
Second, you can also answer this question using the `/comparison` function. As long as the `a_and_b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical.
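
Both checks can be sketched in a few lines. The first function assumes you already hold each collection's attribute-level digests as a dictionary keyed by attribute name; the second assumes the `/comparison` result is a dictionary with the `elements.total` and `elements.a_and_b` keys described above. Both function names are made up for illustration.

```python
def same_sorted_sequences(level1_a: dict, level1_b: dict) -> bool:
    """Compare the attribute-level digests of `sorted_sequences`."""
    return level1_a["sorted_sequences"] == level1_b["sorted_sequences"]


def has_identical_sequence_content(comparison: dict) -> bool:
    """True if every sequence in either collection is also in the other."""
    elements = comparison["elements"]
    shared = elements["a_and_b"]["sequences"]
    return shared == elements["total"]["a"] == elements["total"]["b"]
```

Neither check looks at `names` or at any order flag, which is what makes this comparison name-relaxed.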
A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ.
50
+
A much looser type of compatibility is two sequence collections that have the same set of sequence lengths, though the sequences themselves may differ.
46
51
In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.
47
52
48
53
How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a_and_b` match what we expect from `elements.total`.
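
A sketch of this assessment, under the same assumption that the `/comparison` result is a dictionary with `elements.total` and `elements.a_and_b` keys. The function name and the `require_names` switch are hypothetical; the switch reflects that name identity may or may not be required.

```python
def shares_coordinate_system(comparison: dict, require_names: bool = True) -> bool:
    """Check lengths (and optionally names) while ignoring `sequences`."""
    elements = comparison["elements"]
    total = elements["total"]
    if total["a"] != total["b"]:
        return False
    attributes = ["lengths", "names"] if require_names else ["lengths"]
    return all(elements["a_and_b"][attr] == total["a"] for attr in attributes)
```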
@@ -54,13 +59,14 @@ If however, their coordinate system matches but is not in the same order, then w
54
59
There are also probably other types of compatibility you can assess using the result of the `/comparison` function.
55
60
Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case.
56
61
57
-
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays
62
+
## Complex cases: distinguishing out-of-order from mismatched arrays
58
63
59
-
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
64
+
One challenge is to identify issues where the *set* of sequences and names both match, but some of the pairs have been swapped.
65
+
For example, what would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
60
66
61
67
Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
62
-
Similarly, the names arrays contain the same names and so all will be counted as a match.
63
-
However, the same_order will *not* be true; it will yield false for all attributes.
68
+
Similarly, the names arrays contain the same names, and so all will be counted as a match.
69
+
However, the `a_and_b-same-order` value will *not* be true for every attribute; it will be false for some of them.
64
70
65
71
This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing.
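
To make the limitation concrete, here is a toy Python simulation. `compare_arrays` is a simplified local stand-in written for this example, not the server's comparison function: it counts shared elements and reports whether they appear in the same order. The collections and sequence labels are invented.

```python
from collections import Counter


def compare_arrays(a: list, b: list) -> dict:
    """Array-level summary: shared element count and whether order matches."""
    shared = sum((Counter(a) & Counter(b)).values())
    in_a, in_b = set(a), set(b)
    same_order = [x for x in a if x in in_b] == [x for x in b if x in in_a]
    return {"a_and_b": shared, "same_order": same_order}


def compare_collections(a: dict, b: dict) -> dict:
    """Apply the array-level comparison to each attribute independently."""
    return {attr: compare_arrays(a[attr], b[attr]) for attr in ("names", "sequences")}


def same_pairs(a: dict, b: dict) -> bool:
    """Element-level check: do the same (name, sequence) pairs occur in both?"""
    return set(zip(a["names"], a["sequences"])) == set(zip(b["names"], b["sequences"]))


reference = {"names": ["chr1", "chr2", "chr3"], "sequences": ["s1", "s2", "s3"]}
# Same pairs, different order.
reordered = {"names": ["chr3", "chr1", "chr2"], "sequences": ["s3", "s1", "s2"]}
# Different order AND chr1/chr2 have swapped sequences.
swapped = {"names": ["chr3", "chr1", "chr2"], "sequences": ["s3", "s2", "s1"]}

# The array-level summaries cannot tell the two cases apart.
print(compare_collections(reference, reordered) == compare_collections(reference, swapped))  # prints True
```

Telling the two cases apart requires looking at name and sequence pairs together, as `same_pairs` does, which is only possible when you have the underlying arrays for both collections.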