Commit 7feea69

update comparison explanation
1 parent 6009966 commit 7feea69

1 file changed

+12
-22
lines changed

docs/seqcols/compare_collections.md

Lines changed: 12 additions & 22 deletions
@@ -2,24 +2,30 @@
22

33
## Use case
44

5-
- You have a local sequence collection, and a digest for a collection in a server. You want to compare the two to see if they have the same coordinate system.
65
- You have two digests for collections you know are stored by a server. You want to compare them.
6+
- You have a digest for a collection from a server, and a local sequence collection. You want to compare the two to see if they have the same coordinate system.
77

88
## How to do it
99

1010
You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections.
11-
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples.
11+
You can also `POST` a local collection to `/comparison/:digest1` to compare it to a single remote collection.
12+
The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret.
13+
Here are some examples.
14+
15+
The best way is to use the Refget [SeqCol Comparison Interpretation Module (SCIM)](https://refget.databio.org/scim/).
16+
You paste in the JSON output of a comparison, and it provides an interpretation for you.
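
The two request forms described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the base URL is a placeholder, the helper name `build_comparison_request` is hypothetical, and the assumption that the local collection is sent as a JSON object of attribute arrays should be checked against the server you are using.

```python
import json
import urllib.request
from urllib.parse import quote


def build_comparison_request(base_url, digest1, digest2=None, local_collection=None):
    """Build (but do not send) a request for the /comparison endpoint.

    Hypothetical helper: pass `digest2` to compare two remote collections,
    or `local_collection` to compare a local collection with a remote one.
    """
    base = base_url.rstrip("/")
    first = quote(digest1, safe="")
    if digest2 is not None:
        # Two remote collections: GET /comparison/:digest1/:digest2
        url = f"{base}/comparison/{first}/{quote(digest2, safe='')}"
        return urllib.request.Request(url, method="GET")
    if local_collection is None:
        raise ValueError("provide either digest2 or local_collection")
    # One remote and one local collection: POST to /comparison/:digest1.
    # Assumption: the body is the local collection serialized as JSON.
    body = json.dumps(local_collection).encode("utf-8")
    return urllib.request.Request(
        f"{base}/comparison/{first}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send a request, pass the result to `urllib.request.urlopen` and parse the response with `json.load`. Building the request separately from sending it makes the URL and body easy to inspect before anything goes over the network.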
17+
18+
## Interpretation details
1219

1320
### Strict identity
1421

1522
Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order.
1623
For example, aligning with bowtie2 against sequence collections that differ in either content, name, or order will not necessarily produce the same output.
1724
Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order.
1825

19-
For this simple comparison, you don't need the `/comparison` endpoint -- it can easily be done by simply comparing the top-level digest.
26+
For this simple comparison, you don't need the `/comparison` endpoint -- just compare the top-level digests.
2027
**Two collections will have the same digest if they are identical in content, names, and order for all `inherent` attributes.**
2128
Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
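
When one of the two collections is local, a strict-identity check amounts to computing its top-level digest and comparing strings. The sketch below illustrates the idea under stated assumptions: `sha512t24u` follows the GA4GH truncated SHA-512 digest (base64url of the first 24 bytes), `canonical_json` is a simplified stand-in for full JSON canonicalization, and the list of `inherent` attributes is passed in by the caller because it depends on the schema the server uses. All function names are hypothetical, so a digest computed this way should be validated against a known collection before being relied on.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Base64url encoding of the first 24 bytes of the SHA-512 digest."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(obj) -> bytes:
    # Simplified canonicalization (sorted keys, no whitespace). This is an
    # assumption; a strict implementation would follow the server's
    # canonicalization scheme exactly.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def top_level_digest(collection: dict, inherent) -> str:
    """Digest each inherent attribute array, then digest the result."""
    level1 = {attr: sha512t24u(canonical_json(collection[attr])) for attr in inherent}
    return sha512t24u(canonical_json(level1))


def strictly_identical(local_collection: dict, remote_digest: str, inherent) -> bool:
    """True if the local collection's top-level digest equals the remote digest."""
    return top_level_digest(local_collection, inherent) == remote_digest
```

Because each array is digested in order, reordering the elements of any inherent attribute changes the top-level digest, which is why matching digests imply identical content, names, and order.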
22-
If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
2329

2430
### Order-relaxed identity
2531

@@ -54,23 +60,7 @@ How to assess: We will ignore the `sequences` attribute, but ensure that the `na
5460
If `a_and_b_same_order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system.
5561
If, however, their coordinate systems match but are not in the same order, then we need to look at the `sorted_name_length_pairs` attribute. If the `a_and_b` entry for `sorted_name_length_pairs` equals the `a_and_b` entries for `names` and `lengths`, then these collections share an (unordered) coordinate system.
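
The assessment above can be expressed as a small function over the parsed comparison JSON. This is a sketch under assumptions: it expects an `array_elements` object with `a`, `b`, `a_and_b`, and `a_and_b_same_order` entries keyed by attribute name, which may not match the exact keys your server returns, and it treats a full overlap of `names` and `lengths` as the precondition for any match. The function name and return labels are hypothetical.

```python
def coordinate_system_match(comparison: dict) -> str:
    """Classify a comparison result as 'ordered', 'unordered', or 'different'.

    Assumes comparison["array_elements"] contains the keys "a", "b",
    "a_and_b", and "a_and_b_same_order" (an assumption about the layout).
    """
    elems = comparison["array_elements"]
    shared = elems["a_and_b"]
    same_order = elems["a_and_b_same_order"]

    # The sequences attribute is ignored; names and lengths must overlap fully.
    for attr in ("names", "lengths"):
        if not (elems["a"][attr] == elems["b"][attr] == shared[attr]):
            return "different"

    # Same names and lengths in the same order: ordered coordinate system.
    if same_order["names"] and same_order["lengths"]:
        return "ordered"

    # Otherwise the name-length pairs must all be shared.
    pairs = shared.get("sorted_name_length_pairs")
    if pairs is not None and pairs == shared["names"] == shared["lengths"]:
        return "unordered"
    return "different"
```

Returning a label rather than a boolean keeps the ordered and unordered cases distinct, since some downstream tools need the order to match and others do not.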
5662

57-
### Others...
58-
59-
There are also probably other types of compatibility you can assess using the result of the `/comparison` function.
60-
Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case.
61-
62-
## Complex cases: distinguishing out-of-order from mismatched arrays
63-
64-
One challenge is to identify issues where the *set* of sequences and names both match, but some of the pairs have been swapped.
65-
For example, what would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
66-
67-
Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
68-
Similarly, the names array contains the same names, and so all will be counted as a match.
69-
However, the same_order will *not* be true; it will yield false for some of the attributes.
70-
71-
This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing.
72-
73-
In this particular example, these results can be distinguished by the `sorted_name_length_pairs` attribute, because this would yield a perfect match for the second example, where all the pairs are intact but in a different order -- but it would NOT yield a match for the example with swapped names, because the name-length pairs would be different.
63+
## Complex cases
7464

75-
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent collated attribute, similar to `sorted_name_length_pairs`, but including *all* collated attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections.
65+
For more complex cases, the comparison function and the level1 digests can sometimes be used to figure out what is going on, but they are limited by design -- for situations that are more complex than these methods can handle, it is always possible to look deeper at the contents of the sequence collection and compare them directly.
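
One way to look deeper is to compare the collections element by element, treating each sequence and its annotations as a unit. The sketch below assumes both collections are available locally as dictionaries of equal-length arrays; the function name `compare_as_units` and the default attribute list are illustrative choices, not part of the specification.

```python
def compare_as_units(a: dict, b: dict, attrs=("names", "lengths", "sequences")) -> dict:
    """Compare two collections by zipping their arrays into per-element tuples."""
    units_a = list(zip(*(a[k] for k in attrs)))
    units_b = list(zip(*(b[k] for k in attrs)))
    set_a, set_b = set(units_a), set(units_b)
    return {
        "a_only": sorted(set_a - set_b),   # units found only in collection a
        "b_only": sorted(set_b - set_a),   # units found only in collection b
        "same_units": set_a == set_b,      # identical elements, order ignored
        "same_order": units_a == units_b,  # identical elements in identical order
    }
```

A pure reordering gives `same_units` true and `same_order` false, while two sequences with swapped names give `same_units` false, with the mismatched elements listed in `a_only` and `b_only`.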
7666
