-
Notifications
You must be signed in to change notification settings - Fork 9
Last hour fixes #91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Last hour fixes #91
Changes from all commits
a1e0591
93f7e4f
de26248
ebdb2eb
095516e
62356dd
cc9d2bc
d1abbdb
54ed980
fc9624d
1d5763b
93dea51
bbf7a86
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com | |||||
|
||||||
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | ||||||
|
||||||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | ||||||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another consequence of changing the example (see https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123) |
||||||
|
||||||
Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||||||
Similarly, the names arrays contain the same names and so all will be counted as a match. | ||||||
However, the same_order will *not* be true, so it will yield false. | ||||||
However, the same_order will *not* be true; it will yield false for all attributes. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is again incorrect. Here's an example: it does not yield false for all attributes. The lengths and sequences are in the same order. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so it will yield false only for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree EDIT: I agree, based on your original example. My idea was to change the example slightly (which also unfortunately makes it more complex) so that the line "This is the same output as a comparison of two sequence collections in different orders" becomes true (see comment below) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is another consequence of changing the example slightly |
||||||
|
||||||
This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
My main problem is with this line. In the example where everything is the same except for a name swap, then the output will NOT be the same as a comparison of two sequence collections in different order. As you mention above, in the main example, only the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes you are right. I think this may be an outdated explanation from before we split the That removes this particular limitation with the compare function. |
||||||
|
||||||
In this particular example, these results can be distinguished by the `sorted_name_length_pairs` attribute, because this would yield a perfect match for the second example, where all the pairs are intact but in a different order -- but it would NOT yield a match for the example with swapped names, because the name-length pairs would be different. | ||||||
|
||||||
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent array, similar to `sorted_name_length_pairs`, but including *all* inherent arrays for each element rather than just the names and lengths. As this would be automatically picked up by the comparison function, it would immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections. | ||||||
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent collated attribute, similar to `sorted_name_length_pairs`, but including *all* collated attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections. | ||||||
sveinugu marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This correctly is not correct. This example was not about different orders, just about a name swap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to change the example slightly, see: https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123