Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test suite? #74

Open
rob-p opened this issue Mar 29, 2024 · 15 comments
Open

Test suite? #74

rob-p opened this issue Mar 29, 2024 · 15 comments

Comments

@rob-p
Copy link

rob-p commented Mar 29, 2024

I’m trying to build a (minimal) seqcol implementation in rust here. Right now, the interface and API are as minimal as can be for what we need (building a seqcol digest from a SAM header for our long read quantification tool, oarfish). However, it may be generally useful to the community to have such a library in rust as that language continues to quickly gain adoption in bioinformatics.

to this end, is there some sort of “test suite” against which one can run an implementation to check compatibility with the reference implementation?

Thanks!
Rob

Cc @mikelove

@nsheff
Copy link
Member

nsheff commented Jan 8, 2025

Yes, I am working in on a test suite here: https://github.com/refgenie/refget?tab=readme-ov-file#compliance-testing-of-sequence-collections-api

But are you talking about a testing of the API? It seems more likely you may be looking for a test that just compares a digest for a collection, is that right?

Also, in other news; I've also written a rust helper library that will compute seqcol digests. I didn't (and am not planning to) implement the web API in rust, but if all you need is the digest computations in rust, then you might be able to use that... It will be available in the v0.2.0 release of gtars planned for later this month.

@sveinugu
Copy link
Collaborator

sveinugu commented Mar 20, 2025

@nsheff

I have tried to implement the seqcol calculations independently, and these are the results I get for the test examples at level 1:

{
  'base.fa': {
    'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG',
    'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
    'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
    'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq',
    'sorted_name_length_pairs': 'zjM1Ie9m0zFbqsAnZ6jAJSXuFpKTr40J'
  },
  'different_names.fa': {
    'names': 'lrCv6NNXom7AC9tKFWqhcLLZsrcgJIqq',
    'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
    'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
    'name_length_pairs': 'm88geMfgGBZ7VpYgQKCB4P9z-mhKJ-nj',
    'sorted_name_length_pairs': '1FQEGOQQ-m0NmZ0R-eeJEfnH1ayqJQ0T'
  },
  'different_order.fa': {
    'names': 'dOAOfPGkf3wAf3CUsbjVTKhY9Wq2DL6f',
    'lengths': 'x5qpE4FtMkvlwpKIzvHs3a02Nex5tthp',
    'sequences': '7t6Ulz6OeUWu6FBxntbvFKOl8w3icl2h',
    'name_length_pairs': 'a6JbVltjGqj5fEr01M0qjqCmlLLQ_P7N',
    'sorted_name_length_pairs': 'zjM1Ie9m0zFbqsAnZ6jAJSXuFpKTr40J'
  },
  'pair_swap.fa': {
    'names': 'gSWbV6khfIsnlQTyw1PmlQ8G7VRfIWbU',
    'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
    'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
    'name_length_pairs': 'yjUFKuKCURANxHar4JDF5ABOn6FJ-T8m',
    'sorted_name_length_pairs': 'rL5OQOnFba8yyz7lS-0-hgZvwcQsiajN'
  },
  'subset.fa': {
    'names': 'iyNUhtfR0TALytlmxK1Zx1_q3frkZyAd',
    'lengths': '7-_HdxYiRf-AJLBKOTaJUdxXrUkIXs6T',
    'sequences': '3ZP38SZcoc9wN7jsRyNSP9mQ1a3TUoUF',
    'name_length_pairs': 'b_TLfweI2gfClgj57gcTFtEOMJ_daWd4',
    'sorted_name_length_pairs': 'AvGYsdgtJpTLKYil3eLddJxwrE5OfKhE'
  },
  'swap_wo_coords.fa': {
    'names': 'QX5ur-faw5nXis8HXUK2kMxgY5MTGVRn',
    'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
    'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
    'name_length_pairs': 'suXpFjcxpyUDOkBgNEakNEXtLlyxtjJr',
    'sorted_name_length_pairs': 'zjM1Ie9m0zFbqsAnZ6jAJSXuFpKTr40J'
  }
}

Top level digests are fine:

{
  'base.fa': 'XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk',
  'different_names.fa': 'QvT5tAQ0B8Vkxd-qFftlzEk2QyfPtgOv',
  'different_order.fa': 'Tpdsg75D4GKCGEHtIiDSL9Zx-DSuX5V8',
  'pair_swap.fa': 'UNGAdNDmBbQbHihecPPFxwTydTcdFKxL',
  'subset.fa': 'sv7GIP1K0qcskIKF3iaBmQpaum21vH74',
  'swap_wo_coords.fa': 'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH'
}

But the sorted_name_length_pairs digests do not match the ones in the test_fasta_digests.json file:

[
  {
    "name": "base.fa",
    "digest": "XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk",
    "sorted_name_length_pairs_digest": "wwE4PUok50YyEF2Ne8BBA5__zk92CZH8"
  },
  {
    "name": "different_names.fa",
    "digest": "QvT5tAQ0B8Vkxd-qFftlzEk2QyfPtgOv",
    "sorted_name_length_pairs_digest": "BJx73ZGp4zfRKaf6i6j9cJH02tWW5PFb"
  },
  {
    "name": "different_order.fa",
    "digest": "Tpdsg75D4GKCGEHtIiDSL9Zx-DSuX5V8",
    "sorted_name_length_pairs_digest": "wwE4PUok50YyEF2Ne8BBA5__zk92CZH8"
  },
  {
    "name": "pair_swap.fa",
    "digest": "UNGAdNDmBbQbHihecPPFxwTydTcdFKxL",
    "sorted_name_length_pairs_digest": "2prbncALoH4cFy1Nhlu_e3all-VVuUZr"
  },
  {
    "name": "subset.fa",
    "digest": "sv7GIP1K0qcskIKF3iaBmQpaum21vH74",
    "sorted_name_length_pairs_digest": "NQw5zO172a6q9OHBGwOrZCWl1pzyx94s"
  },
  {
    "name": "swap_wo_coords.fa",
    "digest": "aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH",
    "sorted_name_length_pairs_digest": "wwE4PUok50YyEF2Ne8BBA5__zk92CZH8"
  }
]

Also, there are no name_length_pairs_digest to compare with.

  • Add digests for name_length_pairs_digest to the test suite

The issue might very well be on my side, as I have no other digests to compare with either. I did try to calculate the main example in the specification, but I also realized that the digests of the name_length_pairs and sorted_name_length_pairs have not been calculated for the main example, which would have been very useful as a simple check.

  • Calculated digests for name_length_pairs and sorted_name_length_pairs for the main example in the specification

@sveinugu
Copy link
Collaborator

Btw: I created the seqcols based on the first three columns in the .fa.checksums files, e.g. base.fa.checksums. I have not parsed the FASTA files myself, but I should probably also do that to double-check also that part.

@sveinugu
Copy link
Collaborator

Another TODO:

  • Add digests for sorted_sequences to the test suite

@nsheff
Copy link
Member

nsheff commented Mar 20, 2025

we're matching on all the level1 digests except sorted_name_length_pairs.

seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk", level=1)
{'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
 'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG',
 'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
 'sorted_sequences': 'KgWo6TT1Lqw6vgkXU9sYtCU9xwXoDt6M',
 'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq',
 'sorted_name_length_pairs': 'wwE4PUok50YyEF2Ne8BBA5__zk92CZH8'}

@mikelove
Copy link

Rob and I are starting to use seqcol digest (as implemented in Rust at the top of the issue) as part of oarfish -> tximeta workflow for provenance detection. We are going to keep using sha256 in addition, as that is the digest we use to identify the provenance for some years now, but hoping that the seqcol will provide forward compatability with APIs from Ensembl / GENCODE / RefSeq. Maybe in a few months we could demo the use case to the working group.

@sveinugu
Copy link
Collaborator

@mikelove A demo of this would be great once you have it working.

Sorry for spamming you with notifications. Decided to take over this issue with a tangential issue instead of creating a new one, but didn't consider the notifications.

@mikelove
Copy link

No worries, good for me to watch along.

@rob-p
Copy link
Author

rob-p commented Mar 20, 2025

@sveinugu : Can you point me at where the data are that you are using for you testing? I'd be interested in cross-checking our rust implementation as well, to ensure that everyone is getting the same answers before we really start relying on the seqcol digests downstream in tximeta.

@nsheff
Copy link
Member

nsheff commented Mar 20, 2025

See also:

@sveinugu
Copy link
Collaborator

sveinugu commented Mar 20, 2025

@nsheff I think the issue is here: https://github.com/refgenie/refget/blob/9feb67191a48c16e627140733938c8fa6c7886c8/refget/models.py#L203

Basically, it seems you have not reverted from the sorted_name_length_pairs algorithm with human-readable intermediates that we tested out and back to the digest-twice algorithm which is now in the specs. Canonicalizing and digesting the output of build_sorted_name_length_pairs() (line 199) gave the same output as I got when I tested manually using your library, but that line is now commented out.

@nsheff
Copy link
Member

nsheff commented Mar 20, 2025

Sounds right, I don't think I ever went back and changed that. I'll update that. great, we're almost there, then.

@sveinugu
Copy link
Collaborator

Great. I also added sorted_sequences and got the same digest as you.

@rob-p
Copy link
Author

rob-p commented Mar 22, 2025

Hi @nsheff & @sveinugu,

Thanks for the help! The separate implementation I've made lives in the seqcol_rs crate which now agrees with your digests on the test data, Sveinung. We use this library for generating the digests for oarfish, but it's not too much use on it's own as a library.

To that end, I've also made a crate rsrs (reference signatures in Rust) that is available here and also on crates.io. Basically, it just lets one create an appropriate JSON object (different level digests) from different kinds of input directly on the command line. It's a convenient way to generate the relevant digest information locally using the seqcol_rs library.

Let me know if you have any feedback, thoughts or suggestions on either of the above!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants