-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Name tokenizer codec clarification #802
Labels
Comments
Thanks for this. That's a good point and it's worth being explicit given in CRAM they could have been encoded with either a BYTE_ARRAY_LEN having an explicit second data series to mark the ends of each read name, or with a BYTE_ARRAY_STOP where an inline termination symbol is used (like C-strings). I'll double check the code and make a PR to clarify accordingly. |
jkbonfield
added a commit
to jkbonfield/hts-specs
that referenced
this issue
Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802
jkbonfield
added a commit
to jkbonfield/hts-specs
that referenced
this issue
Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I'm trying to finish up Yash's CRAM 3.1 codecs for htsjdk, and I came across an ambiguity in the name tokenizer spec that could use clarification. It currently states "The serialised data stream starts with two unsigned little endian 32-bit integers holding the total size of uncompressed name buffer and the number of read names".
Based on the values I see when inspecting samtools CRAMs/code, the uncompressed name buffer size assumes that the uncompressed names are formatted into a buffer the way htslib formats them, i.e., on decode, it would match the length of the read name data that is reconstructed from the stream(s), PLUS one byte for a separator for each read name, including a terminal separator. I don't think that's stated anywhere in the spec, so if correct, it would be useful to state explicitly.
The text was updated successfully, but these errors were encountered: