Name tokenizer codec clarification #802

cmnbroad · 2024-12-10T23:07:30Z

I'm trying to finish up Yash's CRAM 3.1 codecs for htsjdk, and I came across an ambiguity in the name tokenizer spec that could use clarification. It currently states "The serialised data stream starts with two unsigned little endian 32-bit integers holding the total size of uncompressed name buffer and the number of read names".
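For illustration, reading the two header fields quoted above might look like the following minimal sketch. The function name `read_name_tokeniser_header` and the sample values are hypothetical and are not taken from htsjdk or htslib; the sketch only assumes what the quoted sentence states, namely two unsigned little-endian 32-bit integers at the start of the stream.

```python
import struct


def read_name_tokeniser_header(stream: bytes) -> tuple[int, int]:
    """Return (uncomp_len, num_names) read from the first 8 bytes of the stream.

    Hypothetical helper: it assumes only the layout quoted from the spec,
    i.e. two unsigned little-endian 32-bit integers.
    """
    if len(stream) < 8:
        raise ValueError("stream too short for the 8-byte header")
    # "<II" means little-endian, two unsigned 32-bit integers.
    uncomp_len, num_names = struct.unpack_from("<II", stream, 0)
    return uncomp_len, num_names


# Made-up header: 21 uncompressed bytes, 2 read names.
example_header = struct.pack("<II", 21, 2)
print(read_name_tokeniser_header(example_header))
```

Any bytes after the first eight are ignored here; the rest of the serialised stream is outside the scope of this sketch.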

Based on the values I see when inspecting samtools CRAMs and code, the uncompressed name buffer size assumes that the uncompressed names are laid out in a buffer the way htslib formats them. That is, on decode it equals the length of the read name data reconstructed from the stream(s), plus one separator byte per read name, including a terminal separator after the last name. I don't think that's stated anywhere in the spec, so if it's correct, it would be useful to state it explicitly.
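As a concrete reading of that interpretation, the size can be sketched as the sum of the name lengths plus one byte per name. The function name `uncompressed_name_buffer_size` and the sample names are hypothetical; this is a sketch of the arithmetic described above, not code from htslib or htsjdk.

```python
def uncompressed_name_buffer_size(names: list[bytes]) -> int:
    """Size of the name buffer under the interpretation described above.

    Each name contributes its visible bytes plus one separator byte,
    and the last name gets a separator too.
    """
    return sum(len(name) + 1 for name in names)


# Made-up read names of 7, 7 and 2 bytes.
names = [b"read1/1", b"read1/2", b"r3"]
# (7 + 1) + (7 + 1) + (2 + 1) = 19
print(uncompressed_name_buffer_size(names))
```

Under this reading, the recorded size always exceeds the total visible name length by exactly the number of read names.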

jkbonfield · 2025-01-07T09:35:50Z

Thanks for this. That's a good point, and it's worth being explicit, given that in CRAM the names could have been encoded either with BYTE_ARRAY_LEN, which uses an explicit second data series to mark the end of each read name, or with BYTE_ARRAY_STOP, which uses an inline termination symbol (like C strings).

I'll double check the code and make a PR to clarify accordingly.
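To show why the two encodings mentioned above lead to different byte counts, here is a minimal sketch of the two layouts. The helper names `layout_with_lengths` and `layout_with_stop` are hypothetical and do not implement the actual CRAM BYTE_ARRAY_LEN or BYTE_ARRAY_STOP encodings; they only mimic where the name boundaries are stored.

```python
def layout_with_lengths(names: list[bytes]) -> tuple[bytes, list[int]]:
    """Names concatenated back to back, with boundaries kept in a separate series."""
    return b"".join(names), [len(name) for name in names]


def layout_with_stop(names: list[bytes], stop: bytes = b"\0") -> bytes:
    """Names concatenated with an inline terminator after each one, like C strings."""
    return b"".join(name + stop for name in names)


# Made-up read names of 2 and 5 bytes.
names = [b"r1", b"read2"]
data, lengths = layout_with_lengths(names)
terminated = layout_with_stop(names)

# The inline-terminator layout is larger by one byte per name.
print(len(data), len(terminated), len(terminated) - len(data) == len(names))
```

The one-byte-per-name difference between the two buffers is the quantity the size field has to be unambiguous about.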


This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802. Also clarify the name tokeniser serialisation description. Acknowledge the 1-byte "use_arith" field and replace the nebulous "array elements" with more descriptive text about token streams.

cmnbroad added the cram label Dec 10, 2024

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jan 7, 2025

Clarify the name tokeniser uncomp_len calculation (PR samtools#803)

bece1f7

This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802

jkbonfield linked a pull request Jan 7, 2025 that will close this issue

Clarify the name tokeniser uncomp_len calculation (PR #803) #803

Open

jkbonfield added the sam label Jan 7, 2025

jkbonfield removed this from GA4GH File Formats Jan 7, 2025

jkbonfield moved this to New items in GA4GH File Formats Jan 7, 2025

jkbonfield added this to GA4GH File Formats Jan 7, 2025

jkbonfield removed the sam label Jan 7, 2025

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jan 7, 2025

Clarify the name tokeniser uncomp_len calculation (PR samtools#803)

4982e03

This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802
