Skip to content

Clarify the name tokeniser uncomp_len calculation (PR #803)#803

Merged
jkbonfield merged 1 commit intosamtools:masterfrom
jkbonfield:name-tok-size
Mar 24, 2025
Merged

Clarify the name tokeniser uncomp_len calculation (PR #803)#803
jkbonfield merged 1 commit intosamtools:masterfrom
jkbonfield:name-tok-size

Conversation

@jkbonfield
Copy link
Contributor

This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0').

Fixes #802

@github-actions
Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of bece1f7: CRAMcodecs (diff).

@cmnbroad
Copy link
Contributor

cmnbroad commented Jan 7, 2025

Looks good - thanks!

@jkbonfield jkbonfield added sam cram and removed sam labels Jan 7, 2025
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
@github-actions
Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of 4982e03: CRAMcodecs (diff).

CRAMcodecs.tex Outdated
Comment on lines 2455 to 2456
the number of read names. This is followed the array elements
themselves. Note the uncompressed size is calculated as the sum of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialised data stream starts with two unsigned little endian 32-bit integers... This is followed the array elements themselves.

This is unrelated to the length calculation, but note there is also a 1 byte flag between the 2 integers and the data stream:

Bytes Type Name
4 uint32 uncomp_length
4 uint32 num_reads
1 uint8 use_arith

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It also had a poor and nebulous "array elements" term which I expanded on.

jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jan 28, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with a more descriptive text about token streams.
@github-actions
Copy link

Changed PDFs as of 7de0ae0: CRAMcodecs (diff).

This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with a more descriptive text about token streams.
@jkbonfield jkbonfield merged commit 7b61997 into samtools:master Mar 24, 2025
1 check passed
@github-project-automation github-project-automation bot moved this from Needs review to Done in GA4GH File Formats Mar 24, 2025
@github-actions
Copy link

Changed PDFs as of 7b61997: CRAMcodecs (diff).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Development

Successfully merging this pull request may close these issues.

Name tokenizer codec clarification

3 participants