Name tokenizer codec clarification #802

@cmnbroad

I'm trying to finish up Yash's CRAM 3.1 codecs for htsjdk, and I came across an ambiguity in the name tokenizer spec that could use clarification. It currently states "The serialised data stream starts with two unsigned little endian 32-bit integers holding the total size of uncompressed name buffer and the number of read names".

Based on the values I see when inspecting samtools CRAMs and code, the uncompressed name buffer size assumes that the names are laid out in a buffer the way htslib formats them. That is, on decode it equals the total length of the read name data reconstructed from the stream(s), plus one separator byte per read name, including a terminal separator after the last name. I don't think that's stated anywhere in the spec, so if this is correct, it would be useful to state it explicitly.
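To make the interpretation concrete, here is a minimal Python sketch of how the two header fields would be computed and read back under that reading. The function names (`name_buffer_size`, `pack_header`, `unpack_header`) are hypothetical and do not come from htslib or htsjdk, and the one-byte-per-name separator is the behaviour I observed, not something the spec confirms.

```python
import struct


def name_buffer_size(names, separator_len=1):
    """Total uncompressed size under the observed interpretation:
    each name's bytes plus one separator byte per name, which
    includes a separator after the final name (assumption)."""
    return sum(len(name) + separator_len for name in names)


def pack_header(names):
    """Build the 8-byte header: two unsigned little-endian 32-bit
    integers holding the buffer size and the number of read names."""
    encoded = [name.encode("ascii") for name in names]
    return struct.pack("<II", name_buffer_size(encoded), len(encoded))


def unpack_header(data):
    """Read (uncompressed_size, num_names) from the start of a stream."""
    return struct.unpack_from("<II", data, 0)


names = ["read1", "read2", "r3"]
header = pack_header(names)
uncompressed_size, num_names = unpack_header(header)
# Under this reading, the size exceeds the raw name bytes by exactly
# one byte per name.
raw_name_bytes = sum(len(n) for n in names)
```

If the spec instead meant only the raw name bytes, the first field would be smaller by exactly the number of names, which is the discrepancy a decoder would need to account for when sizing its output buffer.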
