cram: Define bit stream value #812
Comments
I don't find that example particularly clear, as it mixes binary and hex together and doesn't record how many bits per byte are currently occupied. I found the same text in my ancient copy of the CRAM v1.1 spec, so it's likely never been rewritten.

Back then most CRAM data was multiplexed into the one CORE stream, with I think only QS and RN being "external". That's also where the mix of huffman, golomb, golomb_rice, beta, gamma and subexponential encodings came from: they couldn't write to specific blocks, so everything interleaved into CORE. That's a shame, as they're not necessarily a bad strategy for data separated into "external" blocks either, but that had to be done via ITF8 or similar instead. I wasn't involved in CRAM in that era, but I think the theory was that it could be indexed with one byte/bit offset per record to permit genuine read-level random access of the CORE bit stream.

I'm not convinced this section is even in the correct place. It may belong with the bit-packing encodings such as BETA and HUFFMAN, as everything else is byte-oriented, using fixed-size values such as uint32 or variable-length data via ITF8 and LTF8.

I don't quite understand what you were proposing. What do you think about splitting section 13 (encodings) into bit-based encodings (to the CORE block) and byte-based encodings (to external blocks), with the former then having an introductory paragraph describing the bit encoding order? I can't think of anywhere else that we write out bits, so it's quite specific to those encoding methods. I think.

Edit: although we do have "writing bits to a stream" followed by "writing bytes to a stream", and the latter is needed there, so maybe it's correct they're together. Hmm... I'll look at clarifying that example at least!
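As an aside, a rough sketch of the byte-oriented side for contrast: ITF8's prefix scheme tells a decoder up front how many bytes a value occupies, whereas the CORE bit stream has no such framing. This is a hypothetical helper, not htslib's actual code, and the 5-byte form is omitted for brevity.

```c
#include <stdint.h>

/* Hypothetical sketch of ITF8 decoding: the number of leading 1 bits in
 * the first byte gives the number of extra bytes that follow.  Returns
 * the number of bytes consumed, or -1 for the (omitted) 5-byte form. */
static int itf8_decode(const uint8_t *in, int32_t *val) {
    uint32_t b0 = in[0];
    if (b0 < 0x80) {              /* 0xxxxxxx : up to  7 bits, 1 byte  */
        *val = (int32_t)b0;
        return 1;
    } else if (b0 < 0xc0) {       /* 10xxxxxx : up to 14 bits, 2 bytes */
        *val = (int32_t)(((b0 << 8) | in[1]) & 0x3fff);
        return 2;
    } else if (b0 < 0xe0) {       /* 110xxxxx : up to 21 bits, 3 bytes */
        *val = (int32_t)(((b0 << 16) | ((uint32_t)in[1] << 8) | in[2]) & 0x1fffff);
        return 3;
    } else if (b0 < 0xf0) {       /* 1110xxxx : up to 28 bits, 4 bytes */
        *val = (int32_t)(((b0 << 24) | ((uint32_t)in[1] << 16) |
                          ((uint32_t)in[2] << 8) | in[3]) & 0x0fffffff);
        return 4;
    }
    return -1;                    /* 1111xxxx : 5-byte form            */
}
```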
I don't understand this either:
I can't think of anywhere where we store the fact that only 12 bits are meaningful. Rather I think it's just implicit in how we decode. The codecs, such as beta and gamma, inform the number of bits to use, or huffman does so explicitly via the known code lengths. Basically we're reading a bit at a time and interpreting it.
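So, as a concrete made-up illustration (a hypothetical bit reader, not htslib's actual code): suppose the CORE block holds the two bytes 1010 1011 1100 0000, and the data series was written as three 4-bit BETA values.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit reader, most-significant bit first within each byte. */
typedef struct {
    const uint8_t *data;
    size_t bit;                 /* bits consumed so far */
} bitstream;

static int get_bit(bitstream *bs) {
    int b = (bs->data[bs->bit >> 3] >> (7 - (bs->bit & 7))) & 1;
    bs->bit++;
    return b;
}

/* BETA with offset 0 and length 4: a fixed-width 4-bit value. */
static uint32_t get_beta4(bitstream *bs) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 1) | (uint32_t)get_bit(bs);
    return v;
}

int main(void) {
    const uint8_t core[2] = { 0xab, 0xc0 };   /* 1010 1011 1100 0000 */
    bitstream bs = { core, 0 };

    for (int i = 0; i < 3; i++)
        printf("value %d = %u\n", i, get_beta4(&bs));  /* 10, 11, 12 */

    printf("bits consumed = %zu of 16\n", bs.bit);     /* prints 12  */
    return 0;
}
```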
At that stage we've consumed 12 of the 16 bits, but we didn't need to explicitly know the last 4 bits are invalid. They're simply unused as the decoding has now finished.
Given it's working in bits, the example is much clearer if we describe the bits instead of hex, especially distinguishing bits set/unset from bits-yet-to-use. I also reworked the note at the end of the section as it was quite hard to follow. I gave it real examples of BETA and HUFFMAN to clarify what is meant by the decoder needing to know the number of bits to consume. Fixes samtools#812.
This is in regard to the CRAM format specification (version 3.1) (2024-09-04).
§ 2.2 "Writing bits to a bit stream" does not define what a value is. This is required in order to know how to interpret the bits, and the maximum number of bits that can be written to the stream from said value.
A single value type can be generally defined for the bit stream (e.g., `int32_t` >= 0), or the encodings that use the bit stream (e.g., beta) can describe their own appropriate value types.
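For what it's worth, a minimal sketch of what the first option could mean in practice, assuming the value is defined as a non-negative `int32_t` and bits are written most-significant first (a hypothetical helper, not proposed spec wording):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical writer: append the low `nbits` bits of a non-negative
 * int32_t to the bit stream, most-significant bit first.  Defining the
 * value as int32_t >= 0 bounds nbits at 31. */
static void put_bits(uint8_t *out, size_t *bitpos, int32_t value, int nbits) {
    for (int i = nbits - 1; i >= 0; i--, (*bitpos)++) {
        if ((value >> i) & 1)
            out[*bitpos >> 3] |= (uint8_t)(0x80 >> (*bitpos & 7));
    }
}

int main(void) {
    uint8_t core[8] = {0};
    size_t pos = 0;
    put_bits(core, &pos, 0x5, 3);  /* writes 101 */
    put_bits(core, &pos, 0x2, 2);  /* writes 10; core[0] is now 1011 0000, pos == 5 */
    return 0;
}
```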