
cram: Define bit stream value #812

Open
zaeleus opened this issue Feb 6, 2025 · 2 comments · May be fixed by #820
zaeleus commented Feb 6, 2025

This is in regard to the CRAM format specification (version 3.1) (2024-09-04).

§ 2.2 "Writing bits to a bit stream" does not define what a value is. This definition is required to know how to interpret the bits and the maximum number of bits that can be written to the stream from a given value.

Either a single value type can be defined for the bit stream as a whole (e.g., int32_t >= 0), or the encodings that use the bit stream (e.g., beta) can each describe their own appropriate value types.
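
To make the ambiguity concrete, here is a minimal sketch assuming the first option (a single non-negative int32_t value type for the whole stream). The bit_writer type and write_bits name are hypothetical, not from the spec or htslib:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch, not htslib: one way §2.2 could pin down "value".
 * If the value is a non-negative int32_t, at most 31 bits are meaningful,
 * and write_bits(w, v, n) emits the low n bits of v, most significant of
 * those n bits first.  Assumes buf is zero-initialised. */
typedef struct {
    uint8_t *buf;  /* output buffer */
    size_t pos;    /* index of the byte currently being filled */
    int used;      /* bits already occupied in buf[pos], 0..7 */
} bit_writer;

static void write_bits(bit_writer *w, int32_t value, int nbits) {
    for (int i = nbits - 1; i >= 0; i--) {
        int bit = (value >> i) & 1;
        w->buf[w->pos] |= (uint8_t)(bit << (7 - w->used));
        if (++w->used == 8) { w->used = 0; w->pos++; }
    }
}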


jkbonfield commented Mar 25, 2025

I don't find that example particularly clear, as it mixes binary and hex together and doesn't record how many bits per byte are currently occupied. I found the same text in my ancient copy of the CRAM v1.1 spec, so it has likely never been rewritten. Back then most CRAM data was multiplexed into the one CORE stream, with (I think) only QS and RN being "external". That's also where the mix of huffman, golomb, golomb_rice, beta, gamma, and subexponential encodings came from: they couldn't write to specific blocks and all interleaved into CORE. That's a shame, as they're not necessarily a bad strategy for separate "external" blocks either, but there the encoding had to be done via ITF8 or similar instead. I wasn't involved in CRAM in that era, but I think the theory was that the CORE bit stream could be indexed with one byte/bit offset per record to permit genuine read-level random access.

I'm not convinced this section is even in the correct place. It may belong with the bit-packing encodings such as BETA and HUFFMAN, as everything else is byte oriented, using fixed-size values such as uint32 or variable-length data via ITF8 and LTF8.
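
For contrast with the bit-oriented codecs, here is a hedged sketch of ITF8 decoding as the CRAM specification describes it (the number of leading 1-bits in the first byte gives the count of additional bytes). The function name and calling convention are illustrative, not htslib's API:

#include <stdint.h>

/* Illustrative ITF8 decoder: leading 1-bits in the first byte select
 * the number of extra bytes.  Not htslib's API. */
static int32_t itf8_decode(const uint8_t *p, int *consumed) {
    uint32_t b0 = p[0];
    if (b0 < 0x80) {            /* 0xxxxxxx: 7 bits */
        *consumed = 1;
        return (int32_t)b0;
    }
    if (b0 < 0xC0) {            /* 10xxxxxx: 14 bits */
        *consumed = 2;
        return (int32_t)(((b0 & 0x3F) << 8) | p[1]);
    }
    if (b0 < 0xE0) {            /* 110xxxxx: 21 bits */
        *consumed = 3;
        return (int32_t)(((b0 & 0x1F) << 16) | ((uint32_t)p[1] << 8) | p[2]);
    }
    if (b0 < 0xF0) {            /* 1110xxxx: 28 bits */
        *consumed = 4;
        return (int32_t)(((b0 & 0x0F) << 24) | ((uint32_t)p[1] << 16) |
                         ((uint32_t)p[2] << 8) | p[3]);
    }
    /* 1111xxxx: 32 bits; only the low 4 bits of the final byte are used */
    *consumed = 5;
    return (int32_t)(((b0 & 0x0F) << 28) | ((uint32_t)p[1] << 20) |
                     ((uint32_t)p[2] << 12) | ((uint32_t)p[3] << 4) |
                     (p[4] & 0x0F));
}

LTF8 is the analogous scheme extended to 64-bit values.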

I'm not quite sure what you were proposing. What do you think of splitting section 13 (encodings) into bit-based encodings (written to the CORE block) and byte-based encodings (written to external blocks), with the former having an introductory paragraph describing the bit encoding order? I cannot think of anywhere else that we write out bits, so it's quite specific to those encoding methods. I think.

Edit: although we do have "writing bits to a stream" followed by "writing bytes to a stream", and the latter is needed there too, so maybe it's correct that they're together. Hmm... I'll look at clarifying that example at least!

@jkbonfield

I don't understand this either:

And the whole bit sequence:

> echo "obase=2; ibase=16; B070" | bc
1011000001110000

When reading the bits from the bit sequence it must be known that only 12 bits are meaningful and the bit stream should not be read after that.

I can't think of anywhere where we store the fact that only 12 bits are meaningful. Rather, I think it's just implicit in how we decode.

The codecs, such as beta and gamma, determine the number of bits to use, and huffman does so explicitly via the known code lengths. Basically we're reading a bit at a time and interpreting it.

So

1011000001110000 Starting state
011000001110000 (fetched 1 bit: 1)
11000001110000 (fetched 1 bit: 0)
000001110000 (fetched 2 bits: 11)
0000 (fetched 8 bits: 00000111)

At that stage we've consumed 12 of the 16 bits, but we didn't need to explicitly know the last 4 bits are invalid. They're simply unused as the decoding has now finished.
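
A minimal MSB-first bit reader reproduces that trace; the names here are illustrative, not htslib's. Note that nothing records the 12-bit boundary: the caller simply stops once its codec has decoded everything, and the trailing bits are never read:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative MSB-first bit reader, not htslib's API. */
typedef struct {
    const uint8_t *buf;
    size_t bit;                     /* absolute bit offset from the start */
} bit_reader;

static uint32_t read_bits(bit_reader *r, int n) {
    uint32_t v = 0;
    while (n-- > 0) {
        uint8_t byte = r->buf[r->bit >> 3];
        v = (v << 1) | ((byte >> (7 - (r->bit & 7))) & 1);
        r->bit++;
    }
    return v;
}

int main(void) {
    const uint8_t core[] = { 0xB0, 0x70 };   /* 1011000001110000 */
    bit_reader r = { core, 0 };
    printf("%u\n", read_bits(&r, 1));        /* 1 */
    printf("%u\n", read_bits(&r, 1));        /* 0 */
    printf("%u\n", read_bits(&r, 2));        /* 3 (binary 11) */
    printf("%u\n", read_bits(&r, 8));        /* 7 (binary 00000111) */
    /* 12 of 16 bits consumed; the trailing 4 bits are simply never read. */
    return 0;
}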

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Mar 25, 2025
Given it's working in bits, the example is much clearer if we describe
the bits instead of hex, especially distinguishing bits set/unset from
bits-yet-to-use.

I also reworked the note at the end of the section as it was quite
hard to follow.  I gave it real examples of BETA and HUFFMAN to
clarify what is meant by the decoder needing to know the number of
bits to consume.

Fixes samtools#812.
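
As a concrete illustration of the point the commit message makes about BETA: the codec's parameters (offset and number of bits) come from the compression header, so a decoder always knows exactly how many bits to consume per value. A hedged sketch, reusing the illustrative bit_reader from the comment above:

/* BETA decoding sketch: read `length` bits as a big-endian integer and
 * subtract `offset`, both taken from the codec's parameters in the
 * compression header.  Illustrative only, not htslib's API. */
static int32_t beta_decode(bit_reader *r, int32_t offset, int length) {
    return (int32_t)read_bits(r, length) - offset;
}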
@jkbonfield jkbonfield moved this to New items in GA4GH File Formats Mar 25, 2025