Commit 7cb8ea0

Rework the CRAM bit encoding example and clarify text
Given it's working in bits, the example is much clearer if we describe the bits instead of hex, especially distinguishing bits set/unset from bits-yet-to-use. I also reworked the note at the end of the section as it was quite hard to follow. I gave it real examples of BETA and HUFFMAN to clarify what is meant by the decoder needing to know the number of bits to consume. Fixes samtools#812.
1 parent 7b61997 commit 7cb8ea0

CRAMv3.tex

Lines changed: 27 additions & 24 deletions
@@ -124,13 +124,13 @@ \subsection{\textbf{Logical data types}}
124124
% \textbf{Array} & An array of any logical data type: \texttt{<}type\texttt{>}[ ] \\
125125
% \end{tabular}
126126

127-
\subsection{\textbf{Writing bits to a bit stream}}
127+
\subsection{\textbf{Reading and writing bits in a bit stream}}
128128

129-
A bit stream consists of a sequence of 1s and 0s. The bits are written most significant
130-
bit first where new bits are stacked to the right and full bytes on the left are
131-
written out. In a bit stream the last byte will be incomplete if less than 8 bits
132-
have been written to it. In this case the bits in the last byte are shifted to
133-
the left.
129+
The CORE block supports bit-based encoding methods.
130+
A bit stream consists of a sequence of 1s and 0s.
131+
The bits are written most significant bit first where new bits are stacked to the right and full bytes on the left are written out.
132+
In a bit stream the last byte will be incomplete if fewer than 8 bits have been written to it.
133+
In this case the bits in the last byte are shifted to the left to complete a whole byte.
134134

135135
\subsubsection*{Example of writing to bit stream}
136136

@@ -141,13 +141,13 @@ \subsubsection*{Example of writing to bit stream}
141141
\hline
142142
\textbf{Operation order} & \textbf{Buffer state before} & \textbf{Written bits} & \textbf{Buffer state after} & \textbf{Issued bytes}\tabularnewline
143143
\hline
144-
1 & 0x0 & 1 & 0x1 & -\tabularnewline
144+
1 & xxxx xxxx & 1 & xxxx xxx1 (0x01) & -\tabularnewline
145145
\hline
146-
2 & 0x1 & 0 & 0x2 & -\tabularnewline
146+
2 & xxxx xxx1 & 0 & xxxx xx10 (0x02) & -\tabularnewline
147147
\hline
148-
3 & 0x2 & 11 & 0xB & -\tabularnewline
148+
3 & xxxx xx10 & 11 & xxxx 1011 (0x0B) & -\tabularnewline
149149
\hline
150-
4 & 0xB & 0000 0111 & 0x7 & 0xB0\tabularnewline
150+
4 & xxxx 1011 & 0000 0111 & xxxx 0111 (0x07) & 1011 0000 (0xB0)\tabularnewline
151151
\hline
152152
\end{tabular}
153153

@@ -166,26 +166,29 @@ \subsubsection*{Example of writing to bit stream}
166166
\texttt{> echo "obase=2; ibase=16; B070" \textbar{} bc\\
167167
1011000001110000}
168168

169-
When reading the bits from the bit sequence it must be known that only 12 bits
170-
are meaningful and the bit stream should not be read after that.
169+
When reading the bits from the bit sequence, only the first 12 bits are meaningful and the remaining 4 should be discarded.
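
The writing example can be reproduced with a small sketch. The `BitWriter` class and its method names are hypothetical, chosen for illustration only, and are not taken from the specification or from any CRAM implementation. It assumes the rules described in this section: bits are written most significant bit first, full bytes are issued from the left, and the final partial byte is shifted left (here padded with zero bits, which is an assumption).

```python
class BitWriter:
    """Hypothetical MSB-first bit writer, for illustration only."""

    def __init__(self):
        self.buffer = 0          # pending bits, kept in the low end of an int
        self.nbits = 0           # number of pending bits that are valid
        self.out = bytearray()   # bytes issued so far

    def write(self, value, nbits):
        # Stack the new bits to the right of the pending ones.
        self.buffer = (self.buffer << nbits) | (value & ((1 << nbits) - 1))
        self.nbits += nbits
        # Issue every full byte from the left.
        while self.nbits >= 8:
            self.nbits -= 8
            self.out.append((self.buffer >> self.nbits) & 0xFF)
        # Keep only the bits that have not been issued yet.
        self.buffer &= (1 << self.nbits) - 1

    def flush(self):
        # Shift the remaining bits left to complete a whole byte
        # (zero padding is an assumption of this sketch).
        if self.nbits:
            self.out.append((self.buffer << (8 - self.nbits)) & 0xFF)
            self.buffer = 0
            self.nbits = 0
        return bytes(self.out)


# The four operations from the table: 1, 0, 11, 0000 0111.
w = BitWriter()
w.write(0b1, 1)
w.write(0b0, 1)
w.write(0b11, 2)
w.write(0b00000111, 8)
encoded = w.flush()
print(encoded.hex())  # b070
```

The fourth write issues 0xB0 and leaves 0111 in the buffer; the flush then shifts those 4 bits left to give 0x70, matching the B070 value shown above.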
171170

172-
\subsubsection*{Note on writing to bit stream}
171+
\subsubsection*{Note on reading from and writing to bit stream}
173172

174-
When writing to a bit stream both the value and the number of bits in the value
175-
must be known. This is because programming languages normally operate with bytes
176-
(8 bits) and to specify which bits are to be written requires a bit-holder, for
177-
example an integer, and the number of bits in it. Equally, when reading a value
178-
from a bit stream the number of bits must be known in advance. In case of prefix
179-
codes (e.g. Huffman) all possible bit combinations are either known in advance
180-
or it is possible to calculate how many bits will follow based on the first few
181-
bits. Alternatively, two codes can be combined, where the first contains the number
182-
of bits to read.
173+
When reading from or writing to a bit stream, our numeric values are
174+
typically held in a byte-oriented data type, such as an 8-bit or
175+
32-bit integer.
176+
The bit stream itself does not explicitly store the number of bits
177+
per value, and it will vary by context, so we must know this by other means.
178+
For example, we may be reading bits using a BETA encoding whose parameters
179+
indicate each value is 6 bits.
180+
So we read the next 6 bits into a 32-bit integer to get a value
181+
between 0 and 63.
182+
The next bits may be for a HUFFMAN encoding, in which case we can read one
183+
bit at a time until we match a known code-word in the Huffman tree.
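
A minimal sketch of the two reading strategies in this note follows. The `BitReader` class, its method names, and the three-symbol code table are all hypothetical and exist only for this illustration. A real CRAM decoder derives its Huffman code-words from the encoding parameters rather than from a hard-coded table, and any further BETA parameters are ignored here.

```python
class BitReader:
    """Hypothetical MSB-first bit reader, for illustration only."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # position in bits from the start of the stream

    def read_bit(self):
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, nbits):
        # Fixed-width read: the caller supplies the bit count,
        # because the stream itself does not store it.
        value = 0
        for _ in range(nbits):
            value = (value << 1) | self.read_bit()
        return value

    def read_prefix_code(self, codes):
        # Read one bit at a time until the bits seen so far match a
        # known code-word. `codes` maps (length, bits) to a symbol.
        length, bits = 0, 0
        while (length, bits) not in codes:
            bits = (bits << 1) | self.read_bit()
            length += 1
        return codes[(length, bits)]


# Made-up prefix code: A = 0, B = 10, C = 11.
codes = {(1, 0b0): "A", (2, 0b10): "B", (2, 0b11): "C"}

# Stream: 101101 (6-bit value 45), then 10, 0, 11, then 5 padding bits.
reader = BitReader(bytes([0b10110110, 0b01100000]))
value = reader.read_bits(6)
symbols = [reader.read_prefix_code(codes) for _ in range(3)]
print(value, symbols)  # 45 ['B', 'A', 'C']
```

In both cases the number of bits consumed comes from outside the stream: a width parameter for the fixed-width read, and the code table for the prefix code.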
183184

184185
\subsection{\textbf{Writing bytes to a byte stream}}
185186
\label{subsec:writing-bytes}
186187

187-
The interpretation of byte stream is straightforward. CRAM uses \emph{little endianness}
188-
for bytes when applicable and defines the following storage data types:
188+
Byte streams cannot be mixed in the same block as bit streams.
189+
The interpretation of a byte stream is straightforward.
190+
CRAM uses \emph{little endianness} for bytes when applicable and
191+
defines the following storage data types:
189192

190193
\begin{description}
191194
