
Commit 7de0ae0

Clarify the name tokeniser uncomp_len calculation (PR samtools#803)
This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0').

Fixes samtools#802

Also clarify the name tokeniser serialisation description. Acknowledge the 1-byte "use_arith" field and replace the nebulous "array elements" with a more descriptive text about token streams.
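The uncomp_len rule described in this commit message can be illustrated with a minimal sketch. The function name `uncomp_len` and the sample read names below are hypothetical, chosen only for illustration; the only thing taken from the commit is the rule itself (all visible name bytes plus one termination byte per name).

```python
def uncomp_len(names):
    """Sum of all name lengths plus one termination byte per name.

    `names` is a list of bytes objects WITHOUT terminators; the +1
    accounts for the terminator (e.g. b"\\0") that the size includes.
    """
    return sum(len(name) + 1 for name in names)


# Hypothetical read names, for illustration only.
names = [b"read1", b"read22"]

# The same total is obtained whether or not the implementation ever
# builds a single NUL-terminated buffer.
nul_joined = b"".join(name + b"\0" for name in names)

print(uncomp_len(names))   # 13
print(len(nul_joined))     # 13
```

An implementation that returns separate name and name-length arrays would still report the same value, since the rule counts one terminator per name regardless of the in-memory layout.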
1 parent a6a4504 commit 7de0ae0


1 file changed: +12 -5 lines changed


CRAMcodecs.tex

@@ -2450,11 +2450,18 @@ \section{Name tokenisation codec}
 a format within a format, as the multiple byte streams $B_{pos,type}$
 are serialised into a single byte stream.
 
-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names. This is followed the array elements
-themselves.
-
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of uncompressed name buffer and
+the number of read names, and a flag byte indicating whether data is
+compressed with arithmetic coding or rANS Nx16.
+Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char). This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.
+
+This is then followed by serialised data and meta-data for each token
+stream.
 Token types, $ttype$ holds one of the token ID values listed above
 in the list above, plus special values to indicate certain additional
 flags. Bit 6 (64) set indicates that this entire token data stream is a
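
The 9-byte header described by the added text (two unsigned little endian 32-bit integers followed by one flag byte) can be sketched with Python's `struct` module. This is an illustrative sketch, not the reference implementation: the helper names `pack_header` and `parse_header` are hypothetical, and treating a non-zero flag byte as "arithmetic coding" is an assumption based on the field being called "use_arith" in the commit message.

```python
import struct

# "<" selects little endian with no padding: uint32, uint32, uint8 = 9 bytes.
HEADER = struct.Struct("<IIB")


def pack_header(uncomp_len, num_names, use_arith):
    """Serialise the header fields (hypothetical helper)."""
    # Assumption: flag byte is 1 for arithmetic coding, 0 for rANS Nx16.
    return HEADER.pack(uncomp_len, num_names, 1 if use_arith else 0)


def parse_header(buf):
    """Return (uncomp_len, num_names, use_arith, bytes_consumed)."""
    uncomp_len, num_names, flag = HEADER.unpack_from(buf, 0)
    return uncomp_len, num_names, flag != 0, HEADER.size


# Two names totalling 13 bytes including terminators, rANS Nx16 selected.
header = pack_header(13, 2, use_arith=False)
print(header.hex())                    # 0d0000000200000000
print(parse_header(header + b"..."))   # (13, 2, False, 9)
```

`parse_header` raises `struct.error` if fewer than 9 bytes are available; the serialised token stream data would begin at the returned offset.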
