Commit 4982e03

Clarify the name tokeniser uncomp_len calculation (PR samtools#803)
This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802
1 parent 836fb61 commit 4982e03

1 file changed

+8
-4
lines changed

Diff for: CRAMcodecs.tex

@@ -2450,10 +2450,14 @@ \section{Name tokenisation codec}
 a format within a format, as the multiple byte streams $B_{pos,type}$
 are serialised into a single byte stream.
 
-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names. This is followed the array elements
-themselves.
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of uncompressed name buffer and
+the number of read names. This is followed the array elements
+themselves. Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char). This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.
 
 Token types, $ttype$ holds one of the token ID values listed above
 in the list above, plus special values to indicate certain additional
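
The size rule added by this commit can be illustrated with a minimal Python sketch. It assumes only what the changed text states: two unsigned little-endian 32-bit integers, the first being the sum of all name lengths plus one termination byte per name, the second being the number of names. The function name `name_tok_header` and the sample read names are hypothetical, and the token streams that follow the header are not shown.

```python
import struct


def name_tok_header(names):
    """Build the 8-byte header described in the changed paragraph.

    `names` is a list of bytes objects holding the visible read name
    bytes only (no terminator). This helper is illustrative and is not
    taken from any existing implementation.
    """
    # Each name counts its visible bytes plus 1 termination byte,
    # even if the caller stores names and lengths in separate arrays.
    uncomp_len = sum(len(name) + 1 for name in names)
    # "<II" = two unsigned 32-bit integers, little endian.
    return struct.pack("<II", uncomp_len, len(names))


# Hypothetical read names, stored without terminators.
names = [b"read1", b"read2", b"r10"]
header = name_tok_header(names)
uncomp_len, n_names = struct.unpack("<II", header)

# The same size falls out of a nul-terminated concatenated buffer.
nul_joined = b"".join(name + b"\0" for name in names)
assert uncomp_len == len(nul_joined)
```

Both representations give the same value, which is the point of the clarification: the header field does not depend on how the implementation chooses to return the decoded names.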
