Clarify the name tokeniser uncomp_len calculation (PR samtools#803)

jkbonfield · jkbonfield · commit bece1f70529a · 2025-01-07T14:44:15.000Z
The uncomp_len value includes all visible read name bytes plus 1 termination byte per name (e.g. '\0'). Fixes samtools#802
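
As an illustration of the calculation described above, here is a minimal Python sketch. It assumes names are held as byte strings without their terminator; the helper name name_tok_header is hypothetical and is not part of any implementation. It builds only the two leading unsigned little-endian 32-bit integers, not the token streams that follow.

```python
import struct

def name_tok_header(names):
    """Pack uncomp_len and the name count as two little-endian uint32s.

    Hypothetical helper for illustration only.
    """
    # Each name contributes its visible bytes plus one termination byte,
    # regardless of how a decoder later hands the names back.
    uncomp_len = sum(len(n) + 1 for n in names)
    return struct.pack("<II", uncomp_len, len(names))

names = [b"read1", b"read22", b"r3"]
header = name_tok_header(names)
uncomp_len, n_names = struct.unpack("<II", header)

# The same total is obtained by measuring a nul-terminated concatenation.
joined = b"".join(n + b"\0" for n in names)
```

Here uncomp_len matches len(joined), so a decoder that returns separate name and name-length arrays still reports the size of the nul-terminated form.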
diff --git a/CRAMcodecs.tex b/CRAMcodecs.tex
@@ -2450,10 +2450,13 @@ \section{Name tokenisation codec}
 a format within a format, as the multiple byte streams $B_{pos,type}$
 are serialised into a single byte stream.
 
-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names.  This is followed the array elements
-themselves.
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of the uncompressed name buffer
+and the number of read names.  This is followed by the array elements
+themselves.  Note the uncompressed size is the sum of all name lengths
+including a termination byte per name (e.g. the nul char).  This is
+irrespective of whether the implementation produces data in this form
+or whether it returns separate name and name-length arrays.
 
 Token types, $ttype$ holds one of the token ID values listed above
 in the list above, plus special values to indicate certain additional