Commit 3b4e874

Allow for UTF-8 field values in header regular expression
Use `[:print:]` in the header regex and note that for ASCII it is equivalent to `[ -~]` and that the aim is to forbid control characters. Fixes #719.
1 parent 7cfd789, commit 3b4e874

1 file changed: 5 additions, 2 deletions
SAMv1.tex

@@ -77,6 +77,7 @@ \section{The SAM Format Specification}
 For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
 
 The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
+For brevity, named character classes are written as~{\tt [\cclass{class}]} without an additional pair of brackets.
 
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
@@ -223,8 +224,10 @@ \subsection{The header section}
 each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG}
 is a two-character string that defines the format and content of {\tt VALUE}.
 Thus header lines match {\tt
-/\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[
--\char126]+)+\$/} or {\tt /\char94@CO\char92t.*/}.
+/\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[\cclass{print}]+)+\$/}
+or {\tt /\char94@CO\char92t.*/}.%
+\footnote{{\tt [\cclass{print}]} indicates that header field values contain printable characters, i.e.,~non-control characters.
+For fields limited to~ASCII, which is the majority, this is equivalent to~{\tt [ -\char126]}.}
 Within each (non-{\tt @CO}) header line, no field tag may appear more than
 once and the order in which the fields appear is not significant.
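
For readers who want to try the revised pattern outside a POSIX regex engine, the following is a minimal Python sketch, not part of the commit. Python's `re` module has no named classes such as `[:print:]`, so the sketch assumes a stand-in that rejects C0/C1 control characters and DEL; what a real POSIX `[:print:]` accepts depends on the locale and may differ. The function name `is_header_line` and its `allow_utf8` flag are hypothetical.

```python
import re

# ASCII-only form from the footnote: [ -~] spans space (0x20) through tilde (0x7E).
ASCII_HEADER = re.compile(r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+")

# Assumed approximation of [[:print:]]: any character except C0/C1 controls and DEL.
# This is an illustration only; POSIX leaves the exact set to the locale.
PRINT_HEADER = re.compile(
    r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[^\x00-\x1f\x7f-\x9f]+)+"
)

# Comment lines are unconstrained after the tab (the spec pattern has no end anchor).
COMMENT = re.compile(r"@CO\t.*")


def is_header_line(line: str, allow_utf8: bool = True) -> bool:
    """Return True if `line` (without its newline) matches a header-line pattern."""
    pattern = PRINT_HEADER if allow_utf8 else ASCII_HEADER
    # fullmatch plays the role of the ^...$ anchors in the specification's regex.
    return bool(pattern.fullmatch(line) or COMMENT.match(line))
```

Under these assumptions, a field value such as `DS:caf\u00e9` is accepted by the `[:print:]`-style pattern but rejected by the ASCII-only one, while a value containing a control character such as `\x07` is rejected by both.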
