Explicitly permit eg N+m in MM tag (PR#799)#799
Explicitly permit eg N+m in MM tag (PR#799)#799jkbonfield wants to merge 1 commit intosamtools:masterfrom
Conversation
|
Previous sentence says “An unmodified base of ‘N’”; this new sentence says “A fundamental base of ‘N’”. Is there a distinction between unmodified and fundamental (if so, let's describe it)? Otherwise, let's use one or the other. |
The text already states that an unmodified base of N means we count any base type, but base N code N in the table is a little misleading as to the intention. It was intended to mean any unspecified modification, in the same way C+C is any unspecified C mod, but in this case it's against all bases rather than a specific base type. However that doesn't solve the issue of whether we can record specific mods against any "fundamental" source base. Clarified this by adding an extra line to the table and some text. (However note this doesn't necessarily imply downstream processing tools will not do any compatibility assessment and reject N+m when the SEQ base is a T.) Fixes samtools#785
|
I reviewed the change again and decided "fundamental" base is the clearer term. I note the table below this has unmodified base, but I'm unsure whether that should be changed or not. For the language "C+hm," it is the "C" here we refer to as "fundamental", given unmodified is a confusing term to use when describing a base that is infact modified. However for the table of from-this to-that unmodified may be easier to understand and we're not explicitly referring to a string or serialisation syntax for that table. I'm not overly happy with the first paragraph of "from the primary unmodified sequence as originally reported by the sequencing instrument", given the sequencing instrument is likely the thing stating this is modified. However typically it does that in stages, recording the unmodified base type in FASTA/FASTQ and a modification as a side-channel. I want to say as recorded in the SAM SEQ column as that forbids things like h, m, etc, but then it gets mired in reverse-complementing confusions. I think for now it's better, but not perfect. Ideas welcomed, although maybe best dealt with in another PR if it's a big rewrite. |
The text already states that an unmodified base of N means we count any base type, but base N code N in the table is a little misleading as to the intention. It was intended to mean any unspecified modification, in the same way C+C is any unspecified C mod, but in this case it's against all bases rather than a specific base type.
However that doesn't solve the issue of whether we can record specific mods against any "fundamental" source base. Clarified this by adding an extra line to the table and some text. (However note this doesn't necessarily imply downstream processing tools will not do any compatibility assessment and reject N+m when the SEQ base is a T.)
Fixes #785