Issues with long ref and ref-skips #971

jkbonfield · 2019-11-14T10:04:09Z

TODO: Fix the 28-bit limit on CIGAR length size in the internal memory structures.

While we have no immediate plans to support reads several Gb long, we do support references this long now.

However exome or tRNA style alignments (eg tophat) can produce long reference alignments by the addition of huge N ref-skip ops. Such alignments would be rare to be over 1Gb in genuine cases, but there is always room for badly mapped data and errors, plus there is potential for this to happen with circular genomes.

An example:

@SQ	SN:LONG	LN:8000000000
SRR065390.14978392	16	LONG	2	1	27M7000000000N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Samtools currently states:

$ samtools view ~/tmp/long.sam
[E::sam_parse1] numeric value out of allowed range
[W::sam_read1] Parse error at line 2
[main_samview] truncated file.

We can try chopping it up, but even 1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N doesn't work due to the 28-bit size limit in BAM's cigar handling. The internals of htslib shouldn't be so wedded to the BAM on disk format and ought to be using more than 28-bits. The SAM parser at least can swallow this if we go below 28-bits. eg:

@SQ	SN:LONG	LN:8000000000
SRR065390.14978392	16	LONG	2	1	27M200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

It can be written to BAM this way. Region queries pull this out OK. Also samtools mpileup -r LONG:7000000000-8000000000 ~/tmp/long3.bam works (after a suitably long pregnant pause).

There are issues with the CRAM decoder though:

$ samtools view ~/tmp/long3.cram
SRR065390.14978392	16	LONG	2	1	27M20678144N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

I think this is actually the same bug in a different guise. CRAM is combining multiple neighbouring op codes together, but in doing so is overflowing the internal 28-bit limit.

The text was updated successfully, but these errors were encountered:

adf-ncgr · 2021-05-23T20:49:39Z

FWIW, this same thing happens when representing whole genome alignments (e.g. as produced by minimap2) in bam; if the query sequence is > 2^28 then the clipping operations that indicate how far along the query sequence the alignment started may be wrong, causing some problematic behavior when trying to interpret the alignments in a dotplot context, where both query and ref positions for each aligned block are used (ends up looking like a translocation of the latter part of the query chromosome back to the beginning). Granted, chromosome sequences aren't exactly "reads" but it is handy to be able to use bam to represent whole-genome alignments for use with other tools. So +1 to addressing this!

jkbonfield · 2021-05-24T08:23:57Z

#976 was an attempt to fix this for CRAM, but I forget why we rejected it as an idea.

(I think trying to represent whole genome alignments in BAM is perhaps a bit of an abuse, but my PR was to fix something more mundane and realistic - TopHat alignments on non-human genomes.)

Maybe I'll resurrect the idea, and put in logic for the SAM parser too.

adf-ncgr · 2021-05-25T00:00:23Z

Thanks for the quick reply and agreed (per #976) that the error handling has been much improved- I forgot that when I was really getting confused by the behavior I was using an older version of samtools that didn't complain about the overflow issue and resulted in the "translocation-like" behavior I described (and this was basically in the context of doing round-tripping from sam -> bam -> sam).

Not sure if it's an abuse or not, but it's certainly handy to be able to read in the whole genome alignments into a bam-aware tool like IGV; so thanks for considering it (again)!

daviesrob mentioned this issue Nov 14, 2019

Fix possible integer overflows in sam_parse1() #972

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with long ref and ref-skips #971

Issues with long ref and ref-skips #971

jkbonfield commented Nov 14, 2019

adf-ncgr commented May 23, 2021

jkbonfield commented May 24, 2021

adf-ncgr commented May 25, 2021

Issues with long ref and ref-skips #971

Issues with long ref and ref-skips #971

Comments

jkbonfield commented Nov 14, 2019

adf-ncgr commented May 23, 2021

jkbonfield commented May 24, 2021

adf-ncgr commented May 25, 2021