Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with long ref and ref-skips #971

Open
jkbonfield opened this issue Nov 14, 2019 · 3 comments
Open

Issues with long ref and ref-skips #971

jkbonfield opened this issue Nov 14, 2019 · 3 comments

Comments

@jkbonfield
Copy link
Contributor

TODO: Fix the 28-bit limit on CIGAR length size in the internal memory structures.

While we have no immediate plans to support reads several Gb long, we do support references this long now.

However exome or tRNA style alignments (eg tophat) can produce long reference alignments by the addition of huge N ref-skip ops. Such alignments would be rare to be over 1Gb in genuine cases, but there is always room for badly mapped data and errors, plus there is potential for this to happen with circular genomes.

An example:

@SQ	SN:LONG	LN:8000000000
SRR065390.14978392	16	LONG	2	1	27M7000000000N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Samtools currently states:

$ samtools view ~/tmp/long.sam
[E::sam_parse1] numeric value out of allowed range
[W::sam_read1] Parse error at line 2
[main_samview] truncated file.

We can try chopping it up, but even 1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N doesn't work due to the 28-bit size limit in BAM's cigar handling. The internals of htslib shouldn't be so wedded to the BAM on disk format and ought to be using more than 28-bits. The SAM parser at least can swallow this if we go below 28-bits. eg:

@SQ	SN:LONG	LN:8000000000
SRR065390.14978392	16	LONG	2	1	27M200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

It can be written to BAM this way. Region queries pull this out OK. Also samtools mpileup -r LONG:7000000000-8000000000 ~/tmp/long3.bam works (after a suitably long pregnant pause).

There are issues with the CRAM decoder though:

$ samtools view ~/tmp/long3.cram
SRR065390.14978392	16	LONG	2	1	27M20678144N73M	*	0	0	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

I think this is actually the same bug in a different guise. CRAM is combining multiple neighbouring op codes together, but in doing so is overflowing the internal 28-bit limit.

@adf-ncgr
Copy link

FWIW, this same thing happens when representing whole genome alignments (e.g. as produced by minimap2) in bam; if the query sequence is > 2^28 then the clipping operations that indicate how far along the query sequence the alignment started may be wrong, causing some problematic behavior when trying to interpret the alignments in a dotplot context, where both query and ref positions for each aligned block are used (ends up looking like a translocation of the latter part of the query chromosome back to the beginning). Granted, chromosome sequences aren't exactly "reads" but it is handy to be able to use bam to represent whole-genome alignments for use with other tools. So +1 to addressing this!

@jkbonfield
Copy link
Contributor Author

#976 was an attempt to fix this for CRAM, but I forget why we rejected it as an idea.

(I think trying to represent whole genome alignments in BAM is perhaps a bit of an abuse, but my PR was to fix something more mundane and realistic - TopHat alignments on non-human genomes.)

Maybe I'll resurrect the idea, and put in logic for the SAM parser too.

@adf-ncgr
Copy link

Thanks for the quick reply and agreed (per #976) that the error handling has been much improved- I forgot that when I was really getting confused by the behavior I was using an older version of samtools that didn't complain about the overflow issue and resulted in the "translocation-like" behavior I described (and this was basically in the context of doing round-tripping from sam -> bam -> sam).

Not sure if it's an abuse or not, but it's certainly handy to be able to read in the whole genome alignments into a bam-aware tool like IGV; so thanks for considering it (again)!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants