cram: Define coordinate system for alignment positions #810

Open
zaeleus opened this issue Feb 4, 2025 · 1 comment · May be fixed by #821

zaeleus commented Feb 4, 2025

This is in regard to CRAM format specification (version 3.1) (2024-09-04).

The coordinate system used for alignment positions is undefined. In practice, alignment positions are 1-based.

@jkbonfield
Contributor

It looks to be semi-defined. For coordinate-sorted alignment data it is defined to be a delta against the container start coordinate. Although not explicitly defined, a delta implies that 0 means equal to that start. So with coordinate-sorted data we see this:

@ seq22-head2[samtools.../hts-specs]; cat _.sam
@SQ	SN:c1	LN:10
s0	0	c1	1	0	10M	*	0	0	AACCGCGGTT	*
s1	0	c1	5	0	10M	*	0	0	AACCGCGGTT	*
@ seq22-head2[samtools.../hts-specs]; samtools view -O cram,no_ref _.sam | cram_dump -v - | egrep '( Ref pos:|^AP =|AP => [01])'
    Ref pos:           1 + 14
	AP => 1 (0x1)
AP = 0 (ret 0, out_sz 1)
AP = 4 (ret 0, out_sz 1)
    Ref pos:           4542278 + 0

With unsorted data, we have "AP => 0" in the container header, which turns off the deltas. We then store AP verbatim instead of using deltas, and at that point we do indeed store 1-based coordinates. This is a good thing, given that 0-based storage makes unmapped data produce negative values, which is something I always disliked in BAM!

@ seq22-head2[samtools.../hts-specs]; cat _2.sam
@SQ	SN:c1	LN:10
s1	0	c1	5	0	10M	*	0	0	AACCGCGGTT	*
s0	0	c1	1	0	10M	*	0	0	AACCGCGGTT	*
@ seq22-head2[samtools.../hts-specs]; samtools view -O cram,no_ref _2.sam | cram_dump -v - | egrep '( Ref pos:|^AP =|AP => [01])'
    Ref pos:           1 + 14
	AP => 0 (0x0)
AP = 5 (ret 0, out_sz 1)
AP = 1 (ret 0, out_sz 1)
    Ref pos:           4542278 + 0
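
To make the two modes concrete, here is a minimal Python sketch that reproduces the AP values from the two dumps above. It is not taken from htslib or the specification text. The function names and the `delta` parameter are hypothetical, and it assumes each delta is taken against the previous record, seeded with the container start. The two-record dump above cannot distinguish that from a delta taken directly against the container start, since both give 0 and 4.

```python
from typing import List


def encode_ap(positions: List[int], container_start: int, delta: bool) -> List[int]:
    """Turn 1-based alignment starts into AP values (illustrative only)."""
    if not delta:
        # "AP => 0": positions are stored verbatim, so they stay 1-based.
        return list(positions)
    # "AP => 1": store differences. Assumption: the running reference point
    # starts at the container start and then follows the previous record.
    values = []
    prev = container_start
    for pos in positions:
        values.append(pos - prev)
        prev = pos
    return values


def decode_ap(values: List[int], container_start: int, delta: bool) -> List[int]:
    """Inverse of encode_ap: recover 1-based alignment starts."""
    if not delta:
        return list(values)
    positions = []
    prev = container_start
    for value in values:
        prev += value
        positions.append(prev)
    return positions


sorted_ap = encode_ap([1, 5], container_start=1, delta=True)
unsorted_ap = encode_ap([5, 1], container_start=1, delta=False)
print(sorted_ap, unsorted_ap)
```

Under these assumptions the sorted input yields the deltas seen in the first dump and the unsorted input yields the verbatim 1-based values seen in the second.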

I'll amend the text to be explicit.

The other place coordinates are used is the PNEXT field. These too are 1-based, or 0 for unused.

@ seq22-head2[samtools.../hts-specs]; cat _3.sam
@SQ	SN:c1	LN:20
s0	0	c1	1	0	10M	=	7	10	AACCGCGGTT	*
s1	0	c1	5	0	10M	=	1	5	AACCGCGGTT	*
s2	0	c1	7	0	10M	*	0	0	GGGGGGGGGG	*
@ seq22-head2[samtools.../hts-specs]; samtools view -O cram,no_ref _3.sam | cram_dump -v - | egrep '^NP ='
NP = 7 (ret 0, out_sz 1)
NP = 1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
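
A reader of these values could interpret them along the following lines. This is only a sketch of the convention described above (1-based, 0 for unused); the helper name `mate_start_from_np` is hypothetical and not part of any library.

```python
from typing import Optional


def mate_start_from_np(np_value: int) -> Optional[int]:
    """Interpret an NP value: 0 means no mate position, otherwise 1-based."""
    return None if np_value == 0 else np_value


# The three NP values from the dump above.
mates = [mate_start_from_np(v) for v in (7, 1, 0)]
print(mates)
```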

At least it's consistently 1-based. Phew!
