Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cram: Clarify the preservation map's reference required flag for unmapped slices #809

Open
zaeleus opened this issue Feb 4, 2025 · 1 comment
Labels

Comments

@zaeleus
Copy link

zaeleus commented Feb 4, 2025

This is in regard to CRAM format specification (version 3.1) (2024-09-04).

§ 8.4.1 "Compression header block: Preservation map" describes the reference required (RR) flag as "true if reference sequence is required to restore the data completely". If this is true and the records do not require a reference sequence to restore the data (e.g., an unmapped slice), is it considered an invalid state?

As I understand it, this flag should be false if the slice is unmapped, but some implementation don't set it as such, e.g., htslib:

$ samtools --version | head -2
samtools 1.21
Using htslib 1.21

$ (
samtools view --output-fmt cram <<EOF
@HD	VN:1.6
r1	4	*	0	0	*	*	0	0	NNNN	!!!!
EOF
) | cram_dump - | grep --after-context 4 "Preservation map" | head -5
      Preservation map:
        SM => 30398990 (0x1cfda0e)
        TD => 30398986 (0x1cfda0a)
        RN => 1 (0x1)
        AP => 1 (0x1)

All the records in this slice are unmapped (cram_dump: "Slice ref seq -1"), and this implicitly sets the reference required field to true, as per "The boolean values are optional, defaulting to true when absent..." However, a reference sequence is not required to decode this.

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 25, 2025

That's a good point. CRAM v1 didn't have RR. It was added for v2 so we could encode data without needing a reference; useful for unmapped data and also sometimes for mapped but unsorted data. It looks like I never added the explicit RR=0 data in there though.

I'm not sure if this makes the htslib files explicitly non-conforming or not. Arguably yes as the spec is written, but the intention was simply to use this as a hint for whether a reference needs to be loaded, and where the container and slice header explicitly. Given the container has "ref id: -1", it's neither here nor there whether it attempts to load it or not as it's akin to saying "don't load the reference" vs "load reference ".

I'll chalk it up as an htslib bug however.

Edit: I tested picard 2.26 (we don't have a new enough Java for picard 3) and it explicitly sets RR=1, which is even more inappropriate. Bah.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: New items
Development

No branches or pull requests

2 participants