Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to fail on duplicate LN tags, with different LN values #1882

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Jan 27, 2025

Fixes #1866
Updated to use 1st LN tag value as in cram handling.
Accepts duplicate LN tags as long as they have same value.
If the values differ, returns failure.

sam_hdr_create updated for sam file checks and sam_hdr_sanitise for handling bam and cram.
sam_hdr_sanitise calls sam_hdr_fill_hrecs and in turn sam_hrecs_update_hashes, where duplicate check is made.

For cram, there is no added overhead as fill_hrecs is already in use.
For sam and bam, the sanitise update adds as overhead which is required to identify the duplicates.
For sam, the check occurs during normal parsing in sam_hdr_create and sanitise update is redundant.
For bam, fill_hrecs update is required to identify the duplicates. sanitise seem to be the best common place to have this.

@daviesrob daviesrob self-assigned this Jan 30, 2025
sam.c Outdated
@@ -1911,6 +1911,12 @@ static sam_hdr_t *sam_hdr_sanitise(sam_hdr_t *h) {
cp[h->l_text] = '\0';
}

if (sam_hdr_fill_hrecs(h) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to call sam_hdr_fill_hrecs() here. It will trigger a full parse of the header, which we may not want to do unless really needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

sam.c Outdated
hts_pos_t tmp = strtoll(q + 3, (char**)&q, 10);
if (ln != -1 && tmp != ln) {
//duplicate LN tag with different value
hts_log_error("111Header includes @SQ line \"%s\" with \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message looks a bit odd...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@vasudeva8
Copy link
Contributor Author

updated to avoid hrec fill overhead.
bam has no change in processing and does not show any warning on duplicate LN tag in text part
cram shows the warning as earlier.
sam shows warning on duplicate LN tag and uses the 1st LN value as in CRAM processing.

bam processing is odd that it doesn't show the warning but avoids the overhead.

@daviesrob
Copy link
Member

I thik you may have removed too much in the last update. We still want to fail if SAM finds a header with two LN: tags.

@vasudeva8
Copy link
Contributor Author

If we have error return, then SAM and with some extra changes CRAM will fail in duplicate but BAM won't, as we will not invoke the hrec_fill due to overheads.
Is it not better to have consistent behaviour across formats, that all succeeds and just BAM won't warn?
As SAM uses the 1st LN value, the -ve value issue should not be there any more.

@vasudeva8
Copy link
Contributor Author

Updated to fail in case of duplicate LN tag with different values.

This change makes header read for sam, cram different from bam, where bam doesn't fail and sam, cram fails on header read api. This difference will be in library level but the library is mainly used in a utility (samtools) level where another operation like PG addition takes place, making the failure visible in bam as well and all formats have similar failure. (if --no-PG is used with bam it will not fail).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistency in parsing @SQ headers with multiple LN: and SN: tags
2 participants