Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust parser for undefined CIGARs #1182

Closed
wants to merge 3 commits into from

Conversation

valeriuo
Copy link
Contributor

The newly added sam_parse_cigar and bam_parse_cigar methods can now handle unavailable/undefined CIGAR strings (*). The returned value (number of CIGAR operators) in this case is still 0. However, the return type of both methods was changed to signed integer, so that errors can be reported with -1 code.
Tests were added for sam_parse_cigar.

Related to #1169

sam.c Outdated
@@ -2150,8 +2150,8 @@ int sam_parse1(kstring_t *s, sam_hdr_t *h, bam1_t *b)
if (*p != '*') {
uint32_t *cigar = NULL;
int old_l_data = b->l_data;
uint32_t n_cigar = bam_parse_cigar(p, &p, b);
if (!n_cigar || *p++ != '\t') goto err_ret;
int32_t n_cigar = bam_parse_cigar(p, &p, b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being nit-picky this should be ssize_t, but frankly the world will have broken long before we get to 2 billion cigar elements! Infact I'm not sure it's possible in BAM as the bam record would overflow in length.

I'd just go with an unsized "int" or a specific "ssize_t" if you feel inclined to be pedantic.

cig = "2M3X1I10M5D";
n = sam_parse_cigar(cig, &end, &buf, &m);
VERIFY(n == 5 && m > 0 && (end-cig) == 11, "failed to parse CIGAR string: 2M3X1I10M5D");
n = sam_parse_cigar("722M15D187217376188323783284M67I", NULL, &buf, &m);
Copy link
Contributor

@jkbonfield jkbonfield Nov 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: ignore me! Sorry it's validating it can't parse it. Good one. :-)

I'm not sure what the point is of this test. 187217376188323783284 is 68 bits long.

Cigars are held inside htslib as uint32_t with 4 bits being the type, so the maximum length of cigar op we currently handle is 268435455 (2^28-1). CRAM could handle longer lengths (if #976 got merged), but I think we rejected that idea as the in-memory htslib struct would need fixing first.

I'm sure we discussed changing the cigar type when we were doing all of our sweeping ABI changes and it was rejected, so I think we're committed to having excessively long cigar lengths as a series of concatenated smaller lengths using the same op type. (I don't see a reason to support that though until we get reads long enough and accurate enough to generate such data. We're considerably far off that still.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does raise the question though why the test isn't:

VERIFY(n == -1, "failed to flag CIGAR string with long op length: 722M15D187217376188323783284M67I");

The documentation for sam_parse_cigar says -1 on error and "number of processed CIGAR operators" otherwise. If the CIGAR has been rejected (even if syntacically valid) then it should probably return -1 in this case.

htslib/sam.h Outdated
*/
HTSLIB_EXPORT
size_t sam_parse_cigar(const char *in, char **end, uint32_t **a_cigar, uint32_t *a_mem);
ssize_t sam_parse_cigar(const char *in, char **end, uint32_t **a_cigar, uint32_t *a_mem);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By luck sam.h currently gets it via kstring.h, but in case that changes: need to add #include <sys/types.h> at the top of sam.h to ensure ssize_t is declared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch.

htslib/sam.h Outdated
@@ -1110,21 +1110,21 @@ int bam_set_qname(bam1_t *b, const char *qname);
can be NULL
@param a_cigar [out] address of the destination uint32_t buffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a_cigar is [in/out] presumably?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I thought of just being [out], as the information contained by the buffer is generated inside the method, but the pointer itself is being read. So, I guess a case can be made for [in].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in because the pointed-to variable needs to already have a meaningful value before the call (this is how it differs from end, whose pointed-to variable may even be uninitialised). It's out because the pointed-to value might be different after the call.

Therefore it's [in/out], the same as a_mem.

@jmarshall
Copy link
Member

While we're revisiting the signatures of these new functions,

-size_t sam_parse_cigar(const char *in, char **end, uint32_t **a_cigar, uint32_t *a_mem);
+ssize_t sam_parse_cigar(const char *in, char **end, uint32_t **a_cigar, uint32_t *a_mem);

using such a low-level type for the user's buffer length is not ideal: the proper type for this in C would be size_t *a_mem.

Also the two utility functions should not cast away const so soon, as this invites bugs. end needs to be a non-const char * for the same reasons as in strtol() etc, but it would be better to use a (char *) cast only when actually writing to end. So read_ncigar() should use only a const char * and parse_cigar() should primarily use a const char *p (see e.g. jmarshall/htslib@a12d4d4).

@jkbonfield
Copy link
Contributor

Rebased & squashed, merged as 2056490

@jkbonfield jkbonfield closed this Nov 30, 2020
@valeriuo valeriuo deleted the cigar_string branch December 3, 2020 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants