Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reporting of depth with/without overlap by bcftools mpileup and samtools depth #1504

Closed
claire-anderson-1 opened this issue Jun 8, 2021 · 4 comments

Comments

@claire-anderson-1
Copy link

claire-anderson-1 commented Jun 8, 2021

This query relates to both samtools depth and bcftools mpileup, but I have only posted my issue here on the bcftools development site.

I have been using the following versions of samtools depth and bcftools mpileup to determine sequencing depth:
• samtools depth Version: 1.8 (using htslib 1.8) - installed on my work cluster
• bcftools mpileup Version: 1.8 (using htslib 1.8) - installed on my work cluster
• samtools depth Version: 1.12-26-ge77c7c6 (using htslib 1.12-25-g5b684ce) – installed locally
• bcftools mpileup Version: 1.12-21-ga865a16 (using htslib 1.12-25-g5b684ce) – installed locally

I created a test scenario with 18 paired end sequences that align to a short, artificial reference sequence. The sequences were aligned using bwa mem and then filtered and sorted using samtools view/sort.

Having manually checked the alignment file:
• All forward sequences have a flag of 99 and all reverse sequences have a flag of 147, indicating proper mapping of read pairs
• Mapping quality is 60 for all reads, so this is well above any filters used in bcftools mpileup
• The CIGAR strings indicate that all bases align (no insertions or deletions)
• The start positions are all correct
• The total template lengths are all correct

So I conclude that there is no error in my alignment file. However, when I used the above mentioned versions of samtools depth and bcftools mpileup, these utilities do not report the expected sequence depths when different values of -d were passed (with or without -x/-s). I observed that:
• neither bcftools mpileup version 1.8 nor 1.12 detected any overlapping sequences
• samtools depth version 1.18 and bcftools mpileup versions 1.8 and 1.12 all give the same (often incorrect) depth
• when using samtools depth version 1.18 or bcftools mpileup versions 1.8 and 1.12, changing -d from 10 to 20 or from 20 to 200 can change the reported sequence depth at a given position, despite the fact that there are a maximum of five reads starting at any one site
• samtools depth version 1.12 is the only utility that is correctly identifying overlapping paired reads
• samtools depth version 1.12 does not generate an error message when -d [int] is passed, yet this option does not appear to do anything as this version of samtools depth reported the same depth irrespective of the value passed to -d

I have attached a PDF that contains more detail.

The observed behaviour does not match my expectation based on my reading of the manual pages for these utilities. Are they performing correctly? If so, can you please explain why I see the observed results?

I look forward to your response!

Depth_bug_report.pdf

@jkbonfield
Copy link
Contributor

The latest samtools depth does completely ignore the -d parameter, but it's also not documented any more either. We decided to accept and discard it rather than issue an error because many pipelines use -d 0 or e.g. -d 999999 as a way to remove the depth limit.

The issue with older commands and mpileup is how the depth removal works. It's complex, and leads to nasty edge cases. I tried to improve this in samtools/htslib#1084, but it also had some edge cases remaining so it wasn't accepted. (IMO it's still a step forward however.)

Fundamentally limiting depth is messy unless you're willing to chop sequences in half. Removing a sequence because it is over-deep in one place may inevitably lead to under-depth somewhere else. That's just the nature of the game. Given the original intention for the depth limit was to prevent deep pileup from causing big slow downs and high memory usage, I think the way it ought to work is to specify a minimum upper-bound or some sort of mean depth, where it's fine for some regions to be a bit deeper. It's not fine however for artifacts / edge-cases to cause a complete collapse of depth down to zero, which is what happens right now sadly.

@claire-anderson-1
Copy link
Author

Hi James,

Thanks for your reply to my enquiry. Based on my reading of the documentation, I thought that the -d read depth filter was simply ignoring reads once the threshold number of reads had already been assessed at any given start position. However, from your response and the discussion on samtools/htslib#1084 it seems that my interpretation is too simplistic.

I want to assess all of my reads, so I don’t actually want to use this filtering threshold. Is there a way to turn it off in bcftools mpileup? In samtools depth V1.8 I could just use -d 0 to remove this filtering. It seems there is no equivalent in bcftools mpileup. My only option seems to be to determine the maximum number of reads in my sequence file at any given position and pass this to -d. Otherwise just arbitrarily use a very high number like 999999 (does -d have a maximum value that can be passed?).

Thanks for confirming that -d is non-functional in samtools depth V1.12. The output from this utility now makes perfect sense, including detection of overlapping paired end sequences.

I’m still unsure about identification of overlapping paired end sequences in bcftools V1.8 and V1.12 though. When identical runs with -x or without -x are compared for my test data, they give the same number of reads (as though overlap detection in bcftools mpileup is non-functional; see the tables for positions 74, 78 and 100 in the PDF). However, having done the same test with some real data I can see an effect of having overlap detection on (no -x). I would have thought that overlap detection in samtools depth V1.12 and bcftools mpileup V1.12 is achieved in the same way – is this the case? If so, why does samtools depth V1.12 correctly identify the overlaps in my test data when bcftools mpileup V1.12 does not?

Regards,
Claire

@jkbonfield
Copy link
Contributor

The overlap removal may be due to the properly paired flag.

Samtools depth now ignores this - any read listed as being paired in sequencing is a candidate for overlap removal. Previously it had to also be marked as properly paired by the aligner. Aligners can, and do, unset "properly paired" for reads with insert size outside their expected distribution. That means very short inserts with high overlap wouldn't be considered as overlapping. Ie the very case we'd want removing is the case it cannot! I deemed that to be an error so fixed it in my rewrite. I didn't change bcftools or samtools mpileup though. Indeed I didn't even check, as I was simply concentrating on rewriting depth at the time. I don't know how they handle that flag.

The other thing to consider is the mpileup overlap removal uses the base calls and their quality values. It checks if the overlapping sequences matched, and modified quality accordingly (for qual filtering). Overlapping bases get qual 0, meaning enabling overlap removal forcibly sets minimum qual to 1. Hence any "N"s are filtered as "overlapping" whether you like them or not, and whether they are overlapping or not. IIRC, overlapping reads with bases that matched get qual1+qual2, while ones that mismatched have 0.8*MAX(qual1,qual2). It's rather complex, and possibly poorly documented (I haven't checked).

The old depth code used mpileup internally, but it made it very slow, especially on deep data, and also made it very memory hungry on deep data too as it has to hold all reads covering a region in memory. Hence the existance of a maximum depth parameter to the depth code (which always felt a bit bizarre to me). The new code just holds one read in memory at a time plus a table of current read names (for the overlap removal). That also means it cannot check whether the sequences match and so it has no issues with Ns and it doesn't do the raising / lowering of quality values for filtering purposes.

I haven't checked what differences there are between samtools and bcftools mpileup, but I'd guess anything there is in the command line arguments and default values. Both share the same mpileup logic as that part is in the shared htslib library. However as I say it's no longer also shared by samtools depth.

If bcftools doesn't accept -d 0, then you will need to specify a large number. INT_MAX is probably the safest option (try 2 billion, which is close to 2^31 so won't overflow a signed integer).

@pd3
Copy link
Member

pd3 commented Aug 31, 2021

I believe this is now resolved

@pd3 pd3 closed this as completed Aug 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants