Skip to content

Commit

Permalink
Merge pull request #209 from rhpvorderman/issue203
Browse files Browse the repository at this point in the history
Ignore secondary and supplementary alignment records.
  • Loading branch information
rhpvorderman authored Nov 22, 2024
2 parents 7ae8584 + 08b581a commit 11ed7b7
Show file tree
Hide file tree
Showing 6 changed files with 44 additions and 2 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,10 @@ version 0.13.0-dev
------------------
+ Python 3.13 support was added.
+ Python 3.8 is no longer supported.
+ Allow proper judging of aligned BAM files as input data by ignoring any
secondary or supplementary alignment records. This is equivalent to running
``samtools fastq > input.fastq`` on the input data before submitting it to
sequali, except that useful tag information is retained.
+ Only sample the first 100 bp from the beginning and end of the read by
default for the overrepresented sequences analysis. This prevents a lot of
false positives from common human genome repeats. The amount of base pairs
Expand Down
6 changes: 4 additions & 2 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -87,10 +87,12 @@ Supported formats
per tile quality will be provided.
- For sequences called by guppy additional plots for nanopore specific
data will be provided.
- unaligned BAM. Any alignment flags are currently ignored.
- (unaligned) BAM with single reads. Read-pair information is currently ignored.

- For uBAM data as delivered by dorado additional nanopore plots will be
- For BAM data as delivered by dorado additional nanopore plots will be
provided.
- For aligned BAM files, secondary and supplementary reads are ignored
similar to how ``samtools fastq`` handles the data.

.. formats end
Expand Down
20 changes: 20 additions & 0 deletions src/sequali/_qcmodule.c
Original file line number Diff line number Diff line change
Expand Up @@ -1101,6 +1101,21 @@ PyTypeObject FastqParser_Type = {
* BAM PARSER *
* ************/

#define BAM_FPAIRED 1
#define BAM_FPROPER_PAIR 2
#define BAM_FUNMAP 4
#define BAM_FMUNMAP 8
#define BAM_FREVERSE 16
#define BAM_FMREVERSE 32
#define BAM_FREAD1 64
#define BAM_FREAD2 128
#define BAM_FSECONDARY 256
#define BAM_FQCFAIL 512
#define BAM_FDUP 1024
#define BAM_FSUPPLEMENTARY 2048

#define BAM_EXCLUDE_FLAGS (BAM_FSECONDARY | BAM_FSUPPLEMENTARY)

typedef struct _BamParserStruct {
PyObject_HEAD
uint8_t *record_start;
Expand Down Expand Up @@ -1452,6 +1467,11 @@ BamParser__next__(BamParser *self)
if (record_end > buffer_end) {
break;
}
if (header->flag & BAM_EXCLUDE_FLAGS) {
// Skip excluded records such as secondary and supplementary alignments.
record_start = record_end;
continue;
}
uint8_t *bam_name_start =
record_start + sizeof(struct BamRecordHeader);
uint32_t name_length = header->l_read_name;
Expand Down
Binary file added tests/data/test_skip.bam
Binary file not shown.
7 changes: 7 additions & 0 deletions tests/data/test_skip.sam
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
@HD VN:1.6 SO:unsorted
@RG ID:A SM:simple
unmapped 4 * 0 0 * * 0 0 GATTACA * RG:Z:A
secondary 256 * 0 0 * * 0 0 GATTACA * RG:Z:A
supplementary 2048 * 0 0 * * 0 0 GATTACA * RG:Z:A
secondary_and_supplementary 2304 * 0 0 * * 0 0 GATTACA * RG:Z:A
everything_but_secondary_and_supplementary 1663 * 0 0 * * 0 0 GATTACA * RG:Z:A
9 changes: 9 additions & 0 deletions tests/test_bam_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -143,3 +143,12 @@ def test_bam_parser_no_quals():
assert records[0][0].name() == "Myheader"
assert records[0][0].sequence() == "GATTACA"
assert records[0][0].qualities() == "!!!!!!!"


def test_bam_parser_skip_secondary_supplementary():
with xopen.xopen(DATA / "test_skip.bam", "rb") as f:
parser = BamParser(f)
records = list(parser)[0]
assert len(records) == 2
assert records[0].name() == "unmapped"
assert records[1].name() == "everything_but_secondary_and_supplementary"

0 comments on commit 11ed7b7

Please sign in to comment.