Samtools stats differ if the file is sorted by coordinate or queryname #2177

Open
fgvieira opened this issue Jan 27, 2025 · 1 comment · May be fixed by #2198
@fgvieira

Are you using the latest version of samtools and HTSlib? If not, please specify.

samtools 1.21
Using htslib 1.21
Copyright (C) 2024 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -I/opt/software/libdeflate/1.21/
    LDFLAGS:        -L/opt/software/libdeflate/1.21/build
    HTSDIR:         htslib-1.21
    LIBS:           
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes plugins=no htscodecs=1.6.1
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -I/opt/software/libdeflate/1.21/ -fvisibility=hidden
    LDFLAGS:        -L/opt/software/libdeflate/1.21/build -fvisibility=hidden 

HTSlib URL scheme handlers present:
    built-in:	 file, preload, data
    S3 Multipart Upload:	 s3w+https, s3w+http, s3w
    Amazon S3:	 s3+https, s3, s3+http
    Google Cloud Storage:	 gs+http, gs+https, gs
    libcurl:	 smtp, ldaps, smb, rtsp, tftp, pop3, smbs, imaps, pop3s, ftps, https, http, ftp, gopher, imap, sftp, ldap, smtps, dict, scp, telnet
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

Please describe your environment.

  • OS: Linux 4.18.0-553.16.1.el8_10.x86_64
  • machine architecture: x86_64

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

The input file is queryname sorted:

@HD     VN:1.5  SO:queryname    SS:queryname:lexicographical

If I run samtools stats on the above file (below left) or on the same file after sorting by coordinate (below right), I get slightly different outputs:

> diff -y -W 250 queryname.sort.stats coordinate.sorted.txt
[...]
SN      is sorted:      0                                                                                                   |   SN      is sorted:      1
[...]
# Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.                                                      # Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.
COV     [1-1]   1       2095061                                                                                             |   COV     [1-1]   1       36918165
COV     [2-2]   2       544                                                                                                 |   COV     [2-2]   2       6339356
                                                                                                                            >   COV     [3-3]   3       1556677
                                                                                                                            >   COV     [4-4]   4       423176
                                                                                                                            >   COV     [5-5]   5       116074
                                                                                                                            >   COV     [6-6]   6       36544
                                                                                                                            >   COV     [7-7]   7       9756
                                                                                                                            >   COV     [8-8]   8       3961
                                                                                                                            >   COV     [9-9]   9       1559
                                                                                                                            >   COV     [10-10] 10      857
                                                                                                                            >   COV     [11-11] 11      621
                                                                                                                            >   COV     [12-12] 12      420
                                                                                                                            >   COV     [13-13] 13      1107
                                                                                                                            >   COV     [14-14] 14      539
                                                                                                                            >   COV     [15-15] 15      364
                                                                                                                            >   COV     [16-16] 16      95
                                                                                                                            >   COV     [17-17] 17      112
                                                                                                                            >   COV     [18-18] 18      53
                                                                                                                            >   COV     [19-19] 19      28
                                                                                                                            >   COV     [20-20] 20      42
                                                                                                                            >   COV     [21-21] 21      71
                                                                                                                            >   COV     [22-22] 22      77
                                                                                                                            >   COV     [23-23] 23      58
                                                                                                                            >   COV     [24-24] 24      44
                                                                                                                            >   COV     [25-25] 25      27
                                                                                                                            >   COV     [26-26] 26      45
                                                                                                                            >   COV     [27-27] 27      32
                                                                                                                            >   COV     [28-28] 28      45
                                                                                                                            >   COV     [29-29] 29      95
                                                                                                                            >   COV     [30-30] 30      3
                                                                                                                            >   COV     [31-31] 31      11
# GC-depth. Use `grep ^GCD | cut -f 2-` to extract this part. The columns are: GC%, unique sequence percentiles, 10th, 25t      # GC-depth. Use `grep ^GCD | cut -f 2-` to extract this part. The columns are: GC%, unique sequence percentiles, 10th, 25t
GCD     0.0     0.007   0.000   0.000   0.000   0.000   0.000                                                               |   GCD     0.0     0.001   0.000   0.000   0.000   0.000   0.000
GCD     0.5     0.011   0.004   0.004   0.004   0.004   0.004                                                               |   GCD     0.5     0.001   0.004   0.004   0.004   0.004   0.004
GCD     20.0    0.059   0.004   0.004   0.004   0.004   0.004                                                               |   GCD     9.0     0.032   0.004   0.004   0.004   0.004   0.004
GCD     21.0    0.092   0.004   0.004   0.004   0.004   0.004                                                               |   GCD     10.0    0.047   0.004   0.004   0.004   0.004   0.004
GCD     22.0    0.165   0.004   0.004   0.004   0.004   0.004                                                               |   GCD     11.0    0.063   0.004   0.004   0.004   0.004   0.004
[...]
  • If the original file is sorted by queryname, why does it output "is sorted: 0"?
  • From the output, it seems that COV and GCD are only supported if the file is sorted by coordinate; if so, shouldn't they just be omitted from the output? According to the docs: "Not all sections will be reported as some depend on the data being coordinate sorted while others are only present when specific barcode tags are in use."
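
As a stopgap for anyone comparing stats files in the meantime, the COV and GCD rows could be filtered out whenever the file reports "is sorted: 0". This is only a rough sketch: the helper name `drop_coordinate_sections` is hypothetical, it assumes the stats text is tab-separated with the summary row written as `SN`, `is sorted:`, value, and it leaves the `#` comment lines in place.

```python
def drop_coordinate_sections(stats_text: str) -> str:
    """Remove COV and GCD rows when the stats text reports 'is sorted: 0'.

    Assumes tab-separated rows such as 'SN<TAB>is sorted:<TAB>0'.
    Text with no 'is sorted:' row, or with value 1, is returned unchanged.
    """
    lines = stats_text.splitlines()

    # Find the value of the 'is sorted:' summary row, if present.
    sorted_flag = None
    for line in lines:
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN" and fields[1] == "is sorted:":
            sorted_flag = fields[2].strip()

    if sorted_flag != "0":
        return stats_text

    # Drop only the data rows of the coordinate-dependent sections.
    kept = [line for line in lines if not line.startswith(("COV\t", "GCD\t"))]
    trailing_newline = "\n" if stats_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline
```

Applying this to both files before running `diff` would hide the COV and GCD differences shown above, leaving only the "is sorted" row and any other genuine discrepancies.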
@daviesrob daviesrob self-assigned this Jan 30, 2025
@daviesrob
Member

Yes, "is sorted" should really be "is coordinate sorted". Note that the value written comes from inspecting the alignment records to check they're in the expected order; samtools stats does not believe the SO: tag. And you're correct, COV and GCD need the data to be coordinate sorted, so they shouldn't be printed if the alignments were not in the required order.
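
To illustrate what checking the records rather than the header means, here is a minimal sketch of a coordinate-order test. It is not the samtools code: the function name `is_coordinate_sorted` and the representation of each record as a `(ref_index, pos)` pair are assumptions made for the example, as is the convention that unplaced reads (negative `ref_index`) sort after all placed reads.

```python
from typing import Iterable, Optional, Tuple


def is_coordinate_sorted(records: Iterable[Tuple[int, int]]) -> bool:
    """Return True if (ref_index, pos) pairs are in non-decreasing coordinate order.

    The header's SO: tag is never consulted; only the records are inspected.
    """
    previous: Optional[Tuple[float, int]] = None
    for ref_index, pos in records:
        # Assumption: unplaced reads (ref_index < 0) come after all placed reads.
        key = (float("inf"), 0) if ref_index < 0 else (ref_index, pos)
        if previous is not None and key < previous:
            return False  # this record precedes the one before it
        previous = key
    return True
```

A queryname-sorted file will usually fail this test as soon as two consecutive reads map to positions in decreasing order, which is consistent with the "is sorted: 0" reported above even though the header says SO:queryname.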

whitwham added a commit to whitwham/samtools that referenced this issue Mar 11, 2025
Removed COV and GCD entries from name sorted stats results.  Fixes samtools#2177.
@whitwham whitwham linked a pull request Mar 11, 2025 that will close this issue