-
Notifications
You must be signed in to change notification settings - Fork 443
/
bgzip.1
206 lines (190 loc) · 6.49 KB
/
bgzip.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
.TH bgzip 1 "12 September 2024" "htslib-1.21" "Bioinformatics tools"
.SH NAME
.PP
bgzip \- Block compression/decompression utility
.\"
.\" Copyright (C) 2009-2011 Broad Institute.
.\" Copyright (C) 2018, 2021-2024 Genome Research Limited.
.\"
.\" Author: Heng Li <[email protected]>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.
.\" For code blocks and examples (cf groff's Ultrix-specific man macros)
.de EX
. in +\\$1
. nf
. ft CR
..
.de EE
. ft
. fi
. in
..
.SH SYNOPSIS
.PP
.B bgzip
.RB [ -cdfhikrt ]
.RB [ -b
.IR virtualOffset ]
.RB [ -I
.IR index_name ]
.RB [ -l
.IR compression_level ]
.RB [ -o
.IR outfile ]
.RB [ -s
.IR size ]
.RB [ -@
.IR threads ]
.RI [ file " ...]"
.PP
.SH DESCRIPTION
.PP
Bgzip compresses files in a similar manner to, and compatible with, gzip(1).
The file is compressed into a series of small (less than 64K) 'BGZF' blocks.
This allows indexes to be built against the compressed file and used to
retrieve portions of the data without having to decompress the entire file.
If no files are specified on the command line, bgzip will compress (or
decompress if the -d option is used) standard input to standard output.
If a file is specified, it will be compressed (or decompressed with -d).
If the -c option is used, the result will be written to standard output,
otherwise when compressing bgzip will write to a new file with a .gz
suffix and remove the original. When decompressing the input file must
have a .gz suffix, which will be removed to make the output name. Again
after decompression completes the input file will be removed. When multiple
files are given as input, the operation is performed on all of them. Access
and modification time of input file from filesystem is set to output file.
Note, access time may get updated by system when it deems appropriate.
.SH OPTIONS
.TP 10
.B "--binary"
Bgzip will attempt to ensure BGZF blocks end on a newline when the
input is a text file. The exception to this is where a single line is
larger than a BGZF block (64Kb). This can aid tools that use the
index to perform random access on the compressed stream, as the start
of a block is likely to also be the start of a text record.
This option processes text files as if they were binary content,
ignoring the location of newlines. This also restores the behaviour
for text files to bgzip version 1.15 and earlier.
.TP
.BI "-b, --offset " INT
Decompress to standard output from virtual file position (0-based uncompressed
offset).
Implies -c and -d.
.TP
.B "-c, --stdout"
Write to standard output, keep original files unchanged.
.TP
.B "-d, --decompress"
Decompress.
.TP
.B "-f, --force"
Overwrite files without asking, or decompress files that don't have a known
compression filename extension (e.g., \fI.gz\fR) without asking.
Use \fB--force\fR twice to do both without asking.
.TP
.B "-g, --rebgzip"
Try to use an existing index to create a compressed file with matching
block offsets. The index must be specified using the \fB-I
\fIfile.gzi\fR option.
Note that this assumes that the same compression library and level are in use
as when making the original file.
Don't use it unless you know what you're doing.
.TP
.B "-h, --help"
Displays a help message.
.TP
.B "-i, --index"
Create a BGZF index while compressing.
Unless the -I option is used, this will have the name of the compressed
file with .gzi appended to it.
.TP
.BI "-I, --index-name " FILE
Index file name.
.TP
.B "-k, --keep"
Do not delete input file during operation.
.TP
.BI "-l, --compress-level " INT
Compression level to use when compressing.
From 0 to 9, or -1 for the default level set by the compression library. [-1]
.TP
.BI "-o, --output " FILE
Write to a file, keep original files unchanged, will overwrite an existing
file.
.TP
.B "-r, --reindex"
Rebuild the index on an existing compressed file.
.TP
.BI "-s, --size " INT
Decompress INT bytes (uncompressed size) to standard output.
Implies -c.
.TP
.B "-t, --test"
Test the integrity of the compressed file.
.TP
.BI "-@, --threads " INT
Number of threads to use [1].
.PP
.SH BGZF FORMAT
The BGZF format written by bgzip is described in the SAM format specification
available from http://samtools.github.io/hts-specs/SAMv1.pdf.
It makes use of a gzip feature which allows compressed files to be
concatenated.
The input data is divided into blocks which are no larger than 64 kilobytes
both before and after compression (including compression headers).
Each block is compressed into a gzip file.
The gzip header includes an extra sub-field with identifier 'BC' and the length
of the compressed block, including all headers.
.SH GZI FORMAT
The index format is a binary file listing pairs of compressed and
uncompressed offsets in a BGZF file.
Each compressed offset points to the start of a BGZF block.
The uncompressed offset is the corresponding location in the uncompressed
data stream.
All values are stored as little-endian 64-bit unsigned integers.
The file contents are:
.EX 4
uint64_t number_entries
.EE
followed by number_entries pairs of:
.EX 4
uint64_t compressed_offset
uint64_t uncompressed_offset
.EE
.SH EXAMPLES
.EX 4
# Compress stdin to stdout
bgzip < /usr/share/dict/words > /tmp/words.gz
# Make a .gzi index
bgzip -r /tmp/words.gz
# Extract part of the data using the index
bgzip -b 367635 -s 4 /tmp/words.gz
# Uncompress the whole file, removing the compressed copy
bgzip -d /tmp/words.gz
.EE
.SH AUTHOR
.PP
The BGZF library was originally implemented by Bob Handsaker and modified
by Heng Li for remote file access and in-memory caching.
.SH SEE ALSO
.IR gzip (1),
.IR tabix (1)