Main page

flanglet edited this page Sep 1, 2024 · 15 revisions

Quick Start

Run "Kanzi -h" to see the general help page, and "Kanzi -c -h" or "Kanzi -d -h" to see the help pages for compression and decompression respectively.

The easiest way to compress is to use one of the existing pre-defined levels (0 to 9) using "-l".

The input file or directory must be provided with "-i".

The output file or directory can be provided with "-o".

Use "-b" to specify a block size (if omitted the default block size is selected).

Compression or decompression can be run concurrently by providing the number of jobs with "-j".

Instead of a specific compression level, one can specify the transforms and entropy codec to use with "-t" and "-e".

See the help pages for the list of transforms and entropy codecs available.

The order of the parameters on the command line does not matter.

Examples:

Kanzi -c -i foo.txt -o none -b 4m -l 4 -v 3

Do a dry-run compression of foo.txt (output discarded) using 4 MB blocks and compression level 4, with verbosity set to 3.

Kanzi -c -i foo.txt -f -t BWT+MTFT+ZRLT -b 4m -e FPAQ -j 4

Compress foo.txt to foo.txt.knz and overwrite the output file if it already exists. Use the Burrows-Wheeler, Move-to-front and Zero Run Length transforms followed by the FPAQ entropy codec. Use 4 jobs to compress the file.

Kanzi -d -i foo.knz -f -j 2

Decompress the file foo.knz to foo.knz.bak and force an overwrite. Use 2 jobs to decompress the file.

Package hierarchy

kanzi

  • app -> [io, transform, internal]
  • benchmark -> [entropy, transform, bitstream, hash, internal]
  • bitstream -> [util, internal]
  • entropy -> [internal, bitstream, util]
  • internal -> []
  • io -> [entropy, transform, bitstream, hash, internal]
  • transform -> [entropy, bitstream, internal]
  • hash -> []

Description:

  • kanzi: definitions of common classes and interfaces
  • app: contains the main applications (block compressor and decompressor)
  • benchmark: benchmarking code
  • bitstream: utilities to manipulate a stream of data at the bit level
  • entropy: implementation of several common entropy codecs
  • internal: non-public utilities
  • io: implementation of io.Reader and io.Writer with block codec (de)compression
  • transform: implementation of common data transforms
  • hash: implementation of the xxHash32 and other utility functions

Block compressor

Usage

Kanzi -h

Kanzi 2.3.0 (c) Frederic Langlet

Credits: Matt Mahoney, Yann Collet, Jan Ondrus, Yuta Mori, Ilya Muravyov,
         Neal Burns, Fabian Giesen, Jarek Duda, Ilya Grebnov

   -h, --help
        Display this message

   -c, --compress
        Compress mode

   -d, --decompress
        Decompress mode

   -i, --input=<inputName>
        Mandatory name of the input file or directory or 'stdin'
        When the source is a directory, all files in it will be processed.
        Provide /. at the end of the directory name to avoid recursion.
        (EG: myDir/. => no recursion)

   -o, --output=<outputName>
        Optional name of the output file or 'none' or 'stdout'.

   -j, --jobs=<jobs>
        Maximum number of jobs the program may start concurrently
        If 0 is provided, use all available cores (maximum is 64).
        (default is half of available cores).

   -v, --verbose=<level>
        Set the verbosity level [0..5]
        0=silent, 1=default, 2=display details, 3=display configuration,
        4=display block size and timings, 5=display extra information
        Verbosity is reduced to 1 when files are processed concurrently
        Verbosity is reduced to 0 when the output is 'stdout'

   -f, --force
        Overwrite the output file if it already exists

   --rm
        Remove the input file after successful (de)compression.
        If the input is a folder, all processed files under the folder are removed.

   --no-link
        Skip links

   --no-dot-file
        Skip dot files

Type "Kanzi -c -h" or "Kanzi -d -h" to get help specific to compression or decompression.

Available transforms

  • BWT: Burrows Wheeler Transform (https://en.wikipedia.org/wiki/Burrows-Wheeler_transform) is a transform that reorders symbols in a (reversible) way that is more amenable to entropy coding. This implementation uses a linear time forward transform and a parallel inverse transform.
  • BWTS: Burrows Wheeler Transform by Scott is a bijective variant of the BWT (slower).
  • LZX: Lempel Ziv Extra. A Lempel Ziv variant with a bigger hash table and/or extra match checks.
  • ZRLT: Zero Run Length Transform. Similar to RLT but only processes runs of 0. Usually used post BWT.
  • RANK: Rank Transform: A transform that reduces entropy by assigning shorter symbols based on symbol frequency ranks. Usually used post BWT.
  • EXE: A transform that reduces the entropy of executable files (X86 & ARM64) by replacing relative jump addresses with absolute ones.
  • TEXT: A text transform that uses a dictionary to replace common words with their index.
  • ROLZ: Reduced Offset Lempel Ziv. An implementation of LZ that replaces match offsets with indexes of match offsets. It yields a more compact output at the cost of an extra indirection (hence slower decoding speeds).
  • ROLZX: Extended ROLZ with more match searches and a more compact encoding.
  • SRT: Sorted Rank Transform: A transform that reduces entropy by assigning shorter symbols based on symbol frequency ranks. Usually used post BWT.
  • LZP: Lempel Ziv Prediction can be described as an LZ implementation with only one possible match (hence the offset is not even emitted). It is pretty weak (but fast) and usually used prior to a BWT to remove some redundancy and speed up coding.
  • MM: Multimedia transform is a fast transform that removes redundancy in correlated channels in some multimedia files (EG. wav, pnm).
  • UTF: a fast transform replacing UTF-8 codewords with aliases based on frequencies.
  • PACK: a fast transform replacing unused symbols with aliases based on frequencies.

Entropy codecs

Several entropy codecs have been implemented (sorted by increasing compression):

  • Huffman: The codec is a fast canonical Huffman implementation. Both encoder and decoder use tables to compute code lengths and values instead of a tree (for speed).
  • RANGE: A fast implementation that uses pre-computed block statistics.
  • ANS: Based on Range Asymmetric Numeral Systems by Jarek Duda (specifically an implementation by Fabian Giesen). Works in a similar fashion to the Range encoder but uses only 1 state (instead of bottom and range) and does encoding in reverse byte order.
  • FPAQ: A binary arithmetic codec based on FPAQ1 by Matt Mahoney. Uses a simple, adaptive order 0 predictor based on frequencies. Fast and compact code.
  • CM: A binary arithmetic codec derived from BCM by Ilya Muravyov. Uses context mixing of counters to generate a prediction. Efficient and decently fast.
  • TPAQ: A binary arithmetic codec based initially on Tangelo 2.4 (itself derived from FPAQ8). Uses context mixing of predictions produced by one layer neural networks. The initial code has been heavily tuned to improve compression ratio and speed. Slowest but usually best compression ratio.

Compression levels

  • 0 = NONE&NONE (store)
  • 1 = PACK+LZ&NONE
  • 2 = PACK+LZ&HUFFMAN
  • 3 = TEXT+UTF+PACK+MM+LZX&HUFFMAN
  • 4 = TEXT+UTF+EXE+PACK+MM+ROLZ&NONE
  • 5 = TEXT+UTF+BWT+RANK+ZRLT&ANS0
  • 6 = TEXT+UTF+BWT+SRT+ZRLT&FPAQ
  • 7 = LZP+TEXT+UTF+BWT+LZP&CM
  • 8 = EXE+RLT+TEXT+UTF&TPAQ
  • 9 = EXE+RLT+TEXT+UTF&TPAQX

Bitstream format

stream header (once at the beginning of the stream)

bits         role                   value         comment
32           magic number           0x4B414E5A    'KANZ'
4            bitstream version      5
1            block checksum ?       0 or 1
5            entropy                see table     one among 32
48           transforms (8*6 bits)  see table     up to 8 among 64
28           block size / 16                      in bytes
2            bits for input size    0 to 3        not provided or >= 2^48: 0, <2^16: 1, <2^32: 2, <2^48: 3
0/16/32/48   input size (*)         0 to 2^48-1   checked by the decompressor
16           header checksum        0 to 65535    checked by the decompressor

(*) If the input size is not available to the compressor, this field is skipped.
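The fixed-width fields in the table above can be read with a small MSB-first bit reader. Below is a minimal sketch, not the library's API: the bit order and the hand-built sample header bytes are assumptions for illustration.

```go
package main

import "fmt"

// bitReader extracts MSB-first bit fields from a byte slice.
type bitReader struct {
	data []byte
	pos  int // current bit position
}

// read returns the next n bits as an unsigned integer.
func (r *bitReader) read(n int) uint64 {
	var v uint64
	for i := 0; i < n; i++ {
		v = (v << 1) | uint64((r.data[r.pos>>3]>>(7-(r.pos&7)))&1)
		r.pos++
	}
	return v
}

func main() {
	// Hand-built sample header: magic 'KANZ', version 5, remaining fields zero.
	hdr := make([]byte, 16)
	copy(hdr, []byte{0x4B, 0x41, 0x4E, 0x5A, 0x50})
	r := &bitReader{data: hdr}
	magic := r.read(32)          // magic number, expected 0x4B414E5A
	version := r.read(4)         // bitstream version
	hasChecksum := r.read(1)     // block checksum flag
	entropy := r.read(5)         // entropy codec id (see table below)
	transforms := r.read(48)     // 8 transform ids, 6 bits each
	blockSize := r.read(28) * 16 // stored as block size divided by 16
	fmt.Printf("magic=%X version=%d checksum=%d entropy=%d transforms=%012X blockSize=%d\n",
		magic, version, hasChecksum, entropy, transforms, blockSize)
}
```

Note that the 28-bit field stores the block size divided by 16, so a reader multiplies it back.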

entropy:

  • NONE_TYPE = 0; No compression
  • HUFFMAN_TYPE = 1; Huffman
  • FPAQ_TYPE = 2; Fast PAQ (order 0)
  • PAQ_TYPE = 3; PAQ (Obsolete, not supported)
  • RANGE_TYPE = 4; Range
  • ANS0_TYPE = 5; Asymmetric Numerical System order 0
  • CM_TYPE = 6; Context Model
  • TPAQ_TYPE = 7; Tangelo PAQ
  • ANS1_TYPE = 8; Asymmetric Numerical System order 1
  • TPAQX_TYPE = 9; Tangelo PAQ Extra

transform:

  • NONE_TYPE = 0; Copy
  • BWT_TYPE = 1; Burrows Wheeler
  • BWTS_TYPE = 2; Burrows Wheeler Scott
  • LZ_TYPE = 3; Lempel Ziv
  • SNAPPY_TYPE = 4; Snappy (obsolete, not supported)
  • RLT_TYPE = 5; Run Length
  • ZRLT_TYPE = 6; Zero Run Length
  • MTFT_TYPE = 7; Move To Front
  • RANK_TYPE = 8; Rank
  • EXE_TYPE = 9; Codec for executables
  • DICT_TYPE = 10; Text codec
  • ROLZ_TYPE = 11; ROLZ codec
  • ROLZX_TYPE = 12; ROLZ Extra codec
  • SRT_TYPE = 13; Sorted Rank
  • LZP_TYPE = 14; Lempel Ziv Predict
  • MM_TYPE = 15; Multimedia (FSD) codec
  • LZX_TYPE = 16; Lempel Ziv Extra
  • UTF_TYPE = 17; UTF codec
  • PACK_TYPE = 18; Pack codec

block header (at the beginning of each block)

bits       optional?   role           comment
5          no          log(cbs) - 3   block size <= 1 GB
log(cbs)   no          cbs            compressed block size; if 0, last block
8*cbs      no          block data     if cbs = 0, no data

block data

bits    optional?   role                          comment
8       no          block mode                    see format below
8       yes         block transform skip flags    present only if more than 4 transforms
8*ds    no          data                          ds = 1 + ((mode>>5) & 0x03)
32      yes         block checksum                XXHash32 checksum
// mode | 0b10000000 => copy block
//      | 0b0yy00000 => size(size(block))-1
//      | 0b000y0000 => 1 if more than 4 transforms
//  case 4 transforms or less
//      | 0b0000yyyy => transform sequence skip flags (1 means skip)
//  case more than 4 transforms, read next byte
//      | 0byyyyyyyy => transform sequence skip flags (1 means skip)
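The mode byte described in the comments above can be unpacked with plain bit masks. A minimal sketch; the function name is made up for illustration:

```go
package main

import "fmt"

// parseMode unpacks the block mode byte.
func parseMode(mode byte) (copyBlock bool, ds int, extraFlags bool, skip byte) {
	copyBlock = mode&0x80 != 0   // 0b10000000 => copy block
	ds = 1 + int((mode>>5)&0x03) // 0b0yy00000 => size(size(block)) - 1
	extraFlags = mode&0x10 != 0  // 0b000y0000 => more than 4 transforms
	skip = mode & 0x0F           // 0b0000yyyy => skip flags (4 transforms or less);
	return                       // with extraFlags set, read them from the next byte instead
}

func main() {
	copyBlock, ds, extra, skip := parseMode(0x43) // 0b01000011
	fmt.Println(copyBlock, ds, extra, skip)       // false 3 false 3
}
```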

To decode a block

  • Decode the block header
  • Read the block data
  • Entropy decode the data
  • Run every inverse transform sequentially on the decoded data (except those whose skip flag is set to 1)
  • If a checksum was present, compute the checksum of the decompressed data and compare it with the checksum provided in the bitstream