Skip to content

Latest commit

 

History

History
122 lines (96 loc) · 4.69 KB

README.md

File metadata and controls

122 lines (96 loc) · 4.69 KB

Huffman coding

In summary, its fast, has no dependencies and works with files that don't fit into memory.

Why another implementation you ask?

  • All other Python implementation's of the Huffman coding algorithm are written for educational purposes, whereas this package is actually intended to be used.

  • It can be run as a standalone script because it is a pure Python implementation with no additional dependencies outside the standard library.

  • It works with streams exclusively, in a buffered fashion. Meaning that you can encode and decode files (or streams) that are larger than memory. The total memory consumption peaks at just 70KB.

  • In contrast to other implementations, the encoding step actually encodes the frequency table as well (which the decode step expects to read). This means that the original text does not need to be kept for decoding.

    Let's say you want to share a file over network, then it often improves performance to first compress the file before sending it over the wire. However, you want to be able to decompress/decode the received file without requiring the original (since that would destroy the entire purpose of compressing in the first place). Be sure to check out the client-server example showcasing the above by compressing and sending a file over a socket.

  • The decoding step is significantly faster (4x) by using an FSM so it can operate on an entire byte at once, instead of decoding bit-by-bit.

    This is not a novel idea though, although simply not implemented by other pacakges. Basically, I noticed that the RFC for HPACK: Header Compression for HTTP/2 included a static Huffman code which was being decoded using an FSM in Python's HPACK implementation which essentially "unrolled" the FSM by operating on a byte-level instead of bit-level. I simply had to generate a FSM dynamically based on the Huffman code (see _get_fsm_decoder()) to be able to apply the same principle to non-static Huffman codes.

Installation

git clone https://github.com/yannickperrenet/huffman-coding && cd huffman_coding
pip install .
# Or for poetry users
# poetry install

Usage

Let's use Hamlet as a test text.

# Make sure you are in the root huffman_coding directory.
wget https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc1d6429ef0167e709abf9b/hamlet.txt -O hamlet.txt

CLI

# Encode a given file and write to an output file.
huffman_coding encode --input="hamlet.txt" --output="hamlet.raw"
# Decode from file and write to STDOUT.
huffman_coding decode --input="hamlet.raw"
# Encode file and decode immediately again.
huffman_coding encode --input="hamlet.txt" | ./huffman_coding.py decode

Python

Files

# NOTE: Make sure to pass `newline=""` to `open()` to prevent newline
# translation since that messes with encoding/decoding. Read more here:
# https://docs.python.org/3/library/io.html#io.TextIOWrapper
# NOTE: Disable buffering because `encode()` and `decode()` already
# work in a buffered fashion.
with (
    open("hamlet.txt", mode="r", newline="") as f_in,
    open("hamlet.raw", mode="wb", buffering=0) as f_out
):
    encode(f_in=f_in, f_out=f_out)

with (
    open("hamlet.raw", mode="rb", buffering=0) as f_in,
    open("hamlet.txt", mode="w", newline="") as f_out
):
    decode(f_in=f_in, f_out=f_out)

I/O streams

import io

with open("hamlet.txt", mode="r", newline="") as f:
    text = f.read()

text_stream = io.StringIO()
text_stream.write(text)
# Seek start of stream, otherwise successive reads won't return
# anything.
text_stream.seek(0)

byte_encoding = io.BytesIO()
encode(f_in=text_stream, f_out=byte_encoding, buffering=buffering)
byte_encoding.seek(0)

decoded_text = io.StringIO()
decode(f_in=byte_encoding, f_out=decoded_text)
decoded_text.seek(0)

Running tests

Simply run python3 tests/test_huffman_coding.py or invoke unittest through its CLI python3 -m unittest -vv (make sure that the Python you are running has the package installed, i.e. pip install . in the root directory of this project).

Resources