compression increases size? #6

Closed
lgarrison opened this issue Feb 11, 2025 · 4 comments

Comments

@lgarrison
Member

I have an int32 dataset where simple_ans compression seems to increase the size rather than decrease it. Is this expected for some datasets, or am I "holding it wrong"?

Here's a minimal reproducer (data is on rusty):

import numpy as np
from simple_ans import ans_encode

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')
print(data.nbytes)
print(data.shape)
print(data.dtype)
print(data.flags)

comp = ans_encode(data)
comp_nbytes = comp.size()
print(f'Input: {data.nbytes / (1 << 20):.1f} MiB')
print(f'Output: {comp_nbytes / (1 << 20):.1f} MiB')
print(f'Compression ratio: {data.nbytes / comp_nbytes:.3g}x')
❯ python issue.py 
40000000
(10000000,)
int32
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

Input: 38.1 MiB
Output: 171.9 MiB
Compression ratio: 0.222x

I did notice that the README has this code:

# Get compression stats
original_size = signal.nbytes
compressed_size = encoded.size()  # in bits
compression_ratio = original_size / compressed_size
print(f"Compression ratio: {compression_ratio:.2f}x")

The comment suggests that the size is measured in bits. However, the rest of that code (and the source of the size() function) suggests it's in bytes. If it really is in bits, though, that could certainly explain my issue.

@magland
Collaborator

magland commented Feb 11, 2025

Oops, I think the "in bits" should be "in bytes", so yeah, I'm curious why it is expanding the size so much.

This will depend on the distribution of values. The main use case is data with fewer than roughly 100 to 1000 distinct values (I should clarify that in the docs). Would you be able to check how many distinct values are in this data?
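For example, something along these lines (just a sketch, reusing the path from your reproducer) should report it:

import numpy as np

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')

# Number of distinct symbols the encoder would have to model.
# Note: np.unique sorts a copy of the data, so this needs some memory headroom.
n_distinct = np.unique(data).size
print(f'{n_distinct} distinct values out of {data.size} samples')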

@lgarrison
Member Author

Oh, in that case it definitely will not perform well on this data! There are 7594235 unique values, which is 75% of the inputs. There are definitely patterns in the input, though: the data are quantized positions and velocities from a 3D N-body simulation. blosc (i.e. byte transpose) + zstd -1 gets about 1.6x compression. The benchmarks in the readme made me curious if simple_ans could do better.
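For what it's worth, here's roughly how that comparison can be reproduced (a sketch, not my exact script; it assumes python-blosc is installed). With this many distinct values the per-symbol entropy is close to the raw 32 bits, and I suspect the encoded symbol table itself accounts for much of the 171.9 MiB output.

import numpy as np
import blosc

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')

# Order-0 (per-symbol) entropy: a lower bound for any coder that only models
# symbol frequencies and ignores ordering, not counting table overhead.
_, counts = np.unique(data, return_counts=True)
p = counts / counts.sum()
entropy_bits = float(-(p * np.log2(p)).sum())
print(f'{counts.size} distinct values, {entropy_bits:.2f} bits/symbol')
print(f'Order-0 lower bound: ~{entropy_bits * data.size / 8 / (1 << 20):.1f} MiB')

# Byte shuffle + zstd level 1 via blosc, for comparison
compressed = blosc.compress(data.tobytes(), typesize=data.itemsize,
                            cname='zstd', clevel=1, shuffle=blosc.SHUFFLE)
print(f'blosc+zstd-1 ratio: {data.nbytes / len(compressed):.2f}x')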

@magland
Collaborator

magland commented Feb 12, 2025

I have added the following to the readme:

Important: This implementation is designed for data with approximately 2 to 5000 distinct values. Performance may degrade significantly with datasets containing more unique values.

Let me know if you come across any real datasets that are suitable.

I'm also working on this related project:
https://magland.github.io/benchcompress/
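In the meantime, a caller-side guard along the lines of the new readme note could look like this (just a sketch, not part of the library):

import numpy as np
from simple_ans import ans_encode

def ans_encode_checked(data, max_distinct=5000):
    # Hypothetical helper: refuse to encode data outside the roughly
    # 2 to 5000 distinct-value range the readme now documents.
    n_distinct = np.unique(data).size
    if n_distinct > max_distinct:
        raise ValueError(
            f'{n_distinct} distinct values; simple_ans targets roughly '
            f'2 to {max_distinct}')
    return ans_encode(data)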

@lgarrison
Member Author

Thanks!
