compression increases size? #6

Closed
lgarrison opened this issue Feb 11, 2025 · 4 comments

Comments

@lgarrison
Member

I have an int32 dataset where simple_ans compression seems to increase the size rather than decrease it. Is this expected for some datasets, or am I "holding it wrong"?

Here's a minimal reproducer (data is on rusty):

import numpy as np
from simple_ans import ans_encode

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')
print(data.nbytes)
print(data.shape)
print(data.dtype)
print(data.flags)

comp = ans_encode(data)
comp_nbytes = comp.size()
print(f'Input: {data.nbytes / (1 << 20):.1f} MiB')
print(f'Output: {comp_nbytes / (1 << 20):.1f} MiB')
print(f'Compression ratio: {data.nbytes / comp_nbytes:.3g}x')
❯ python issue.py 
40000000
(10000000,)
int32
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

Input: 38.1 MiB
Output: 171.9 MiB
Compression ratio: 0.222x

I did notice that the README has this code:

# Get compression stats
original_size = signal.nbytes
compressed_size = encoded.size()  # in bits
compression_ratio = original_size / compressed_size
print(f"Compression ratio: {compression_ratio:.2f}x")

The comment suggests that the size is measured in bits. However, the rest of that code (and the source of the size() function) suggests it's in bytes. If it really is in bits, though, that could certainly explain my issue.

@magland
Collaborator

magland commented Feb 11, 2025

Oops, I think the "in bits" should be "in bytes", so yeah, I'm curious why it is expanding the size so much.

This will depend on the distribution of values. The main use case is data with fewer than roughly 100 to 1000 distinct values (I should clarify that in the docs). Would you be able to check how many distinct values are in this data?
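For example, something along these lines (just a sketch, reusing the path from your reproducer) should report it:

import numpy as np

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')

# Number of distinct symbols the encoder would have to model.
# Note: np.unique sorts a copy of the data, so this needs some memory headroom.
n_distinct = np.unique(data).size
print(f'{n_distinct} distinct values out of {data.size} samples')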

@lgarrison
Member Author

Oh, in that case it definitely will not perform well on this data! There are 7594235 unique values, which is 75% of the inputs. There are definitely patterns in the input, though: the data are quantized positions and velocities from a 3D N-body simulation. blosc (i.e. byte transpose) + zstd -1 gets about 1.6x compression. The benchmarks in the readme made me curious if simple_ans could do better.
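For what it's worth, here's roughly how that comparison can be reproduced (a sketch, not my exact script; it assumes python-blosc is installed). With this many distinct values the per-symbol entropy is close to the raw 32 bits, and I suspect the encoded symbol table itself accounts for much of the 171.9 MiB output.

import numpy as np
import blosc

data = np.load('/mnt/home/lgarrison/ceph/simple_ans/data.npy')

# Order-0 (per-symbol) entropy: a lower bound for any coder that only models
# symbol frequencies and ignores ordering, not counting table overhead.
_, counts = np.unique(data, return_counts=True)
p = counts / counts.sum()
entropy_bits = float(-(p * np.log2(p)).sum())
print(f'{counts.size} distinct values, {entropy_bits:.2f} bits/symbol')
print(f'Order-0 lower bound: ~{entropy_bits * data.size / 8 / (1 << 20):.1f} MiB')

# Byte shuffle + zstd level 1 via blosc, for comparison
compressed = blosc.compress(data.tobytes(), typesize=data.itemsize,
                            cname='zstd', clevel=1, shuffle=blosc.SHUFFLE)
print(f'blosc+zstd-1 ratio: {data.nbytes / len(compressed):.2f}x')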

@magland
Collaborator

magland commented Feb 12, 2025

I have added the following to the readme:

Important: This implementation is designed for data with approximately 2 to 5000 distinct values. Performance may degrade significantly with datasets containing more unique values.

Let me know if you come across any real datasets that are suitable.

I'm also working on this related project:
https://magland.github.io/benchcompress/
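In the meantime, a caller-side guard along the lines of the new readme note could look like this (just a sketch, not part of the library):

import numpy as np
from simple_ans import ans_encode

def ans_encode_checked(data, max_distinct=5000):
    # Hypothetical helper: refuse to encode data outside the roughly
    # 2 to 5000 distinct-value range the readme now documents.
    n_distinct = np.unique(data).size
    if n_distinct > max_distinct:
        raise ValueError(
            f'{n_distinct} distinct values; simple_ans targets roughly '
            f'2 to {max_distinct}')
    return ans_encode(data)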

@lgarrison
Member Author

Thanks!
