
Difference in behavior between 2.x and 3.x using identical compressor settings #2766

Open
y4n9squared opened this issue Jan 25, 2025 · 25 comments
Labels
bug: Potential issues with the zarr-python library
performance: Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@y4n9squared

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.13

Operating System

Linux

Installation

Using pip into virtual environment

Description

Writing the same data using identical compressor settings in Zarr-Python 2.x and 3.x yields differences in compression results.

Using zarr==2.18.4, numcodecs==0.13.1, this produces two chunks each of size 1.7 KiB on disk. The metadata is:

{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            64,
            64,
            2
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/blosc/1.0",
        "configuration": {
            "blocksize": 0,
            "clevel": 1,
            "cname": "lz4",
            "shuffle": 1
        }
    },
    "data_type": "<f8",
    "extensions": [],
    "fill_value": 0.0,
    "shape": [
        64,
        64,
        4
    ]
}

Using zarr==3.0.1, numcodecs==0.15.0, this produces two chunks of size 30 KiB.

{
  "shape": [
    64,
    64,
    4
  ],
  "data_type": "float64",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        64,
        64,
        2
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": 0.0,
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "blosc",
      "configuration": {
        "typesize": 8,
        "cname": "lz4",
        "clevel": 1,
        "shuffle": "shuffle",
        "blocksize": 0
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}

Steps to reproduce

MVCE

Run the following code in a 2.x and a 3.x environment and inspect the contents of /tmp/foo.zarr.

import numpy as np

import zarr

store = "/tmp/foo.zarr"
shape = (64, 64, 4)
chunks = (64, 64, 2)
dtype = np.float64

cname = "lz4"
clevel = 1

if zarr.__version__[0] == "2":
    import numcodecs.blosc

    compressor = numcodecs.Blosc(
        cname=cname,
        clevel=clevel,
        shuffle=numcodecs.blosc.SHUFFLE,
    )

    za = zarr.open(
        store,
        mode="w",
        zarr_version=3,
        shape=shape,
        chunks=chunks,
        dtype=dtype,
        compressor=compressor,
    )
else:
    import zarr.codecs

    compressors = zarr.codecs.BloscCodec(
        cname=cname,
        clevel=clevel,
        shuffle=zarr.codecs.BloscShuffle.shuffle,
    )

    za = zarr.create_array(
        store,
        shape=shape,
        chunks=chunks,
        dtype=dtype,
        zarr_format=3,
        compressors=compressors,
    )

arr = np.arange(np.prod(shape)).reshape(shape)
za[:] = arr
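
To inspect the result, a minimal sketch of listing on-disk chunk sizes (the store path matches the snippet above):

import pathlib

# List every file in the store with its size; the chunk files show the
# compression difference directly.
for path in sorted(pathlib.Path("/tmp/foo.zarr").rglob("*")):
    if path.is_file():
        print(path, path.stat().st_size, "bytes")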

Additional output

No response

y4n9squared added the bug label Jan 25, 2025
@d-v-b
Contributor

d-v-b commented Jan 25, 2025

weird, since the codec config used by blosc is identical in both cases:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==3.0.1",
# ]
# ///

import numpy as np

import zarr

cname = "lz4"
clevel = 1

import numcodecs.blosc

blosc_numcodecs = numcodecs.Blosc(
    cname=cname,
    clevel=clevel,
    shuffle=numcodecs.blosc.SHUFFLE,
)

blosc_zarr = zarr.codecs.BloscCodec(
    cname=cname,
    clevel=clevel,
    shuffle=zarr.codecs.BloscShuffle.shuffle,
)
assert blosc_numcodecs.get_config() == blosc_zarr._blosc_codec.get_config()
print(blosc_numcodecs.get_config())
# {'id': 'blosc', 'cname': 'lz4', 'clevel': 1, 'shuffle': 1, 'blocksize': 0}

@y4n9squared
Author

y4n9squared commented Jan 25, 2025

Yes, I found that mysterious as well. Are the implementations/underlying libraries also the same?

I wonder if the "auto block size" is resolving to be different values. That could materially impact the compression ratio.

If you don't already know off-hand, the block size should be encoded in the Blosc frame header, so I can report back when I have some time to dig.
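
For reference, a minimal sketch of reading those header fields (byte layout per the c-blosc docs; the chunk path is an example, adjust to your store layout):

import struct

# The first 16 bytes of a blosc frame: version, versionlz, flags, typesize,
# then three little-endian uint32s: nbytes, blocksize, cbytes.
with open("/tmp/foo.zarr/c/0/0/0", "rb") as f:
    header = f.read(16)

version, versionlz, flags, typesize = header[:4]
nbytes, blocksize, cbytes = struct.unpack("<III", header[4:16])
print(f"typesize={typesize}, nbytes={nbytes}, blocksize={blocksize}, cbytes={cbytes}")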

@dstansby
Contributor

Could you try the zarr-python 2 example with numcodecs 0.15.0, to help isolate if it's a change in numcodecs or zarr-python?

@y4n9squared
Author

Could you try the zarr-python 2 example with numcodecs 0.15.0, to help isolate if it's a change in numcodecs or zarr-python?

I am getting the same behavior after upgrading the 2.x environment to numcodecs==0.15.0. The compressed chunk using

compressor = Blosc(cname="lz4", clevel=1, shuffle=SHUFFLE)

is still 1.7 KiB (i.e. no change compared to numcodecs==0.13).

@y4n9squared
Author

Here's the hex dump of the Blosc header for the first chunk in both:

❯ hexdump -C /tmp/foo.zarr/data/root/c0/0/1 | head -n1
00000000  02 01 21 08 00 00 01 00  00 00 01 00 b0 06 00 00  |..!.............|

❯ hexdump -C /tmp/bar.zarr/c/0/0/1 | head -n1
00000000  02 01 21 01 00 00 01 00  00 00 01 00 7e 80 00 00  |..!.........~...|

Based on the byte layout in the documentation:

[Image: Blosc header byte layout diagram]

It seems like there is a difference in what is encoded as the "type size" (fourth byte). In Zarr-Python 2.x, this value is 8 bytes (seems correct given that we are encoding np.float64). In Zarr-Python 3.x, the value is 1 byte.

The rest of the header values seem to check out:

Byte 1: Blosc header version 2
Byte 2: Typically 1
Byte 3: Flags (0x21 means bits 0 and 5 are set --> byte shuffle + lz4)
Bytes 5-8: 64 KiB (64 * 64 * 2 * 8 bytes, our chunk size)
Bytes 9-12: 64 KiB block size
Bytes 13-16: Compressed size, little-endian (0x000006b0 == 1.7 KiB and 0x0000807e == 32 KiB)

TL;DR: the 4th byte in the Blosc header (type size) differs between 2.x and 3.x, causing the difference in compression ratio.
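
The typesize effect is easy to reproduce with numcodecs alone (a sketch; exact sizes will vary with the data):

import numpy as np
import numcodecs
import numcodecs.blosc

# Same bytes, two typesize hints: numcodecs derives blosc's typesize from the
# input array's itemsize, so the uint8 view compresses far worse here.
codec = numcodecs.Blosc(cname="lz4", clevel=1, shuffle=numcodecs.blosc.SHUFFLE)
data = np.arange(64 * 64 * 2, dtype=np.float64)
print(len(codec.encode(data)))                 # typesize 8
print(len(codec.encode(data.view(np.uint8))))  # typesize 1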

@dstansby
Contributor

The type size in zarr-python v3 comes from this line:

self._blosc_codec.encode(chunk.as_numpy_array())

My guess is as_numpy_array() will always return an array with an 8-bit data type (in this case, it ends up as int8)?

@dstansby
Contributor

dstansby commented Jan 26, 2025

So it seems like zarr-python 3 is doing a dance where it goes array (w/ 64-bit dtype) > buffer > array (w/ 8-bit dtype). Because blosc is getting an 8-bit array, it's choosing a blocksize of 8-bits, which is not an efficient way to compress data that was originally 64-bit. So to fix this, we need a way to pass a 64-bit array to numcodecs/blosc instead.

That's about the limit of my debugging - someone who knows more about how/why codecs were implemented for zarr-python 3 might need to help. I took a look at the blame, and it seems like @normanrz implemented this in #1588.
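
A numpy-only illustration of that round trip (not zarr's actual code path):

import numpy as np

# Serializing a float64 chunk to raw bytes and rewrapping it as an array
# loses the 8-byte itemsize that blosc would otherwise use as its typesize.
chunk = np.arange(64 * 64 * 2, dtype=np.float64)
raw = chunk.tobytes()                          # array -> buffer
rewrapped = np.frombuffer(raw, dtype=np.int8)  # buffer -> array, 1-byte dtype
print(chunk.dtype.itemsize, rewrapped.dtype.itemsize)  # 8 vs 1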

@y4n9squared
Author

y4n9squared commented Jan 26, 2025

Yes, confirmed. The value of 1 byte for the type size is being determined here because the source buffer dtype is int8.

it's choosing a blocksize of 8-bits, which is not an efficient way to compress data that was originally 64-bit

I don't think it's the block size that's 8 bits, right? The block size, which we let the codec determine, is 64 KiB in both cases. I think the type size being 8 bits vs. 64 bits determines what the shuffle operates on, and since I'm encoding low-valued integers, a shuffle looking at 8 bits at a time will be mostly ineffective since only a few mantissa bits are set.
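
To make that concrete, a toy numpy version of the byte shuffle (a sketch, not blosc's implementation):

import numpy as np

# For typesize T, the shuffle groups the i-th byte of every element together.
# With T == 1 it is a no-op, so telling blosc the data is int8 makes the
# shuffle useless.
def byte_shuffle(arr: np.ndarray) -> bytes:
    return arr.view(np.uint8).reshape(-1, arr.dtype.itemsize).T.tobytes()

vals = np.arange(16, dtype=np.float64)  # low-valued integers as float64
print(byte_shuffle(vals).count(0))      # the zero bytes now sit in long runs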

@dstansby
Contributor

dstansby commented Jan 26, 2025

That sounds right. So to fix this, we need a way of giving a 64-bit dtype (/whatever the dtype of the input data happens to be) array to numcodecs.

@d-v-b
Contributor

d-v-b commented Jan 26, 2025

if we fix this, will it break data that people already saved?

@dstansby
Contributor

I don't think so - we're still saving valid "blosc data", it's just not very well compressed.

dstansby added the performance label Jan 26, 2025
@normanrz
Member

That sounds right. So to fix this, we need a way of giving a 64-bit dtype (/whatever the dtype of the input data happens to be) array to numcodecs.

Or modify numcodecs to accept an itemsize (or typesize) argument.
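
In the meantime, a workaround sketch with the current numcodecs API (not zarr's code): reinterpret the raw buffer with the original dtype before encoding, so blosc derives typesize from the itemsize:

import numpy as np
import numcodecs
import numcodecs.blosc

# Viewing the 1-byte buffer as float64 makes numcodecs pass typesize=8.
codec = numcodecs.Blosc(cname="lz4", clevel=1, shuffle=numcodecs.blosc.SHUFFLE)
raw = np.arange(64 * 64 * 2, dtype=np.float64).tobytes()
buf_i8 = np.frombuffer(raw, dtype=np.int8)         # what zarr 3 currently passes
print(len(codec.encode(buf_i8.view(np.float64))))  # typesize 8 instead of 1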

@dstansby
Contributor

Is there a reason BloscCodec isn't (or couldn't be) an array --> bytes codec instead? Would that be a way of passing the original array instead of doing the array --> bytes --> array --> numcodecs dance?

@d-v-b
Contributor

d-v-b commented Jan 27, 2025

it's not clear that the typesize parameter should always be set to the size of the dtype of the incoming array. For example, variable-length strings don't have a fixed dtype size, but it would still make sense to pick a good typesize parameter for the blosc compressor. IMO typesize is just a parameter of the compressor like any other, it just happens that in many cases it should default to the size of the dtype of an ndimensional array (if that array has a fixed-size dtype).

@normanrz
Member

Is there a reason BloscCodec isn't (or couldn't be) an array --> bytes codec instead? Would that be a way of passing the original array instead of doing the array --> bytes --> array --> numcodecs dance?

That would be a spec change, which I am not enthusiastic about.

@dstansby
Contributor

Given

Zarr implementations MAY allow users to leave this unspecified and have the implementation choose a value automatically based on the array data type and previous codecs in the chain, but MUST record in the metadata the value that is chosen.

It seems like the fix is for zarr-python to intelligently choose a typesize depending on the array data type then?

@d-v-b
Contributor

d-v-b commented Jan 27, 2025

The default should be auto, but I think it should also be exposed as a parameter so users can control it.

@dstansby
Contributor

I don't disagree, but if we defer to blosc to set blocksize we need to pass an array that has the same data type as the original array before it was converted to bytes by zarr-python, and it's not clear to me how to do that?

@normanrz
Member

I think zarr-python should set that default and pass it on to numcodecs. But numcodecs needs to have an itemsize arg to actually have that set. Btw, itemsize/typesize is not the same as blocksize.

@dstansby
Contributor

👍 - just to understand how that would work for a user: they would now have to explicitly set the itemsize on the codec, and if they don't (e.g., specifying the blosc codec the same way as in zarr-python 2), it will potentially result in very bad compression (as in the reproducer at the top of this issue)?

@normanrz
Member

Actually, there is already code for auto-detecting the itemsize: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/codecs/blosc.py#L141-L142 It just doesn't get passed on to numcodecs.
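
Paraphrasing those lines rather than quoting them (a rough sketch, not zarr's verbatim source): an unset typesize is filled in from the dtype's itemsize when the codec is resolved against the array:

import dataclasses
import numpy as np

@dataclasses.dataclass(frozen=True)
class BloscConfig:  # hypothetical stand-in for the codec dataclass
    typesize: int | None = None

def evolve(cfg: BloscConfig, dtype: np.dtype) -> BloscConfig:
    # assumption: fall back to the dtype's itemsize when typesize is unset
    if cfg.typesize is None:
        cfg = dataclasses.replace(cfg, typesize=dtype.itemsize)
    return cfg

print(evolve(BloscConfig(), np.dtype("float64")).typesize)  # 8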

@dstansby
Contributor

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

    {
      "name": "blosc",
      "configuration": {
        "typesize": 8,
        "cname": "lz4",
        "clevel": 1,
        "shuffle": "shuffle",
        "blocksize": 0
      }

Sounds like we have a path forward for a fix at least 🎉

@jhamman
Member

jhamman commented Jan 27, 2025

cross referencing this thread: #2171

@normanrz
Member

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

8 is correct, because it is in bytes.

@y4n9squared
Author

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

8 is correct, because it is in bytes.

8 bytes is correct, but what's stored in zarr.json doesn't match the actual value serialized into the Blosc header, which is 1 byte. Part of my initial confusion was looking at the metadata values and concluding that it wasn't a Blosc settings issue.
