
Difference in behavior between 2.x and 3.x using identical compressor settings #2766

Open
y4n9squared opened this issue Jan 25, 2025 · 25 comments
Labels
bug: Potential issues with the zarr-python library
performance: Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@y4n9squared

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.13

Operating System

Linux

Installation

Using pip into virtual environment

Description

Writing the same data using identical compressor settings in Zarr-Python 2.x and 3.x yields differences in compression results.

Using zarr==2.18.4, numcodecs==0.13.1, this produces two chunks each of size 1.7 KiB on disk. The metadata is:

{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            64,
            64,
            2
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/blosc/1.0",
        "configuration": {
            "blocksize": 0,
            "clevel": 1,
            "cname": "lz4",
            "shuffle": 1
        }
    },
    "data_type": "<f8",
    "extensions": [],
    "fill_value": 0.0,
    "shape": [
        64,
        64,
        4
    ]
}

Using zarr==3.0.1, numcodecs==0.15.0, this produces two chunks of size 30 KiB.

{
  "shape": [
    64,
    64,
    4
  ],
  "data_type": "float64",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        64,
        64,
        2
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": 0.0,
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "blosc",
      "configuration": {
        "typesize": 8,
        "cname": "lz4",
        "clevel": 1,
        "shuffle": "shuffle",
        "blocksize": 0
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}

Steps to reproduce

MVCE

Run the following code in a 2.x and a 3.x environment and inspect the contents of /tmp/foo.zarr.

import numpy as np

import zarr

store = "/tmp/foo.zarr"
shape = (64, 64, 4)
chunks = (64, 64, 2)
dtype = np.float64

cname = "lz4"
clevel = 1

if zarr.__version__[0] == "2":
    import numcodecs.blosc

    compressor = numcodecs.Blosc(
        cname=cname,
        clevel=clevel,
        shuffle=numcodecs.blosc.SHUFFLE,
    )

    za = zarr.open(
        store,
        mode="w",
        zarr_version=3,
        shape=shape,
        chunks=chunks,
        dtype=dtype,
        compressor=compressor,
    )
else:
    import zarr.codecs

    compressors = zarr.codecs.BloscCodec(
        cname=cname,
        clevel=clevel,
        shuffle=zarr.codecs.BloscShuffle.shuffle,
    )

    za = zarr.create_array(
        store,
        shape=shape,
        chunks=chunks,
        dtype=dtype,
        zarr_format=3,
        compressors=compressors,
    )

arr = np.arange(np.prod(shape)).reshape(shape)
za[:] = arr
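
To inspect the result, a minimal sketch of listing on-disk chunk sizes (the store path matches the snippet above):

import pathlib

# List every file in the store with its size; the chunk files show the
# compression difference directly.
for path in sorted(pathlib.Path("/tmp/foo.zarr").rglob("*")):
    if path.is_file():
        print(path, path.stat().st_size, "bytes")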

Additional output

No response

y4n9squared added the bug label Jan 25, 2025
@d-v-b
Contributor

d-v-b commented Jan 25, 2025

weird, since the codec config used by blosc is identical in both cases:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==3.0.1",
# ]
# ///

import numpy as np

import zarr

cname = "lz4"
clevel = 1

import numcodecs.blosc

blosc_numcodecs = numcodecs.Blosc(
    cname=cname,
    clevel=clevel,
    shuffle=numcodecs.blosc.SHUFFLE,
)

blosc_zarr = zarr.codecs.BloscCodec(
    cname=cname,
    clevel=clevel,
    shuffle=zarr.codecs.BloscShuffle.shuffle,
)
assert blosc_numcodecs.get_config() == blosc_zarr._blosc_codec.get_config()
print(blosc_numcodecs.get_config())
# {'id': 'blosc', 'cname': 'lz4', 'clevel': 1, 'shuffle': 1, 'blocksize': 0}

@y4n9squared
Author

y4n9squared commented Jan 25, 2025

Yes, I found that mysterious as well. Are the implementations/underlying libraries also the same?

I wonder if the "auto block size" is resolving to be different values. That could materially impact the compression ratio.

If you don't already know off-hand, the block size should be encoded in the Blosc frame header, so I can report back when I have some time to dig.
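
For reference, a minimal sketch of reading those header fields (byte layout per the c-blosc docs; the chunk path is an example, adjust to your store layout):

import struct

# The first 16 bytes of a blosc frame: version, versionlz, flags, typesize,
# then three little-endian uint32s: nbytes, blocksize, cbytes.
with open("/tmp/foo.zarr/c/0/0/0", "rb") as f:
    header = f.read(16)

version, versionlz, flags, typesize = header[:4]
nbytes, blocksize, cbytes = struct.unpack("<III", header[4:16])
print(f"typesize={typesize}, nbytes={nbytes}, blocksize={blocksize}, cbytes={cbytes}")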

@dstansby
Contributor

Could you try the zarr-python 2 example with numcodecs 0.15.0, to help isolate if it's a change in numcodecs or zarr-python?

@y4n9squared
Author

Could you try the zarr-python 2 example with numcodecs 0.15.0, to help isolate if it's a change in numcodecs or zarr-python?

I am getting the same behavior after upgrading the 2.x environment to numcodecs==0.15.0. The compressed chunk using

compressor = Blosc(cname="lz4", clevel=1, shuffle=SHUFFLE)

is still 1.7 KiB (i.e. no change compared to numcodecs==0.13).

@y4n9squared
Author

Here's the hex dump of the Blosc header for the first chunk in both:

❯ hexdump -C /tmp/foo.zarr/data/root/c0/0/1 | head -n1
00000000  02 01 21 08 00 00 01 00  00 00 01 00 b0 06 00 00  |..!.............|

❯ hexdump -C /tmp/bar.zarr/c/0/0/1 | head -n1
00000000  02 01 21 01 00 00 01 00  00 00 01 00 7e 80 00 00  |..!.........~...|

Based on the byte layout in the documentation:

[Image: Blosc header byte layout diagram]

It seems like there is a difference in what is encoded as the "type size" (fourth byte). In Zarr-Python 2.x, this value is 8 bytes (seems correct given that we are encoding np.float64). In Zarr-Python 3.x, the value is 1 byte.

The rest of the header values seem to check out:

Byte 1: Blosc header version 2
Byte 2: Typically 1
Byte 3: Flags (0x21 means bits 0 and 5 are set --> byte shuffle + lz4)
Bytes 5-8: 64 KiB (64 * 64 * 2 * 8 bytes, our chunk size)
Bytes 9-12: 64 KiB block size
Bytes 13-16: Compressed size, little-endian (0x000006b0 == 1.7 KiB and 0x0000807e == 32 KiB)

TL;DR: the 4th byte in the Blosc header (type size) differs between 2.x and 3.x, causing the difference in compression ratio.
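
The typesize effect is easy to reproduce with numcodecs alone (a sketch; exact sizes will vary with the data):

import numpy as np
import numcodecs
import numcodecs.blosc

# Same bytes, two typesize hints: numcodecs derives blosc's typesize from the
# input array's itemsize, so the uint8 view compresses far worse here.
codec = numcodecs.Blosc(cname="lz4", clevel=1, shuffle=numcodecs.blosc.SHUFFLE)
data = np.arange(64 * 64 * 2, dtype=np.float64)
print(len(codec.encode(data)))                 # typesize 8
print(len(codec.encode(data.view(np.uint8))))  # typesize 1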

@dstansby
Contributor

The type size in zarr-python v3 comes from this line:

self._blosc_codec.encode(chunk.as_numpy_array())

My guess is as_numpy_array() will always return an array with an 8-bit data type (in this case, it ends up as int8)?

@dstansby
Contributor

dstansby commented Jan 26, 2025

So it seems like zarr-python 3 is doing a dance where it goes array (w/ 64-bit dtype) > buffer > array (w/ 8-bit dtype). Because blosc is getting an 8-bit array, it's choosing a blocksize of 8-bits, which is not an efficient way to compress data that was originally 64-bit. So to fix this, we need a way to pass a 64-bit array to numcodecs/blosc instead.

That's about the limit of my debugging - someone who knows more about how/why codecs were implemented for zarr-python 3 might need to help. I took a look at the blame, and it seems like @normanrz implemented this in #1588.
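
A numpy-only illustration of that round trip (not zarr's actual code path):

import numpy as np

# Serializing a float64 chunk to raw bytes and rewrapping it as an array
# loses the 8-byte itemsize that blosc would otherwise use as its typesize.
chunk = np.arange(64 * 64 * 2, dtype=np.float64)
raw = chunk.tobytes()                          # array -> buffer
rewrapped = np.frombuffer(raw, dtype=np.int8)  # buffer -> array, 1-byte dtype
print(chunk.dtype.itemsize, rewrapped.dtype.itemsize)  # 8 vs 1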

@y4n9squared
Author

y4n9squared commented Jan 26, 2025

Yes, confirmed. The value of 1 byte for the type size is being determined here because the source buffer dtype is int8.

it's choosing a blocksize of 8-bits, which is not an efficient way to compress data that was originally 64-bit

I don't think it's the block size that's 8 bits, right? The block size, which we let the codec determine, is 64 KiB in both cases. I think the type size being 8 bits vs. 64 bits determines what the shuffle operates on, and since I'm encoding low-valued integers, a shuffle looking at 8 bits at a time will be mostly ineffective since only a few mantissa bits are set.
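
To make that concrete, a toy numpy version of the byte shuffle (a sketch, not blosc's implementation):

import numpy as np

# For typesize T, the shuffle groups the i-th byte of every element together.
# With T == 1 it is a no-op, so telling blosc the data is int8 makes the
# shuffle useless.
def byte_shuffle(arr: np.ndarray) -> bytes:
    return arr.view(np.uint8).reshape(-1, arr.dtype.itemsize).T.tobytes()

vals = np.arange(16, dtype=np.float64)  # low-valued integers as float64
print(byte_shuffle(vals).count(0))      # the zero bytes now sit in long runs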

@dstansby
Contributor

dstansby commented Jan 26, 2025

That sounds right. So to fix this, we need a way of giving a 64-bit dtype (/whatever the dtype of the input data happens to be) array to numcodecs.

@d-v-b
Contributor

d-v-b commented Jan 26, 2025

if we fix this, will it break data that people already saved?

@dstansby
Contributor

I don't think so - we're still saving valid "blosc data", it's just not very well compressed.

dstansby added the performance label Jan 26, 2025
@normanrz
Member

That sounds right. So to fix this, we need a way of giving a 64-bit dtype (/whatever the dtype of the input data happens to be) array to numcodecs.

Or modify numcodecs to accept an itemsize (or typesize) argument.
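
In the meantime, a workaround sketch with the current numcodecs API (not zarr's code): reinterpret the raw buffer with the original dtype before encoding, so blosc derives typesize from the itemsize:

import numpy as np
import numcodecs
import numcodecs.blosc

# Viewing the 1-byte buffer as float64 makes numcodecs pass typesize=8.
codec = numcodecs.Blosc(cname="lz4", clevel=1, shuffle=numcodecs.blosc.SHUFFLE)
raw = np.arange(64 * 64 * 2, dtype=np.float64).tobytes()
buf_i8 = np.frombuffer(raw, dtype=np.int8)         # what zarr 3 currently passes
print(len(codec.encode(buf_i8.view(np.float64))))  # typesize 8 instead of 1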

@dstansby
Contributor

Is there a reason BloscCodec isn't (or couldn't be) an array --> bytes codec instead? Would that be a way of passing the original array instead of doing the array --> bytes --> array --> numcodecs dance?

@d-v-b
Contributor

d-v-b commented Jan 27, 2025

it's not clear that the typesize parameter should always be set to the size of the dtype of the incoming array. For example, variable-length strings don't have a fixed dtype size, but it would still make sense to pick a good typesize parameter for the blosc compressor. IMO typesize is just a parameter of the compressor like any other, it just happens that in many cases it should default to the size of the dtype of an ndimensional array (if that array has a fixed-size dtype).

@normanrz
Member

Is there a reason BloscCodec isn't (or couldn't be) an array --> bytes codec instead? Would that be a way of passing the original array instead of doing the array --> bytes --> array --> numcodecs dance?

That would be a spec change, which I am not enthusiastic about.

@dstansby
Contributor

Given

Zarr implementations MAY allow users to leave this unspecified and have the implementation choose a value automatically based on the array data type and previous codecs in the chain, but MUST record in the metadata the value that is chosen.

It seems like the fix is for zarr-python to intelligently choose a typesize depending on the array data type then?

@d-v-b
Contributor

d-v-b commented Jan 27, 2025

The default should be auto, but I think it should also be exposed as a parameter so users can control it.

@dstansby
Contributor

I don't disagree, but if we defer to blosc to set blocksize we need to pass an array that has the same data type as the original array before it was converted to bytes by zarr-python, and it's not clear to me how to do that?

@normanrz
Member

I think zarr-python should set that default and pass it on to numcodecs. But numcodecs needs to have an itemsize arg to actually have that set. Btw, itemsize/typesize is not the same as blocksize.

@dstansby
Contributor

👍 - just to understand how that would work for a user: they would now have to explicitly set the itemsize on the codec, and if they don't (e.g., specifying the blosc codec the same way as in zarr-python 2), it will potentially result in very bad compression (as in the reproducer at the top of this issue)?

@normanrz
Member

Actually, there is already code for auto-detecting the itemsize: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/codecs/blosc.py#L141-L142 It just doesn't get passed on to numcodecs.
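
Paraphrasing those lines rather than quoting them (a rough sketch, not zarr's verbatim source): an unset typesize is filled in from the dtype's itemsize when the codec is resolved against the array:

import dataclasses
import numpy as np

@dataclasses.dataclass(frozen=True)
class BloscConfig:  # hypothetical stand-in for the codec dataclass
    typesize: int | None = None

def evolve(cfg: BloscConfig, dtype: np.dtype) -> BloscConfig:
    # assumption: fall back to the dtype's itemsize when typesize is unset
    if cfg.typesize is None:
        cfg = dataclasses.replace(cfg, typesize=dtype.itemsize)
    return cfg

print(evolve(BloscConfig(), np.dtype("float64")).typesize)  # 8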

@dstansby
Contributor

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

    {
      "name": "blosc",
      "configuration": {
        "typesize": 8,
        "cname": "lz4",
        "clevel": 1,
        "shuffle": "shuffle",
        "blocksize": 0
      }

Sounds like we have a path forward for a fix at least 🎉

@jhamman
Member

jhamman commented Jan 27, 2025

cross referencing this thread: #2171

@normanrz
Member

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

8 is correct, because it is in bytes.

@y4n9squared
Author

Ah awesome! Potentially it isn't working correctly in zarr-python either, because in the example at the top the typesize ends up as 8 when it should be 64?

8 is correct, because it is in bytes.

8 bytes is correct, but what's stored in zarr.json doesn't match the actual value serialized into the Blosc header, which is 1 byte. Part of my initial confusion was looking at the metadata values and concluding that it wasn't a Blosc settings issue.
