Difference in behavior between 2.x and 3.x using identical compressor settings #2766
weird, since the codec config used by blosc is identical in both cases:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==3.0.1",
# ]
# ///
import numcodecs.blosc
import zarr

cname = "lz4"
clevel = 1

blosc_numcodecs = numcodecs.Blosc(
    cname=cname,
    clevel=clevel,
    shuffle=numcodecs.blosc.SHUFFLE,
)
blosc_zarr = zarr.codecs.BloscCodec(
    cname=cname,
    clevel=clevel,
    shuffle=zarr.codecs.BloscShuffle.shuffle,
)
assert blosc_numcodecs.get_config() == blosc_zarr._blosc_codec.get_config()
print(blosc_numcodecs.get_config())
# {'id': 'blosc', 'cname': 'lz4', 'clevel': 1, 'shuffle': 1, 'blocksize': 0}
```
Yes, I found that mysterious as well. Are the implementations/underlying libraries also the same? I wonder if the "auto block size" is resolving to different values. That could materially impact the compression ratio. If you don't already know off-hand, the block size should be encoded in the Blosc frame header, so I can report back when I have some time to dig.
Could you try the zarr-python 2 example with numcodecs 0.15.0, to help isolate whether it's a change in numcodecs or in zarr-python?
I am getting the same behavior after upgrading the 2.x environment to use numcodecs 0.15.0; the chunk size is still 1.7 KiB (i.e. no change as compared to numcodecs 0.13.1).
Here's the hex dump of the Blosc header for the first chunk in both:

```
❯ hexdump -C /tmp/foo.zarr/data/root/c0/0/1 | head -n1
00000000  02 01 21 08 00 00 01 00  00 00 01 00 b0 06 00 00  |..!.............|
❯ hexdump -C /tmp/bar.zarr/c/0/0/1 | head -n1
00000000  02 01 21 01 00 00 01 00  00 00 01 00 7e 80 00 00  |..!.........~...|
```

Based on the byte layout in the documentation, there is a difference in what is encoded as the "type size" (fourth byte). In Zarr-Python 2.x this value is 8 bytes, which seems correct given the 64-bit dtype of the data being encoded, while in 3.x it is 1 byte. The rest of the header values seem to check out (e.g. byte 1 is Blosc header version 2).

TL;DR: the fourth byte of the Blosc header (type size) differs between 2.x and 3.x, causing the difference in compression ratio.
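To make the header comparison reproducible, here is a small parser for the first 16 bytes of a C-Blosc frame header (field layout per the C-Blosc format docs). This is an illustrative sketch, not code from either library; the two hex strings are copied from the dumps above:

```python
import struct

def parse_blosc_header(raw: bytes) -> dict:
    """Decode the 16-byte C-Blosc frame header (all multi-byte fields little-endian)."""
    version, versionlz, flags, typesize = raw[0], raw[1], raw[2], raw[3]
    nbytes, blocksize, cbytes = struct.unpack_from("<III", raw, 4)
    return {
        "version": version,          # Blosc format version
        "versionlz": versionlz,      # format version of the inner codec
        "shuffle": bool(flags & 1),  # bit 0 of the flags byte: byte shuffle active
        "typesize": typesize,        # the byte in question
        "nbytes": nbytes,            # uncompressed size of the chunk
        "blocksize": blocksize,      # internal block size chosen by blosc
        "cbytes": cbytes,            # compressed size, including this header
    }

# The two headers from the hex dumps above:
v2_chunk = bytes.fromhex("02012108" "00000100" "00000100" "b0060000")
v3_chunk = bytes.fromhex("02012101" "00000100" "00000100" "7e800000")

print(parse_blosc_header(v2_chunk))  # typesize=8, cbytes=1712
print(parse_blosc_header(v3_chunk))  # typesize=1, cbytes=32894
```

Both headers agree on every field except the typesize, and the compressed sizes (cbytes) match the on-disk chunk sizes reported in the issue.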
The type size in zarr-python v3 comes from this line: zarr-python/src/zarr/codecs/blosc.py, line 186 (at 80aea2a).
My guess is
So it seems like that is where the wrong type size comes from. That's about the limit of my debugging; someone who knows more about how/why codecs were implemented for zarr-python 3 might need to help. I took a look at the blame, and it seems like @normanrz implemented this in #1588.
Yes, confirmed. The value of 1 byte for the type size is being determined here because the source buffer dtype is uint8: the codec is handed a raw byte buffer rather than a view with the array's original dtype.
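The mechanism is easy to demonstrate in isolation with a small sketch (hypothetical, not the actual zarr code path): once array data is handed around as a raw byte buffer, the original item size is lost unless the view is explicitly re-cast:

```python
import struct

# Four int64 values packed into a buffer, standing in for a chunk of array data.
payload = struct.pack("<4q", 1, 2, 3, 4)

raw = memoryview(payload)  # what a codec sees if given plain bytes
typed = raw.cast("q")      # the view with the original 8-byte items restored

print(raw.itemsize)    # 1 -> a typesize of 1 would be inferred
print(typed.itemsize)  # 8 -> the typesize the shuffle actually needs
```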
I don't think it's the block size that's 8 bits, right? The block size, which we have the codec determine, is set to 64 KiB in both cases. I think the type size being 8 bits vs. 64 bits determines what the shuffle looks at, and since I'm encoding low-valued integers, a shuffle looking 8 bits at a time will be mostly ineffective since only a few low-order bits are set.
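The effect described above can be shown without Blosc at all. The sketch below implements the same byte transposition that Blosc's shuffle performs (a simplified pure-Python stand-in) and uses zlib in place of LZ4 as the inner compressor; with small-valued 64-bit integers, shuffling with the correct 8-byte typesize groups the mostly-zero high bytes into long runs:

```python
import struct
import zlib

def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group byte k of every element together (Blosc-style byte shuffle)."""
    n = len(buf) // typesize
    return bytes(buf[e * typesize + k] for k in range(typesize) for e in range(n))

# 4096 small int64 values: only the low bytes of each element carry information.
data = struct.pack("<4096q", *range(4096))

plain = len(zlib.compress(data, 1))                      # typesize=1: shuffle is a no-op
shuffled = len(zlib.compress(byte_shuffle(data, 8), 1))  # typesize=8: zeros form long runs

print(plain, shuffled)  # the shuffled buffer compresses substantially better
```

With typesize 1 the shuffle is the identity, which is why the v3 chunks behave as if no shuffle had been applied.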
That sounds right. So to fix this, we need a way of giving a 64-bit dtype (or whatever the dtype of the input data happens to be) array to the underlying Blosc encoder.
If we fix this, will it break data that people already saved?

I don't think so; we're still saving valid "blosc data", it's just not very well compressed.

Or modify

Is there a reason
It's not clear that the typesize parameter should always be set to the size of the dtype of the incoming array. For example, variable-length strings don't have a fixed dtype size, but it would still make sense to pick a good typesize parameter for the blosc compressor. IMO typesize is just a parameter of the compressor like any other; it just happens that in many cases it should default to the size of the dtype of an n-dimensional array (if that array has a fixed-size dtype).

That would be a spec change, which I am not enthusiastic about.
Given
It seems like the fix is for zarr-python to pass the detected typesize through to numcodecs.
The default should be auto, but I think it should also be exposed as a parameter so users can control it.
I don't disagree, but if we defer to
I think zarr-python should set that default and pass it on to numcodecs. But numcodecs needs to have an itemsize arg to actually have that set. Btw, itemsize/typesize is not the same as blocksize.
👍 - just to understand how that would work for a user: they would now have to explicitly set the itemsize on the codec, and if they don't (e.g. specifying the blosc codec the same way as in zarr-python 2), it will result in (potentially) very bad compression (as in the reproducer at the top of this issue)?
Actually, there is already code for auto-detecting the itemsize: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/codecs/blosc.py#L141-L142 It just doesn't get passed on to numcodecs.
Ah awesome! Potentially it isn't working correctly somewhere, then. For reference, the codec metadata stored for the 3.x array is:

```json
{
  "name": "blosc",
  "configuration": {
    "typesize": 8,
    "cname": "lz4",
    "clevel": 1,
    "shuffle": "shuffle",
    "blocksize": 0
  }
}
```

Sounds like we have a path forward for a fix at least 🎉
Cross-referencing this thread: #2171

8 is correct, because it is in bytes.

8 bytes is correct, but what's stored in
Zarr version
3.0.1
Numcodecs version
0.15.0
Python Version
3.13
Operating System
Linux
Installation
Using pip into virtual environment
Description
Writing the same data using identical compressor settings in Zarr-Python 2.x and 3.x yields differences in compression results.
Using zarr==2.18.4, numcodecs==0.13.1, this produces two chunks each of size 1.7 KiB on disk.

Using zarr==3.0.1, numcodecs==0.15.0, this produces two chunks of size 30 KiB.

Steps to reproduce
MVCE
Run the following code in a 2.x and 3.x environment and inspect the contents of /tmp/foo.zarr.

Additional output
No response