An alternative approach to variable-length chunks: array concatenation within Zarr #2536
Comments
Love this "framing it in terms of how can we solve our problem" @rabernat!
Had started doing some refactoring of Happy to put that in a PR. It would be nice to get some cleanup out of the way first. Namely this PR: zarr-developers/numcodecs#656
Thanks @jakirkham! The way I see it, this proposal does not interact with numcodecs in any way. It doesn't require any special support from codecs, VLen*, etc. It sits higher up in the stack. Sorry if that was not clear.
thanks for posting this @rabernat, I think this is a great idea for two reasons: first, the proposal itself is straightforward and should be useful to lots of people, and second, implementing it will require healthy changes to how The current model of An alternative model would be bottom-up: we start by defining the smallest possible zarr array (a single chunk), and then define a concatenation operation that allows us to construct larger arrays by combining smaller arrays. In this model, it's zarr arrays all the way down, just arranged in different ways. This would naturally extend to the model you propose here, and open up a lot of other opportunities. Very exciting!
Thanks Ryan, your approach makes sense to me. Do you think it could extend to having variable length chunks for multiple dimensions? I think this scenario is very much the exception, but a use case does exist.
Yes I think that would work fine. As long as the arrays (not the chunks themselves) have concatenate-able shapes.
We've discussed variable-length chunks at length (🥁🤣) in many places, and have an implementation based on ZEP3 in PR #1595. While the ZEP / extension process remains broken, I'd like to propose we experiment with some implementations of different approaches to the problem within Zarr Python. I'm deliberately avoiding framing this discussion in terms of *how should we change / extend the spec* and instead framing it in terms of *how can we solve our problem*, with the idea that the spec would follow. If we like this idea, we can try to just implement it and make it work first, and then we can figure out what sort of extension or spec might be required.
The main reason that many people in our community want variable length chunks is so we can use them together with kerchunk / virtualizarr to create virtual Zarr data on top of existing HDF5 files. However, HDF5 files, like Zarr, also only support fixed-shape chunks. So why do we need variable length chunks?
Because, in addition to virtualizing a single file, we are also concatenating many individual files into one virtual Array.
So what if we just built the ability to address multiple arrays within a Zarr group as a single array?
Imagine we have two arrays `array-0` and `array-1` of the same dtype and concatenate-able shapes, e.g. (2, 50, 10) and (3, 50, 10). And these arrays are stored in the same group:

We can concatenate them along axis 0 and obtain an array of shape (5, 50, 10). This can happen completely lazily. All that's required is a bit of indexing logic to map requests to the concatenated array to the appropriate indexes within the sub arrays. I'm sure many of us have written code like this already. That code could live in Zarr Python as a new type of Array, implementing mostly the same interface as our existing Array. (Is it writeable? I don't see why it can't be tbh.) Concatenation could be done for any Zarr arrays that have compatible dtypes and shapes. It's completely lazy and cheap.
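
To make the "bit of indexing logic" concrete, here is a minimal sketch of a lazy, read-only concatenation along axis 0. The class name `LazyConcat` and its `read` method are hypothetical and are not part of Zarr Python. NumPy arrays stand in for the sub-arrays; the sketch only assumes each sub-array exposes `shape`, `dtype`, and NumPy-style slicing.

```python
import bisect
from itertools import accumulate

import numpy as np


class LazyConcat:
    """Hypothetical read-only view that concatenates arrays along axis 0."""

    def __init__(self, arrays):
        self.arrays = list(arrays)
        first = self.arrays[0]
        for a in self.arrays:
            # Same dtype and identical trailing dimensions are required.
            if a.dtype != first.dtype or a.shape[1:] != first.shape[1:]:
                raise ValueError("arrays are not concatenate-able along axis 0")
        # offsets[i] is the global start row of arrays[i]; the last entry is the total length.
        self.offsets = [0, *accumulate(a.shape[0] for a in self.arrays)]
        self.shape = (self.offsets[-1],) + tuple(first.shape[1:])
        self.dtype = first.dtype

    def read(self, start, stop):
        """Return rows [start, stop), reading only the sub-arrays that overlap the request."""
        if not 0 <= start <= stop <= self.shape[0]:
            raise IndexError("requested range is out of bounds")
        pieces = []
        # Index of the sub-array whose row range contains `start`.
        i = bisect.bisect_right(self.offsets, start) - 1
        while start < stop:
            lo, hi = self.offsets[i], self.offsets[i + 1]
            local_stop = min(stop, hi)
            # Translate global row numbers into rows local to sub-array i.
            pieces.append(self.arrays[i][start - lo:local_stop - lo])
            start = local_stop
            i += 1
        if not pieces:
            return np.empty((0,) + self.shape[1:], dtype=self.dtype)
        return np.concatenate(pieces, axis=0)


array_0 = np.zeros((2, 50, 10), dtype="float32")
array_1 = np.ones((3, 50, 10), dtype="float32")
virtual = LazyConcat([array_0, array_1])
print(virtual.shape)
block = virtual.read(1, 4)  # spans the boundary between the two sub-arrays
```

Nothing is copied until `read` is called, and a request that falls inside one sub-array touches only that sub-array. Writing would use the same offset translation in the other direction.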
The next step is to define some metadata in `my-array-group` that tells us to do this automatically for the arrays in the group. For example, something like:

With an appropriate configuration system and API, when opening such a group, we could automatically present it as an Array to downstream libraries, e.g. Xarray.
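
Purely as an illustration of the kind of group metadata this could be, here is a sketch written as a Python dict. Only the `array_name_template` key and its `array-{i}` value appear in this discussion; the `concatenate`, `axis`, and `num_arrays` keys and the `member_names` helper are assumptions made up for the example and do not exist in any Zarr spec.

```python
# Hypothetical group-level metadata. Apart from "array_name_template",
# every key here is an assumption for illustration only.
concat_metadata = {
    "concatenate": {
        "axis": 0,
        "array_name_template": "array-{i}",
        "num_arrays": 2,
    }
}


def member_names(metadata):
    """Expand the name template into the ordered list of member array names."""
    cfg = metadata["concatenate"]
    return [cfg["array_name_template"].format(i=i) for i in range(cfg["num_arrays"])]


names = member_names(concat_metadata)
print(names)
```

A reader that understands this metadata could open each named array in the group and present the lazy concatenation along `axis` as a single Array.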
There are many variants of this metadata, and many questions about how it should work:
"array_name_template": "array-{i}"
Compared to variable-length chunks, this proposal operates at a higher level of abstraction. The advantages over variable length chunks are:
The main downsides I see are:
The ability to quickly load lots of metadata for different arrays, either through async requests, consolidated metadata, or Icechunk, mitigates these issues considerably for high-latency stores.
For a scenario like zarr-developers/VirtualiZarr#330, we wouldn't necessarily have to do `one file : one array`. We could bundle as many regularly-sized chunks as we can into a single virtual array, and then when we hit a chunk discontinuity, start a new virtual array. It's very flexible.

What I like about this proposal is that we can implement the hard part--the actual concatenation logic--without any spec changes at all. We could manually concatenate arrays, check the performance, get comfortable with the tradeoffs, etc. before doing any spec work.
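
The bundling idea can be sketched as follows, assuming we already know the length of every chunk along the concatenation axis. The function name `split_into_regular_runs` and the dict layout are hypothetical. The sketch treats any change in chunk length as a discontinuity; a real implementation might also want to fold a short trailing chunk into the preceding array.

```python
from itertools import groupby


def split_into_regular_runs(chunk_lengths):
    """Group consecutive equal-length chunks into runs, one run per virtual array."""
    runs = []
    offset = 0  # global start index of the current run along the concatenation axis
    for length, group in groupby(chunk_lengths):
        count = len(list(group))
        runs.append({"start": offset, "chunk_length": length, "num_chunks": count})
        offset += length * count
    return runs


# Chunk lengths along axis 0, e.g. gathered from several source files.
runs = split_into_regular_runs([10, 10, 10, 7, 10, 10])
for run in runs:
    print(run)
```

Each run describes one regularly chunked virtual array, and the `start` offsets are what the concatenation logic needs to route a request to the right array.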
cc @TomNicholas, @abarciauskas-bgse.