
An alternative approach to variable-length chunks: array concatenation within Zarr #2536

Open · rabernat opened this issue Dec 5, 2024 · 6 comments


@rabernat (Contributor) commented Dec 5, 2024

We've discussed variable-length chunks at length (🥁🤣) in many places, and we have an implementation based on ZEP3 in PR #1595. While the ZEP / extension process remains broken, I'd like to propose that we experiment with implementations of different approaches to the problem within Zarr Python. I'm deliberately avoiding framing this discussion in terms of "how should we change / extend the spec?" and instead framing it in terms of "how can we solve our problem?", with the idea that the spec would follow. If we like this idea, we can implement it and make it work first, and then figure out what sort of extension or spec might be required.

The main reason that many people in our community want variable-length chunks is so we can use them together with kerchunk / virtualizarr to create virtual Zarr data on top of existing HDF5 files. However, HDF5 files, like Zarr, support only fixed-shape chunks. So why do we need variable-length chunks?

Because, in addition to virtualizing a single file, we are also concatenating many individual files into one virtual Array.

So what if we just built the ability to address multiple arrays within a Zarr group as a single array?

Imagine we have two arrays array-0 and array-1 of the same dtype and concatenate-able shapes, e.g. (2, 50, 10) and (3, 50, 10). And these arrays are stored in the same group:

```
my-store/
├─ my-array-group/
│  ├─ zarr.json
│  ├─ array-0/
│  │  ├─ zarr.json
│  │  ├─ c/0/0
│  │  ├─ c/0/2
│  ├─ array-1/
│  │  ├─ zarr.json
│  │  ├─ c/0/0
│  │  ├─ c/0/2
│  │  ├─ c/0/3
```
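
For concreteness, here's a sketch of creating that layout (assuming zarr-python's `Group.create_array`; the chunk shapes here are arbitrary and not part of the proposal):

```python
import zarr

# Create the two sub-arrays from the layout above inside a single group.
group = zarr.open_group("my-store/my-array-group", mode="w")
a0 = group.create_array("array-0", shape=(2, 50, 10), chunks=(1, 50, 10), dtype="f4")
a1 = group.create_array("array-1", shape=(3, 50, 10), chunks=(1, 50, 10), dtype="f4")
```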

We can concatenate them along axis 0 and obtain an array of shape (5, 50, 10). This can happen completely lazily. All that's required is a bit of indexing logic to map requests against the concatenated array to the appropriate indexes within the sub-arrays. I'm sure many of us have written code like this already. That code could live in Zarr Python as a new type of Array, implementing mostly the same interface as our existing Array. (Is it writeable? I don't see why it couldn't be, tbh.) Concatenation could be done for any Zarr arrays that have compatible dtypes and shapes. It's completely lazy and cheap.
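
Here's a minimal sketch of that indexing logic (my own illustration, not anything in zarr-python today; it only handles contiguous slices along the concatenation axis):

```python
import numpy as np

class ConcatenatedArray:
    """Lazily present arrays with matching dtype and trailing dimensions
    as one array concatenated along axis 0 (sketch only)."""

    def __init__(self, arrays):
        if len({(a.dtype, a.shape[1:]) for a in arrays}) != 1:
            raise ValueError("arrays must share dtype and trailing dimensions")
        self.arrays = list(arrays)
        # offsets[i] = index along axis 0 where arrays[i] begins
        self.offsets = np.cumsum([0] + [a.shape[0] for a in self.arrays])
        self.shape = (int(self.offsets[-1]),) + self.arrays[0].shape[1:]
        self.dtype = self.arrays[0].dtype

    def __getitem__(self, key):
        # split the key into the axis-0 slice and the trailing selection
        sel, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())
        start, stop, step = sel.indices(self.shape[0])
        assert step == 1, "sketch handles only contiguous slices on axis 0"
        pieces = []
        for arr, off in zip(self.arrays, self.offsets):
            lo, hi = max(start, off), min(stop, off + arr.shape[0])
            if lo < hi:  # this sub-array overlaps the requested range
                pieces.append(arr[(slice(lo - off, hi - off), *rest)])
        return np.concatenate(pieces, axis=0)
```

With the two arrays above, `ConcatenatedArray([a0, a1])[1:4]` reads only the sub-array regions overlapping rows 1-4, so the laziness carries all the way down to the chunk reads.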

The next step is to define some metadata in my-array-group that tells us to do this automatically for the arrays in the group. For example, something like:

```json
{
  "extensions": {
    "name": "concatenation",
    "configuration": {
      "axis": 0,
      "arrays": ["array-0", "array-1"]
    }
  }
}
```

With an appropriate configuration system and API, when opening such a group, we could automatically present it as an Array to downstream libraries, e.g. Xarray.
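
As a purely hypothetical sketch of what that could look like (none of these names exist today):

```python
import zarr

grp = zarr.open_group("my-store/my-array-group")
# Hypothetical helper: if the group metadata carries the "concatenation"
# extension, hand back an Array-like view instead of a plain Group.
arr = grp.as_concatenated_array()
assert arr.shape == (5, 50, 10)
```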

There are many variants of this metadata, and many questions about how it should work:

  • Do we explicitly list all the array names we want to concat, as in my approach?
  • Or do we just use a naming convention for the arrays themselves, like we do for chunks?
  • Or do we use a template approach, like "array_name_template": "array-{i}"?
  • Do we store information about the dtype, shape, etc. in this metadata?

Compared to variable-length chunks, this proposal operates at a higher level of abstraction. The advantages are:

  • No change required to array spec itself
  • Less metadata to keep track of (enumerating the length of every chunk could be expensive)
  • Can concatenate arrays with different codecs and potentially different dtypes into one array
  • Potentially supports multidimensional concatenation
  • The sub-arrays are just regular Zarr arrays, creating a fallback mechanism for implementations that don't support concatenation.

The main downsides I see are:

  • Array shape is implicit; determining it requires reading all of the sub-array metadata
  • Indexing into the data is more expensive because it requires more logic and metadata reads to figure out where the data actually live

The ability to quickly load lots of metadata for different arrays, either through async requests, consolidated metadata, or Icechunk, mitigates these issues considerably for high-latency stores.

For a scenario like zarr-developers/VirtualiZarr#330, we wouldn't necessarily have to map one file to one array. We could bundle as many regularly-sized chunks as we can into a single virtual array, and then, when we hit a chunk discontinuity, start a new virtual array. It's very flexible.
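
To illustrate that bundling (my own sketch, not existing kerchunk / virtualizarr behavior), grouping a stream of per-file chunk lengths into regularly-sized runs could look like:

```python
def split_runs(chunk_lengths):
    """Group consecutive equal chunk lengths into (length, count) runs;
    each run can back one virtual array with a regular chunk size."""
    runs = []
    for n in chunk_lengths:
        if runs and runs[-1][0] == n:
            runs[-1] = (n, runs[-1][1] + 1)
        else:
            runs.append((n, 1))
    return runs

# one short file in a stream of regular ones starts a new virtual array
assert split_runs([10, 10, 10, 7, 10, 10]) == [(10, 3), (7, 1), (10, 2)]
```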

What I like about this proposal is that we can implement the hard part--the actual concatenation logic--without any spec changes at all. We could manually concatenate arrays, check the performance, get comfortable with the tradeoffs, etc. before doing any spec work.

cc @TomNicholas, @abarciauskas-bgse.

@norlandrhagen commented:

Love framing it in terms of "how can we solve our problem", @rabernat!

@jakirkham (Member) commented:

Had started doing some refactoring of VLen* in Numcodecs locally, which may help with extending it in different ways in the future

Happy to put that in a PR

It would be nice to get some cleanup out of the way first. Namely this PR: zarr-developers/numcodecs#656

@rabernat (Contributor, Author) commented Dec 6, 2024

Thanks @jakirkham! The way I see it, this proposal does not interact with numcodecs in any way. It doesn't require any special support from codecs, VLen*, etc. It sits higher up in the stack. Sorry if that was not clear.

@d-v-b (Contributor) commented Dec 6, 2024

thanks for posting this @rabernat, I think this is a great idea for two reasons: first, the proposal itself is straightforward and should be useful to lots of people, and second, implementing it will require healthy changes to how zarr-python represents chunks.

The current model of zarr-python is that an array is backed by a dynamically resolved mapping from array indices to all of the chunks associated with that array, as per the zarr metadata. This is a "top-down" approach: we start with the full-sized array, find the chunks for that array as needed, and model those chunks as numpy / cupy arrays.

An alternative model would be bottom-up: we start by defining the smallest possible zarr array (a single chunk), and then define a concatenation operation that allows us to construct larger arrays by combining smaller arrays. In this model, it's zarr arrays all the way down, just arranged in different ways. This would naturally extend to the model you propose here, and open up a lot of other opportunities. Very exciting!
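
A toy way to picture that bottom-up model (just an illustration of the idea, not a proposed design):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    shape: tuple  # the smallest possible array: a single chunk

@dataclass
class Concat:
    axis: int
    parts: list  # each part is a Chunk or another Concat

def shape_of(node):
    """Shapes compose recursively: 'zarr arrays all the way down'."""
    if isinstance(node, Chunk):
        return node.shape
    first = shape_of(node.parts[0])
    total = sum(shape_of(p)[node.axis] for p in node.parts)
    return first[:node.axis] + (total,) + first[node.axis + 1:]

row = Concat(axis=0, parts=[Chunk((2, 50, 10)), Chunk((3, 50, 10))])
assert shape_of(row) == (5, 50, 10)
```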

@abarciauskas-bgse commented:

Thanks Ryan, your approach makes sense to me.

Do you think it could extend to variable-length chunks along multiple dimensions? I think this scenario is very much the exception, but a use case does exist.

@rabernat (Contributor, Author) commented Dec 7, 2024

Yes, I think that would work fine, as long as the arrays (not the chunks themselves) have concatenate-able shapes.
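
For instance, in plain numpy (just to show what "concatenate-able" means in the multidimensional case):

```python
import numpy as np

# Four sub-arrays tiled along axes 0 and 1; their internal chunking could
# differ, but the array shapes must line up along each concatenation axis.
a = np.zeros((2, 50, 10)); b = np.zeros((2, 30, 10))
c = np.zeros((3, 50, 10)); d = np.zeros((3, 30, 10))
top = np.concatenate([a, b], axis=1)          # (2, 80, 10)
bottom = np.concatenate([c, d], axis=1)       # (3, 80, 10)
full = np.concatenate([top, bottom], axis=0)  # (5, 80, 10)
```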
