Iteration over Zarr arrays is significantly slower in v3 #2529
Comments
I think we just haven't implemented `__iter__` yet. I do think this kind of iteration is intrinsically inefficient for most zarr arrays, because it doesn't take any account of the chunk structure. Even if chunks were cached, the most efficient iteration would be to iterate over chunks, then iterate over each chunk. This of course is not possible if we choose to emulate numpy-style iteration.
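To make that last point concrete, here is a small illustrative sketch (not from the thread) using the zarr 2-style `cdata_shape` and `blocks` attributes that appear in the comments below: visiting each stored chunk once yields elements in an order that generally differs from numpy's row-major iteration order, which is why chunk-first iteration cannot simply back a numpy-style `__iter__`.

```python
import itertools

import numpy as np
import zarr

# A tiny array with two column chunks, so chunk order and row order differ.
z = zarr.zeros((2, 2), chunks=(2, 1))
z[:] = np.array([[0, 1], [2, 3]])

# Visit each stored chunk once and collect elements in chunk order.
chunk_order = []
for idx in itertools.product(*(range(n) for n in z.cdata_shape)):
    chunk_order.extend(z.blocks[idx].ravel().tolist())

print(chunk_order)            # [0.0, 2.0, 1.0, 3.0]  (chunk-by-chunk order)
print(z[:].ravel().tolist())  # [0.0, 1.0, 2.0, 3.0]  (numpy row-major order)
```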
I don't follow this - if it iterates over the chunks in the first dimension, then that's about as good as you can do, given the definition of the operation? Something like

```python
def iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a
```

(Just FYI, I added the iter functionality to zarr 2 and find it a useful operation)
Correct me if I'm misunderstanding something -- suppose we have an array chunked such that every row intersects every chunk (e.g. shape (3, 2) with chunks (3, 1)); then each step of the iteration has to read every chunk. The basic problem is that the IO-optimal way of iterating over zarr arrays requires knowledge of how they are chunked, but the behavior of numpy-style iteration doesn't depend on the chunking at all.

(edit: I had transposed my efficient and inefficient chunks)
To illustrate with some examples: this requires reading every chunk per iteration:

```python
>>> [x for x in zarr.zeros((3,2), chunks=(3,1))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]
```

whereas this requires reading 1 chunk per iteration:

```python
>>> [x for x in zarr.zeros((3,2), chunks=(1,2))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]
```

For isotropic chunk sizes, some caching is better than none, but this kind of iteration will still be inefficient compared to a chunk-aware iteration API.
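As a rough way to put numbers on this, here is a small back-of-the-envelope sketch (the helpers `naive_iteration_reads` and `stored_chunks` are purely illustrative, not a zarr API) counting how many chunk reads a naive, non-caching row-by-row iteration performs, compared with the number of chunks actually stored:

```python
import math


def naive_iteration_reads(shape, chunks):
    """Chunk reads needed to yield every row via z[0], z[1], ... with no caching."""
    chunks_per_row = math.prod(math.ceil(s / c) for s, c in zip(shape[1:], chunks[1:]))
    return shape[0] * chunks_per_row


def stored_chunks(shape, chunks):
    """Number of chunks the array is actually stored as."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


for chunk_shape in [(3, 1), (1, 2)]:
    print(chunk_shape, naive_iteration_reads((3, 2), chunk_shape), stored_chunks((3, 2), chunk_shape))
# (3, 1): 6 reads for 2 stored chunks
# (1, 2): 3 reads for 3 stored chunks
```

Note that with chunk caching the first case drops to 2 reads, but only because the two chunks together hold the entire array in memory, which is the concern raised later in the thread.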
Sorry, I don't get it (I must be missing something basic here!)

```python
def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a
```

Surely the numpy array `A` is only loaded once per block of the first dimension? So this is chunk aware, right?
That function will potentially load an entire array into memory (see the first case), which we do not want:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr==2.18",
# ]
# ///
import zarr
import numpy as np

def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        print('A has shape ', A.shape)
        for a in A:
            yield a

shape = (8, 7)
chunks = ((8, 1), (1, 7), (3, 3))

for _chunks in chunks:
    print('chunks', _chunks)
    z = zarr.zeros(shape=shape, chunks=_chunks)
    z[:] = np.arange(np.prod(shape)).reshape(*shape)
    gen = first_dim_iter(z)
    print(next(gen))
    print(next(gen))
```

produces

```
chunks (8, 1)
A has shape  (8, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7.  8.  9. 10. 11. 12. 13.]
chunks (1, 7)
A has shape  (1, 7)
[0. 1. 2. 3. 4. 5. 6.]
A has shape  (1, 7)
[ 7.  8.  9. 10. 11. 12. 13.]
chunks (3, 3)
A has shape  (3, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7.  8.  9. 10. 11. 12. 13.]
```
It looks like the implementation of `__iter__` in v2 caches chunks during iteration. Should we add this to v3 and treat it like a "user beware" situation where people should know how the array is chunked before they iterate over it? Personally, I would be unpleasantly surprised if I tried to iterate over a chunked array and incidentally fetched all the chunks (but I work with TB-sized data). I would generally assume that a "big data first" library would keep surprises like that to a minimum. More broadly, I'm not a fan of copying numpy APIs like `__iter__` when they map poorly onto chunked storage.
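One possible concrete form of "user beware", purely as a hypothetical sketch (nothing like this exists in zarr): warn when the chunking means that yielding a single step of the iteration has to touch many chunks.

```python
import math
import warnings


def check_iteration_cost(shape, chunks, threshold=1):
    """Warn if one step of row iteration must read more than `threshold` chunks.

    Hypothetical helper for illustration only; not part of any zarr API.
    """
    chunks_per_step = math.prod(
        math.ceil(s / c) for s, c in zip(shape[1:], chunks[1:])
    )
    if chunks_per_step > threshold:
        warnings.warn(
            f"iterating this array reads {chunks_per_step} chunks per step; "
            "consider chunk-aware iteration instead",
            stacklevel=2,
        )


check_iteration_cost(shape=(8, 7), chunks=(8, 1))  # warns: 7 chunks per step
check_iteration_cost(shape=(8, 7), chunks=(1, 7))  # silent: 1 chunk per step
```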
Sorry, I'm sure I'm being thick here, but I'm still not getting it. You could only load the entire array into memory if there was only one chunk in the first dimension, right? Having a single chunk in the first dimension of a large array seems quite an odd situation to me? I'm used to working with arrays with a large first dimension, with many chunks, and this pattern works well. I work with TB-scale data as well and I entirely agree that unpleasant surprises like loading an entire array into memory behind the scenes shouldn't happen. But surprises like decompressing each chunk chunksize times are also not entirely pleasant.
Let me assure you that this is extremely common. Zarr users exercise every possible allowed permutation of chunk shapes to satisfy different data access patterns. There is nothing special about the first dimension in any way.
Thanks @rabernat, that's why I'm confused then! But in this one-chunk case, isn't the whole array getting loaded into memory anyway each time we access a single element during iteration?
In the worst case, the entire array is loaded into memory just once (on the first invocation of `next`). I think the following example illustrates it. I'm using a subclass of `MemoryStore` that prints every key it reads:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "zarr==2.18",
# ]
# ///
import zarr
import numpy as np

class LoudStore(zarr.MemoryStore):
    def __getitem__(self, key):
        print(f"reading {key}")
        return super().__getitem__(key)

shape = (8, 7)
chunks = ((shape[0], 1), (1, shape[1]))

for _chunks in chunks:
    print(f'\nchunks: {_chunks}')
    z = zarr.zeros(shape=shape, chunks=_chunks, store=LoudStore())
    z[:] = np.arange(z.size).reshape(shape)
    gen = z.islice()
    for it in range(2):
        print(f'iter {it}:')
        print(next(gen))
```

which produces the following output:
@d-v-b in v3 chunks are not cached though, so it's much slower, which is why I opened this issue. This is the worst of both worlds as iteration works, but is terribly inefficient. I think there's a case for implementing `__iter__` with chunk caching in v3.
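For reference, a rough sketch of one way a chunk-caching iterator could look, written against the zarr 2-style API (`chunks`, slicing) used elsewhere in this thread rather than any actual v3 implementation; the helper `iter_with_chunk_cache` is hypothetical. It reads one band of chunk rows at a time and yields numpy rows from that in-memory band, with the memory caveat discussed above when the first axis has a single chunk:

```python
import numpy as np
import zarr


def iter_with_chunk_cache(z):
    """Yield z[0], z[1], ... in order, reading each band of chunk rows only once.

    Memory use is bounded by chunks[0] * prod(shape[1:]) elements, which is the
    whole array if the first axis has a single chunk (the caveat discussed above).
    """
    step = z.chunks[0]
    for start in range(0, z.shape[0], step):
        # One __getitem__ call per band of chunk rows; each chunk in the band is read once.
        band = z[start:start + step]
        yield from band


z = zarr.zeros((8, 7), chunks=(3, 3))
z[:] = np.arange(z.size).reshape(z.shape)
assert len(list(iter_with_chunk_cache(z))) == 8
```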
If you care about the order in which elements come out, this iteration is only cheap if there is a single chunk along the last dimension (for order='C') or the first dimension (for order='F'). Anything else will require some handling of potentially large slices and complicated chunk caches whose behaviour depends on the chunking of the array (I like this image from the dask docs that illustrates the problem). So perhaps we bring it back but just add chunk-size checks to avoid potentially large memory usage. It is always possible to just iterate through blocks, flatten that, and yield elements, as long as you do not care about order. Dask just added this recently and it can be quite useful.
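As a sketch of that last idea (order-agnostic iteration over blocks, loosely modelled on what the comment attributes to dask, and again using the zarr 2-style `cdata_shape`/`blocks` attributes from earlier in the thread; `iter_elements_unordered` is a hypothetical name):

```python
import itertools

import numpy as np
import zarr


def iter_elements_unordered(z):
    """Yield every element of z, reading each stored chunk exactly once.

    Elements come out in chunk-grid order, not numpy's row-major order.
    """
    grid = (range(n) for n in z.cdata_shape)
    for idx in itertools.product(*grid):
        yield from z.blocks[idx].ravel()


z = zarr.zeros((8, 7), chunks=(3, 3))
z[:] = np.arange(z.size).reshape(z.shape)
assert sorted(iter_elements_unordered(z)) == list(range(z.size))
```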
That would be good.
Zarr version
3.0.0b2
Numcodecs version
0.14.0
Python Version
3.11
Operating System
Mac
Installation
pip
Description
Iterating over elements or slices of a Zarr array `z` using `iter(z)` is much slower in v3 than v2. In v2, `Array` implements `__iter__`, which caches chunks, whereas in v3 `Array` does not implement `__iter__`, so `iter(z)` falls back to calling `z[0]`, `z[1]`, etc., which means that every element or slice has to load the chunk again.

This seems like a regression, but perhaps there was a reason for not implementing `__iter__` on `Array`? The thing is that code will still work in v3, but will be a lot slower (possibly orders of magnitude), and it's quite hard to track down what is happening.

(There's an interesting Array API issue about iterating over arrays which may be relevant here: data-apis/array-api#818.)
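As background on that fallback, a small self-contained illustration (not zarr code; `FakeArray` is a made-up stand-in) of Python's legacy sequence iteration protocol: when a class defines `__getitem__` but not `__iter__`, `iter()` returns an iterator that calls `obj[0]`, `obj[1]`, ... until `IndexError` is raised, so every step goes back through the full indexing path.

```python
class FakeArray:
    """Stand-in for an array type that defines __getitem__ but not __iter__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, i):
        if i >= self.n:
            raise IndexError(i)
        print(f"__getitem__({i})")  # each iteration step re-enters the indexing path
        return i


list(iter(FakeArray(3)))
# prints __getitem__(0), __getitem__(1), __getitem__(2), stops on IndexError,
# and returns [0, 1, 2]
```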
Steps to reproduce
See sgkit-dev/bio2zarr#288 (comment)
Additional output
No response