zero check in encoding less than ideal for lazy arrays #3627

@hmaarrfk

Description

The zero check in the encoding path is less than ideal:

if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(

For lazy arrays, it eventually calls:

  1. np.broadcast_arrays
  2. np.array_equal

where the latter can instantiate a full array in memory.

The problem is that some lazy array implementations will not keep this intermediate, fully materialized array in memory for the operation two lines down that eventually writes the array.

So if the instantiation of the lazy array is costly, we are:

  1. Instantiating it once to check for zeros.
  2. Instantiating it a second time to write it to the store.

I would love to avoid this "double cost" in a Pythonic way.
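To make the double cost concrete, here is a minimal sketch (LazyArray is a hypothetical toy class, not a zarr or dask type) where the expensive compute runs once for the zero check and a second time for the write:

import numpy as np

class LazyArray:
    # Toy lazy array: every __array__ call re-runs the expensive compute.
    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.materializations = 0

    def __array__(self, dtype=None, copy=None):
        self.materializations += 1  # stand-in for the costly computation
        return np.zeros(self.shape, dtype=dtype or self.dtype)

lazy = LazyArray((1024, 1024))
np.array_equal(lazy, 0)       # first materialization: the zero check
np.asarray(lazy)              # second materialization: the write path
print(lazy.materializations)  # -> 2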

My proposal would have been to refactor the zero check to use np.array_equiv (https://numpy.org/devdocs/reference/generated/numpy.array_equiv.html#numpy.array_equiv):

>>> import numpy as np
>>> np.array_equiv(np.zeros((3, 4)), 0)
True

but looking at the source, it seems to recreate the same problem: https://github.com/numpy/numpy/blob/main/numpy/_core/numeric.py#L2552-L2598

I'm not sure what the right solution is.

One could cache the local array.

I notice that the original authors already considered this operation inefficient, leaving a note like:

        # use array_equal to obtain equal_nan=True functionality
        # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
        # every single time we have to write data?
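For a scalar fill value, one cheaper path (a sketch of the idea only, not the actual zarr code; all_equal_scalar is a made-up name) is to compare directly against the scalar instead of broadcasting a fill-value array:

import numpy as np

def all_equal_scalar(arr, fill, equal_nan=True):
    # Compare against a scalar without allocating a broadcast fill array.
    arr = np.asarray(arr)
    if equal_nan and arr.dtype.kind in "fc" and np.isnan(fill):
        return bool(np.isnan(arr).all())
    return bool((arr == fill).all())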

Where this is a problem for me:

The numpy broadcast operation is actually really difficult to implement efficiently if you want high performance based on the chunks of the array. My experience with dask from 7 years ago reminds me that efficient rechunking was difficult for them to implement, too.

So implementing np.broadcast_arrays just feels like "a lot of work"....

_data, other = np.broadcast_arrays(self._data, other)

The zero check feels like it makes sense... but it's "a lot of work to do" for some array backends.
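A complementary option (purely hypothetical; the hook name and protocol here are illustrative, not zarr API) would be to let the backend answer the question cheaply before falling back to materializing:

import numpy as np

def chunk_all_equal(data, value):
    # Ask the backend first: a lazy array may know from its own metadata
    # that it is constant, without computing anything.
    backend_check = getattr(data, "all_equal", None)
    if callable(backend_check):
        result = backend_check(value)
        if result is not NotImplemented:
            return bool(result)
    arr = np.asarray(data)  # materialize once; reuse arr for the write
    return bool((arr == value).all())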


Suggested patch:

diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index f0d01566..54a2c09f 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -540,6 +540,11 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
         # every single time we have to write data?
+
+        # The array_equal operation below will effectively force the array
+        # into memory.
+        # So we cache it here.
+        self._data = np.asarray(self._data)
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
             self._data,
