Restore optimizations for NDBuffer.all_equal #2730

Draft · wants to merge 1 commit into main
Conversation

@y4n9squared commented Jan 18, 2025

Zarr 3.x has some performance regressions for certain write workloads (writing large chunks with a floating-point dtype).

This change modifies the implementation of `NDBuffer.all_equal` to use the same logic as Zarr 2.x's `zarr.util.all_equals`, which contains a number of important optimizations. A few mechanical changes were made to accommodate the fact that the subroutine is now a method of `NDBuffer` rather than a free function.
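
For context, here is a minimal sketch of the Zarr 2.x-style branching being restored, written as a free function in the shape of `zarr.util.all_equals` (the method in this PR wraps the same logic around `self._data`, and details may differ):

```python
from typing import Any

import numpy as np


def all_equal(value: Any, array: np.ndarray) -> bool:
    """Sketch: test whether every element of `array` equals `value`."""
    if value is None:
        return False
    if not value:
        # Falsy fill value (e.g. 0): a single truthy element is enough
        # to answer False, and np.any avoids any broadcast allocation.
        return not np.any(array)
    if np.issubdtype(array.dtype, np.floating) and np.isnan(value):
        # NaN fast path: scan the chunk with np.isnan instead of
        # broadcasting `value` to the full chunk shape.
        return bool(np.all(np.isnan(array)))
    # Generic scalar comparison; broadcasting a scalar allocates nothing
    # beyond the boolean result.
    return bool(np.all(value == array))
```

Compared with the broadcast-based approach, each branch touches the chunk exactly once and allocates at most a boolean result.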

This change is most impactful when writing large floating-point chunks, as

```python
np.all(np.isnan(self._data))
```

is significantly more efficient than calling

```python
_data, other = np.broadcast_arrays(self._data, np.nan)
np.array_equal(_data, other, equal_nan=True)
```

since `np.broadcast_arrays` potentially requires a large allocation -- the size of `self._data` -- and `np.array_equal` then needs to fetch double the number of cache lines.

On EC2 r7i.2xlarge:

```
In [20]: data = np.random.rand(512, 512, 8)

In [21]: %timeit np.all(np.isnan(data))
596 μs ± 179 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [22]: %%timeit
    ...: data_, other = np.broadcast_arrays(data, np.nan)
    ...: np.array_equal(data_, other, equal_nan=True)
    ...:
    ...:
2.66 ms ± 953 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

(Both numbers are faster on an M3 Max, but the relative slowdown is similar.)

With low-latency stores (e.g. local SSD), this results in double-digit percentage speed-ups for the workload referenced in the Zarr V3 blog post:

```python
import numpy as np
import zarr

za = zarr.create_array(
    "/tmp/foo.zarr",
    shape=(512, 512, 512),
    chunks=(512, 512, 8),
    dtype=np.float64,
    overwrite=True,
)

arr = np.random.rand(512, 512, 512)

za[:] = arr
```

For higher-latency stores, the improvement is still dramatic (10%+) when chunks have high compression ratios (e.g. `np.ones`), as in the variant below.
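
For reference, the high-compression-ratio case is just the snippet above with constant data (illustrative only):

```python
# Constant chunks compress extremely well, so fixed per-chunk costs such
# as the fill-value equality check make up a larger share of write time.
za[:] = np.ones((512, 512, 512), dtype=np.float64)
```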

For arrays larger than 1 GB, the improvement is even more pronounced.

Towards #2710

@d-v-b (Contributor) commented Jan 19, 2025

can we get a test for each conditional branch?
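
For illustration, a parametrized test along these lines would exercise each branch of the `all_equal` sketch shown earlier (hypothetical test with the sketch's `all_equal` in scope, assuming pytest):

```python
import numpy as np
import pytest


@pytest.mark.parametrize(
    ("value", "array", "expected"),
    [
        (None, np.zeros(4), False),               # `value is None` branch
        (0, np.zeros(4), True),                    # falsy-value / np.any branch
        (0, np.array([0.0, 1.0]), False),
        (np.nan, np.full(4, np.nan), True),        # NaN fast path
        (np.nan, np.array([np.nan, 1.0]), False),
        (3.5, np.full(4, 3.5), True),              # generic scalar comparison
    ],
)
def test_all_equal_branches(value, array, expected):
    assert all_equal(value, array) is expected
```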
