
behavior of is_total_slice for boundary chunks #757

Open
d-v-b opened this issue May 23, 2021 · 10 comments · Fixed by #2784
Labels: bug (Potential issues with the zarr-python library), V2 (Affects the v2 branch)


@d-v-b
Contributor

d-v-b commented May 23, 2021

The is_total_slice function is used to determine whether a selection corresponds to an entire chunk (is_total_slice -> True) or not (is_total_slice -> False).

I noticed that is_total_slice is always False for "partial chunks", i.e. chunks that extend past the bounds of the array and so are only partially filled:

import zarr
from zarr.indexing import BasicIndexer
from zarr.util import is_total_slice

# create an array with 1 full chunk and 1 partial chunk
a = zarr.open('test.zarr', path='test', shape=(10,), chunks=(9,), dtype='uint8', mode='w')

# iterate over the chunk selections and check whether each is "total"
for x in BasicIndexer(slice(None), a):
    print(x.chunk_selection, is_total_slice(x.chunk_selection, a._chunks))

Which prints this:

(slice(0, 9, 1),) True
(slice(0, 1, 1),) False

Although the last selection is smaller than a full chunk, it is "total" with respect to the region of the array covered by that chunk.
A direct consequence of this behavior is unnecessary chunk loading when performing array assignments: zarr uses the result of is_total_slice to decide whether to load existing chunk data. Because is_total_slice is always False for partial chunks, zarr always loads boundary chunks during assignment.

A solution would be to augment the is_total_slice function to account for partial chunks. I'm not sure at the moment how to do this exactly, but it shouldn't be hard. Happy to put together a PR if people agree that this is an issue worth addressing.
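
A minimal sketch of a position-aware check, assuming a hypothetical helper that also receives the chunk's offset within the array (the name and signature are illustrative, not from zarr):

def is_total_chunk_slice(chunk_selection, chunk_shape, chunk_offset, array_shape):
    # A boundary chunk's true extent is clipped by the array shape, so compare
    # the selection against the clipped length rather than the nominal one.
    for sl, c, off, dim_len in zip(chunk_selection, chunk_shape, chunk_offset, array_shape):
        true_len = min(c, dim_len - off)
        if sl.step not in (1, None) or sl.start != 0 or sl.stop != true_len:
            return False
    return True

# shape (10,), chunks (9,): the second chunk starts at offset 9 and holds only
# 1 element, so slice(0, 1) now counts as covering the whole chunk
assert is_total_chunk_slice((slice(0, 1, 1),), (9,), (9,), (10,))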

@jakirkham
Member

Yeah that seems reasonable

@fandreuz

fandreuz commented Oct 8, 2021

I ran some tests, and I think there is currently no way to achieve the desired behavior of is_total_slice: the modification proposed by @d-v-b relies on the relative position of the slice within the array, and that information is never passed to is_total_slice.

One possible solution would be to add a new (maybe optional) parameter to is_total_slice that communicates the relative position of the slice (i.e. its upper-left corner).

If that sounds reasonable, I would like to work on this problem.

@d-v-b
Contributor Author

d-v-b commented Oct 8, 2021

@fAndreuzzi you might want to pick up from my efforts here: https://github.com/d-v-b/zarr-python/tree/boundary_chunk_optimization

The strategy I'm taking is to intercept the arguments to is_total_slice with a new function, trim_chunks, which clamps chunk slices to the bounds of the array. This passes the core tests but currently fails test_parallel_append with the interprocess lock. There must be a race condition, but I have no idea where it's coming from :/
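
A rough sketch of the clamping idea, under the assumption that trim_chunks receives the chunk's grid coordinates (this is not the actual code from that branch):

def trim_chunks(chunk_coords, chunks, shape):
    # Effective shape of the chunk at grid position chunk_coords: chunks on
    # the boundary are clipped by the overall array shape.
    return tuple(min(c, s - i * c) for i, c, s in zip(chunk_coords, chunks, shape))

# feeding the clipped shape into is_total_slice makes the boundary chunk
# "total": is_total_slice((slice(0, 1, 1),), trim_chunks((1,), (9,), (10,)))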

@fandreuz

fandreuz commented Oct 9, 2021

@d-v-b Thank you very much for the hints!

@fandreuz

fandreuz commented Oct 9, 2021

Just a question to check whether I understood the key idea of zarr.

If we consider the example provided by @d-v-b, I would expect is_total_slice(slice(1, 10, 1), a._chunks) to be False, since the array is divided into chunks a[0:9] and a[9:10], so a[1:10] is not a chunk. However, this call returns True. Am I doing something wrong?
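
For context, the v2 check only compares the slice's span against the chunk size and never looks at its position; a simplified paraphrase (omitting the Ellipsis and slice(None) cases):

def is_total_slice_simplified(item, shape):
    # A slice is "total" if it spans exactly one chunk length with unit step;
    # where the slice starts is never examined.
    return all(
        isinstance(sl, slice)
        and sl.step in (1, None)
        and (sl.stop - sl.start) == dim_len
        for sl, dim_len in zip(item, shape)
    )

assert is_total_slice_simplified((slice(1, 10, 1),), (9,))  # position ignored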

@dcherian
Contributor

This also affects reads. E.g., when dask chunks exactly overlap with Zarr chunks, it should be possible to decode directly into a memory buffer.

Instead, because is_total_slice is False for boundary chunks, all boundary chunks are always uselessly copied one extra time.

It should be easy to tell _process_chunk the expected chunk shape for that particular chunk and pass that to is_total_slice instead of self._chunks (the loop that calls _process_chunk knows the chunk key).
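
A sketch of that computation, assuming v2-style "."-separated chunk keys (chunk_shape_for_key is a hypothetical name):

def chunk_shape_for_key(ckey, chunks, shape):
    # Parse the chunk's grid coordinates from its storage key and clip the
    # nominal chunk shape by the array bounds.
    coords = tuple(int(k) for k in ckey.split("."))
    return tuple(min(c, s - i * c) for i, c, s in zip(coords, chunks, shape))

# shape (10,), chunks (9,): key "0" -> (9,), key "1" -> (1,)
assert chunk_shape_for_key("1", (9,), (10,)) == (1,)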

Here is an example profile in which 18 chunks take the fast path and 9 chunks are uselessly copied in the last line:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2013                                               def _process_chunk(
  2014                                                   self,
  2015                                                   out,
  2016                                                   cdata,
  2017                                                   chunk_selection,
  2018                                                   drop_axes,
  2019                                                   out_is_ndarray,
  2020                                                   fields,
  2021                                                   out_selection,
  2022                                                   partial_read_decode=False,
  2023                                               ):
  2024                                                   """Take binary data from storage and fill output array"""
  2025       126      25000.0    198.4      0.0          if (
  2026        27      58000.0   2148.1      0.0              out_is_ndarray
  2027        27       6000.0    222.2      0.0              and not fields
  2028        27     572000.0  21185.2      0.2              and is_contiguous_selection(out_selection)
  2029        27     244000.0   9037.0      0.1              and is_total_slice(chunk_selection, self._chunks)
  2030        18       8000.0    444.4      0.0              and not self._filters
  2031        18      41000.0   2277.8      0.0              and self._dtype != object
  2032                                                   ):
  2033        18      37000.0   2055.6      0.0              dest = out[out_selection]
  2034                                                       # Assume that array-like objects that doesn't have a
  2035                                                       # `writeable` flag is writable.
  2036        18      16000.0    888.9      0.0              dest_is_writable = getattr(dest, "writeable", True)
  2037        36       8000.0    222.2      0.0              write_direct = dest_is_writable and (
  2038        18      24000.0   1333.3      0.0                  (self._order == "C" and dest.flags.c_contiguous)
  2039                                                           or (self._order == "F" and dest.flags.f_contiguous)
  2040                                                       )
  2041        18       4000.0    222.2      0.0              if write_direct:
  2042                                                           # optimization: we want the whole chunk, and the destination is
  2043                                                           # contiguous, so we can decompress directly from the chunk
  2044                                                           # into the destination array
  2045        18      11000.0    611.1      0.0                  if self._compressor:
  2046        18       7000.0    388.9      0.0                      if isinstance(cdata, PartialReadBuffer):
  2047                                                                   cdata = cdata.read_full()
  2048        18  105409000.0    6e+06     42.5                      self._compressor.decode(cdata, dest)
  2049                                                           else:
  2050                                                               if isinstance(cdata, UncompressedPartialReadBufferV3):
  2051                                                                   cdata = cdata.read_full()
  2052                                                               chunk = ensure_ndarray_like(cdata).view(self._dtype)
  2053                                                               chunk = chunk.reshape(self._chunks, order=self._order)
  2054                                                               np.copyto(dest, chunk)
  2055        18      11000.0    611.1      0.0                  return
  2056                                           


....


  2093         9   80870000.0    9e+06     32.6          chunk = self._decode_chunk(cdata)
  2094                                           
  2095                                                   # select data from chunk
  2096         9       4000.0    444.4      0.0          if fields:
  2097                                                       chunk = chunk[fields]
  2098         9      27000.0   3000.0      0.0          tmp = chunk[chunk_selection]
  2099         9       1000.0    111.1      0.0          if drop_axes:
  2100                                                       tmp = np.squeeze(tmp, axis=drop_axes)
  2101                                           
  2102                                                   # store selected data in output
  2103                                                   # import ipdb; ipdb.set_trace()
  2104         9   60815000.0    7e+06     24.5          out[out_selection] = tmp

@rabernat
Contributor

I filed a related issue recently as well: #1730

@dstansby
Contributor

Do folks think addressing this is a bugfix or an enhancement?

@d-v-b
Contributor Author

d-v-b commented Dec 28, 2024

I think it's a bug if we are still reading boundary chunks unnecessarily, but I'm not sure whether this is happening in v3.

@dstansby added the bug label Dec 28, 2024
@dstansby added the V2 label Jan 11, 2025
dcherian added a commit to dcherian/zarr-python that referenced this issue Jan 31, 2025
Uses `slice(..., None)` to indicate that a `chunk_selection`
ends at the boundary of the current chunk. Also does so for a last
chunk that is shorter than the chunk size.

`is_total_slice` now understands this convention, and correctly
detects boundary chunks as total slices.

Closes zarr-developers#757
dcherian added a commit that referenced this issue Feb 12, 2025
* Skip reads when completely overwriting boundary chunks

Uses `slice(..., None)` to indicate that a `chunk_selection`
ends at the boundary of the current chunk. Also does so for a last
chunk that is shorter than the chunk size.

`is_total_slice` now understands this convention, and correctly
detects boundary chunks as total slices.

Closes #757

* normalize in codec_pipeline

* Revert "normalize in codec_pipeline"

This reverts commit 234431cd6efb661c53e2a832a0e4ea4dca772c1b.

* Partially Revert "Skip reads when completely overwriting boundary chunks"

This reverts commit edbba37.

* Different approach

* fix bug

* add oindex property test

* more complex oindex test

* cleanup

* more oindex

* Add changelog entry

* [revert] note

* fix for numpy 1.25

---------

Co-authored-by: Davis Bennett <[email protected]>
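
A rough illustration of the convention described in the commit message (simplified, not the code from the PR):

def normalize_chunk_slice(sel, chunk_len):
    # A selection that runs to the end of the current chunk, including a short
    # final chunk, is rewritten with stop=None.
    stop = None if sel.stop >= chunk_len else sel.stop
    return slice(sel.start, stop, sel.step)

def covers_chunk(sel):
    # stop=None unambiguously means "up to the chunk boundary"
    return sel.start == 0 and sel.stop is None and sel.step in (1, None)

# the final chunk of a (10,)-shaped array with (9,)-sized chunks is 1 long
assert covers_chunk(normalize_chunk_slice(slice(0, 1, 1), chunk_len=1))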
@dcherian
Contributor

dcherian commented Feb 12, 2025

Re-opening because I haven't looked at whether the read path is still affected.

@dcherian reopened this Feb 12, 2025