Skip to content

Feature/zarr #675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jun 3, 2025
Merged

Feature/zarr #675

merged 15 commits into from
Jun 3, 2025

Conversation

Oisin-M
Copy link
Contributor

@Oisin-M Oisin-M commented Apr 9, 2025

Implements a simple zarr source for reading zarr files (only version 2 at the moment)

@Oisin-M
Copy link
Contributor Author

Oisin-M commented Apr 9, 2025

Tests are failing because zarr version 3 requires Python version 3.11 onward

@sandorkertesz
Copy link
Collaborator

Hi @Oisin-M, thank you very much for this development. I had a quick look at the code and it seems to me that the "zarr" source is not really needed.

@chpolste
Copy link
Member

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

@sandorkertesz
Copy link
Collaborator

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

@chpolste, thank you for the explanation. I agree, this should be a source, just like e.g. "opendap" .

Copy link
Collaborator

@sandorkertesz sandorkertesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this new source! Once it is in sync with the latest develop it can be merged.

@Oisin-M
Copy link
Contributor Author

Oisin-M commented May 26, 2025

@sandorkertesz synced with develop now. Just to flag

  • zarr 3 needs python >= 3.11
  • I added zarr to the pip install .[all] install, so it will complain about not finding a version if using an older python version
  • downstream-ci is failing because it uses python 3.10

If you think it's better, can remove from the .[all] for now

@sandorkertesz
Copy link
Collaborator

sandorkertesz commented Jun 2, 2025

@Oisin-M unfortunately we cannot afford disabling all the tests due to this dependency. So I suggest you remove the zarr dependency from environment-unit-tests.yml and from "all" in pyproject.toml. It will be re-enabled once the CI python version is updated.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 50.00000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 90.92%. Comparing base (75f086e) to head (bc17b5e).

Files with missing lines Patch % Lines
tests/sources/test_zarr.py 50.00% 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #675      +/-   ##
===========================================
- Coverage    90.98%   90.92%   -0.06%     
===========================================
  Files          167      168       +1     
  Lines        12663    12679      +16     
  Branches       613      614       +1     
===========================================
+ Hits         11521    11529       +8     
- Misses         960      967       +7     
- Partials       182      183       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Oisin-M
Copy link
Contributor Author

Oisin-M commented Jun 2, 2025

Hi @sandorkertesz, thanks for taking a look. I have now made zarr optional and also made the zarr test automatically skip if it can't find the zarr package, so tests are all passing

@sandorkertesz sandorkertesz merged commit 1eae98b into develop Jun 3, 2025
245 of 246 checks passed
@sandorkertesz
Copy link
Collaborator

sandorkertesz commented Jun 10, 2025

@Oisin-M , I have a question about the source name. What was the reason of choosing "xarray_zarr" and not simply "zarr"? E.g. the "opendap" source is also using xarray to open the data but it does not have "xarray" in the name.

@Oisin-M
Copy link
Contributor Author

Oisin-M commented Jun 11, 2025

Hi @sandorkertesz, this was to leave room for having a reader based on the zarr API directly

@sandorkertesz
Copy link
Collaborator

@Oisin-M, for simplicity, I suggest that the name should be just "zarr". If you want to add a zarr API based reader in the future it could be controlled by adding a key e.g. "reader" to "from_source". E.g.

from_source("zarr", ..., reader="xarray")
from_source("zarr", ..., reader="zarr")

What do you think?

@Oisin-M
Copy link
Contributor Author

Oisin-M commented Jun 13, 2025

Hi @sandorkertesz, I'd be happy to move towards that as long as we return objects with the same methods for both readers

@sandorkertesz
Copy link
Collaborator

Well that is the catch. Not sure what from_source("zarr", ..., reader="zarr") would return. But I still think that the first argument in from_source() should describe the source and not the way we read it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants