Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support async FSMap objects in zarr.open #2774

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

maxrjones
Copy link
Member

@maxrjones maxrjones commented Jan 28, 2025

This is rough code, but I made some progress on supporting FSMap types and wanted to open a PR for early feedback. This isn't a priority for me, so I'd welcome anyone to take over this PR and/or close it and work in a different PR.

Addresses #2706

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 28, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Jan 28, 2025

I recall there being issues with fsmap before, but I confess I don't really know what an fsmap is -- can someone explain what an fsmap is, how it differs from an fsspec filesystem, and why people would use one over the other?

cc @martindurant

I have a vague feeling that it could be useful to have a Store class that wraps generic MutableMapping instances, and maybe fsmaps could go there, but that requires me knowing more about the user context for fsmap.

@martindurant
Copy link
Member

FSMap was created specifically for the needs of Zarr, and it could have been essentially the same as the v2 FSStore, but was much quicker to get out and working within dask/fsspec.

FSMap is a dict-compatible interface (mutable-mapping) on top of a FS instance, which zarr worked with since forever and ignores some FS functionality like the file-like API.

To make it work with v3 might be complex, because zarr makes async calls, and the FSMap interface is blocking, even if the underlying FS is async. That means that there will be sync() within sync(), which might still work as zarr maintains its own event loop in a thread separate from fsspec's.

@martindurant
Copy link
Member

To be clear: this PR does not use the mapper, but constructs a normal store from the mapper's attributes. I support this path.

@maxrjones
Copy link
Member Author

thanks for the questions and answers, Davis and Martin!

IIUC there are three cases that we'd need to account for:

  1. FSMap wraps an async instance of an async-compatible filesystem
    Solution - as implemented in this PR, extract the wrapped filesystem and use it to open an FsspecStore
  2. FSMap wraps a non-async instance of a non-async-compatible filesystem
    Solution option 1 - extract the wrapped filesystem and wrap in AsyncFileSystemWrapper to open an FsspecStore as in Wrap sync fs for xarray.to_zarr #2533
    Solution option 2 - if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore, for other protocols, wrap in AsyncFileSystemWrapper to open an FsspecStore
  3. FSMap wraps a non-async instance of an async-compatibile filesystem
    Solution - this is the case I'm not sure about. Is there a way to convert from a sync instance to an async instance without needing to wrap it in AsyncFileSystemWrapper? @martindurant could you please offer guidance here?

@martindurant
Copy link
Member

Is there a way to convert from a sync instance to an async instance without needing to wrap it in AsyncFileSystemWrapper

The instance has all the arguments it was made with as attributes, so you can make a new instance with asynchronous=True from those.

if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore

Is there a reason to bother doing this?

@maxrjones
Copy link
Member Author

maxrjones commented Jan 28, 2025

if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore

Is there a reason to bother doing this?

I was thinking based on https://filesystem-spec.readthedocs.io/en/latest/async.html#limitations that LocalStore may be faster since it was designed around providing an async interface.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 28, 2025

IMO it's simpler if the path is always fsmap -> fsspecstore

@martindurant
Copy link
Member

LocalStore may be faster since it was designed around providing an async interface.

I doubt it. The disk is not really async (at least with the standard syscalls python uses), so none of the async code should make any difference for local reads at all.

@maxrjones
Copy link
Member Author

maxrjones commented Feb 7, 2025

I ran into #2808 when trying to wrap sync filesystems. My preference for this PR would be to bring the codecov up to 100% and request that this be considered for review, with the plan that the conversion of sync instances to async and the wrapping of sync filesystems could be handled separately. My reasoning is that some people could use this current implementation and I'm not sure how quickly I'll be able to find time to add the other missing features.

Update: I added conversion of sync instances to async

@maxrjones maxrjones changed the title WIP: Support fsspec mutable mapping objects in zarr.open Support async FSMap objects in zarr.open Feb 13, 2025
@maxrjones maxrjones marked this pull request as ready for review February 13, 2025 01:44
@dcherian
Copy link
Contributor

@martindurant are you able to give this a review today? We'd like to get out a release with this bugfix asap.

@maxrjones
Copy link
Member Author

maxrjones commented Feb 14, 2025

FYI I believe zarr.open should now work with both async and sync instances of async filesystem implementations that are created using get_mapper (e.g., s3, gcs). However, local filesystems seem broken with details in #2808. TBH I'm not sure if there are other sync implementations and whether they work. This bug already exists for the from_url approach as far as I can tell (I'm not sure if xarray gets around this or it just wasn't caught by tests). I'm not sure how to approach the fact that the bug would be exposed more to users with the new from_mapper function. Perhaps we should raise NotImplementedError for sync implementations in from_wrapper to prevent more people from running into the bug?

@dcherian
Copy link
Contributor

Perhaps we should raise NotImplementedError for sync implementations in from_wrapper to prevent more people from running into the bug?

This sounds good to me. It's better to raise informative error messages when we can.

@martindurant
Copy link
Member

how to approach the fact that the bug would be exposed more to users with the new from_mapper function

These people are currently broken, so I don't think that giving them a "doesn't work" error is any better.

@maxrjones
Copy link
Member Author

@martindurant I get some of these warnings if I only run the tests in test/test_store/test_fsspec.py, but not if I run the full test suite. Do you have advice on whether/how this should be avoided?

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x11f2b5d10>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x118259a90>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x107b0ec10>

@maxrjones
Copy link
Member Author

how to approach the fact that the bug would be exposed more to users with the new from_mapper function

These people are currently broken, so I don't think that giving them a "doesn't work" error is any better.

Your advice in #2808 (comment) works - in this PR we now raise an informative error if auto_mkdir=True is not set in a LocalFilesystem.

@martindurant
Copy link
Member

I suppose that's OK, but why doesn't the store make directories? Of course, for object stores it doesn't matter...

@maxrjones
Copy link
Member Author

FYI I believe this is ready for review, ideally by @martindurant

@martindurant
Copy link
Member

There is one failure due to a not-captured warning. It will need to be handled to get the greens.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments amount to only nitpicks.

@@ -228,7 +234,7 @@ def __eq__(self, other: object) -> bool:


async def make_store_path(
store_like: StoreLike | None,
store_like: StoreLike | FSMap | None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this fail if fsspec isn't present, because FSMap isn't imported? Maybe you need

if _has_fsspec:
    from fsspec.mapping import FSMap
else:
    FSMap = None

or make fsspec a required dependency.

@@ -311,9 +317,18 @@ async def make_store_path(
# We deliberate only consider dict[str, Buffer] here, and not arbitrary mutable mappings.
# By only allowing dictionaries, which are in-memory, we know that MemoryStore appropriate.
store = await MemoryStore.open(store_dict=store_like, read_only=_read_only)
elif _has_fsspec:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you

Suggested change
elif _has_fsspec:
elif _has_fsspec and isinstance(store_like, FSMap):

and then you fall to the same exception in else rather than having the Type Error twice

else:
msg = f"Unsupported type for store_like: '{type(store_like).__name__}'" # type: ignore[unreachable]
raise TypeError(msg)
raise (TypeError(f"Unsupported type for store_like: '{type(store_like).__name__}'"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linter requires the redundant outer parentheses?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this is not covered; switching the condition as I suggested above would make it so, or making a test explicitly with a bad type of store.

Copy link
Member Author

@maxrjones maxrjones Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check on the linter issue, as for the bad coverage I think we still need this in case fsspec isn't installed. #2872 should fix the code coverage gap, but it's taking longer than I expected to finish that PR because it uncovered other issues with the test matrix design (I'm working on it but might not finish today)

@rossbar
Copy link
Contributor

rossbar commented Feb 28, 2025

A quick note - I tried this PR on the example I dropped in #2706 and got the following:

# See linked example for MRE setup
>>> z = zarr.group(store=store)
Traceback (most recent call last)
   ...
AttributeError: module 'fsspec.implementations' has no attribute 'asyn_wrapper'

Now, there may very well be a problem with my example rather than this PR - all I can say is that the example does work with zarr v2.

@maxrjones
Copy link
Member Author

A quick note - I tried this PR on the example I dropped in #2706 and got the following:

# See linked example for MRE setup
>>> z = zarr.group(store=store)
Traceback (most recent call last)
   ...
AttributeError: module 'fsspec.implementations' has no attribute 'asyn_wrapper'

Now, there may very well be a problem with my example rather than this PR - all I can say is that the example does work with zarr v2.

Thanks for the feedback! Could you please which version of fsspec you're using?

@rossbar
Copy link
Contributor

rossbar commented Mar 1, 2025

Thanks for the feedback! Could you please which version of fsspec you're using?

The fsspec version is 2025.2.0: full output of pip list below!

env
Package                 Version
----------------------- ---------------------
asttokens               3.0.0
asyncssh                2.20.0
cffi                    1.17.1
crc32c                  2.7.1
cryptography            44.0.1
decorator               5.2.1
Deprecated              1.2.18
donfig                  0.8.1.post1
executing               2.2.0
fsspec                  2025.2.0
ipython                 9.0.0
ipython_pygments_lexers 1.1.1
jedi                    0.19.2
matplotlib-inline       0.1.7
numcodecs               0.15.1
numpy                   2.2.3
packaging               24.2
parso                   0.8.4
pexpect                 4.9.0
pip                     25.0
prompt_toolkit          3.0.50
ptyprocess              0.7.0
pure_eval               0.2.3
pycparser               2.22
Pygments                2.19.1
PyYAML                  6.0.2
sshfs                   2025.2.0
stack-data              0.6.3
traitlets               5.14.3
typing_extensions       4.12.2
wcwidth                 0.2.13
wrapt                   1.17.2
zarr                    3.0.5.dev24+gd4d22560

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants