Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiZarrToZarr does not pass target_options to the reference filesystems #522

Open
reweeden opened this issue Oct 23, 2024 · 3 comments
Open

Comments

@reweeden
Copy link

reweeden commented Oct 23, 2024

Hi there! I was trying to get MultiZarrToZarr to work on a set of files that have been gzip compressed. Fsspec's fsspec.open function has a compression parameter that can be passed to tell fsspec to decompress the files on the fly. I have gotten this to work using zarr files by doing something like this:

fsspec.filesystem(
    "reference",
    fo="s3://bucket/my-compressed-zarr-file.json.gz",
    target_options={'compression': 'gzip'},
    remote_protocol='s3',
    remote_options={
        'anon': False,
    },
)

I was hoping that this would also work with MultiZarrToZarr and be equally as simple as passing the target_options, however, it seems that while MultiZarrToZarr opens the files using the target_options here to get the file name:

for of in fsspec.open_files(self.path, **self.target_options):

It then does not pass the target options to the fsspec.filesystem call here:

fsspec.filesystem(
"reference",
fo=fo,
remote_protocol=self.remote_protocol,
remote_options=self.remote_options,
)
for fo in fo_list

Is this a bug or is there some reason why the target_options can't be passed to fsspec.filesystem there?

Analysis of Compressed files

From what I can tell by running this with a debugger, this is what's happening:

  1. The call to open_files here succeeds:
    for of in fsspec.open_files(self.path, **self.target_options):
  2. The call to fs.cat does not include the target_options so the file is not decompressed when read:
    fo_list = fs.cat(self.path)
  3. The json decoding fails because it's trying to decode the compressed data and the fo_list is set to the original list of filenames.
    fo_list = self.path
  4. The fsspec.filesystem call opens the files again, but since the target_options are not passed in, the data is not decompressed and the json decoding fails:
    fsspec.filesystem(

Solution

I think there are 2 additional places that the target options need to be passed in

  1. In the fs.cat call via the **kwargs parameter:
    fo_list = fs.cat(self.path)

API Docs for fs.cat: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem
Source code for fs.cat: https://github.com/fsspec/filesystem_spec/blob/4517882f67d635d50b54cd53fd04ee3a37b6943c/fsspec/spec.py#L844

EDIT: After trying this out, it seems that the s3fs implementation for cat_file doesn't work the way the syncronous abstract class does where the kwargs are passed to a call to fs.open. The s3fs _cat_file doesn't support kwargs at all: https://github.com/fsspec/s3fs/blob/f3f63cbfbfe71a4355abd63cafd8c678c4a5a0af/s3fs/core.py#L1113

  1. In the fs.filesystem call:
    fsspec.filesystem(

Workaround

I believe I can work around this by opening the files myself and passing in the zarr dictionaries directly. It's just more code for me to write :)

@martindurant
Copy link
Member

OK, so target_options in your case is something that needs to go to open(), not something that the whole target filesystem has configured. Since the file in question was just opened correctly a couple of lines above, I'm not sure why it's being unbundled like this.

with of as f_list:
     fo_list = [ujson.loads(v) for v in f_list]

@reweeden
Copy link
Author

reweeden commented Oct 29, 2024

I was looking through the repo history and it seems that in the past the of objects returned by fsspec.open were passed to the filesystem call directly. But then at some point it was changed to only pass the of.full_name attributes instead, causing the filesystem to have to re-open the files.

@martindurant
Copy link
Member

Please do make a PR to pass the options for now. We can consider whether it's possible to fuse the open() calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants