Optimizing getsize_prefix further #686

Closed
rabernat opened this issue Feb 6, 2025 · 0 comments · Fixed by #697

rabernat commented Feb 6, 2025

One great thing about Icechunk is that the manifests store info about all of the chunks, making it in theory fast to know the on-disk size of very large arrays. (This is slow with vanilla Zarr because it requires listing the object store.)

Today I discovered we are not quite there yet though.

import icechunk as ic
import zarr

storage = ic.s3_storage(
    bucket="icechunk-public-data",
    prefix="v01/era5_weatherbench2",
    region="us-east-1",
    anonymous=True,
)

repo = ic.Repository.open(storage=storage)
session = repo.readonly_session(branch="main")

group = zarr.open_group(session.store, zarr_format=3, mode="r")
a = group['1x721x1440/2m_temperature']
a.info_complete()

The last line did not finish even after several minutes of waiting; I never had the patience to let it complete.

From @paraseba on Slack:

basically we implemented getsize that takes a key, so python is still listing all keys and calling this ... which is slow because python <-> rust...
it's the switching between python and rust back and forth for every key that is slow

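To fully optimize this, I think we need to implement the store.getsize_prefix method on the Rust side, so that the sizes for a whole prefix are summed there and Python makes a single call instead of one call per key.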
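To illustrate the difference between the two approaches, here is a minimal sketch in pure Python. It does not use Icechunk or Zarr: `ToyBackend`, `size_per_key`, and `size_single_call` are hypothetical names invented for this example, and the `calls` counter only simulates the number of Python <-> Rust crossings described above.

```python
import asyncio


class ToyBackend:
    """Hypothetical stand-in for the Rust side: maps chunk keys to byte sizes."""

    def __init__(self, sizes):
        self._sizes = dict(sizes)
        self.calls = 0  # simulated number of Python -> backend crossings

    async def list_prefix(self, prefix):
        self.calls += 1
        return [k for k in self._sizes if k.startswith(prefix)]

    async def getsize(self, key):
        self.calls += 1
        return self._sizes[key]

    async def getsize_prefix(self, prefix):
        # The whole sum is computed "inside" the backend in one crossing.
        self.calls += 1
        return sum(v for k, v in self._sizes.items() if k.startswith(prefix))


async def size_per_key(backend, prefix):
    """List every key, then ask for each size individually (one call per key)."""
    keys = await backend.list_prefix(prefix)
    total = 0
    for key in keys:
        total += await backend.getsize(key)
    return total


async def size_single_call(backend, prefix):
    """Delegate the whole computation to the backend in a single call."""
    return await backend.getsize_prefix(prefix)
```

With N chunks under a prefix, the per-key version makes N + 1 calls across the boundary, while the prefix version makes one. Both return the same total, so the change affects only how many crossings are needed.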

@paraseba paraseba self-assigned this Feb 7, 2025
@paraseba paraseba moved this to In review in Icechunk 1.0 Feb 7, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in Icechunk 1.0 Feb 8, 2025