file type discovery for the parquet format #519
Comments
Given the new manifests coming, I'm not sure if this is a good time to do this. |
I actually agree. Changing the file format would only help people who write new references in the kerchunk format, but now that icechunk exists I would recommend that people write references in that format from now on instead. |
You still need to kerchunk-scan the original files at some point - how would you write these references? |
All I'd really need is some way to decide whether a directory named ".parquet" is a "kerchunk reference file". If we can rely on Otherwise I agree that if we have that, we don't really need another file and can just put the version info in there (something like
I don't think we'll be able to convince everyone to switch immediately, so having good tool support for the existing formats is still important. |
Yes, I think so |
If you use VirtualiZarr then Kerchunk gets called to do the scanning, then the references are in memory (as wrapped
import virtualizarr as vz
vds = vz.open_virtual_dataset("file.nc")  # uses kerchunk's SingleHDF5ToZarr under the hood, references are now in-memory
vds.virtualize.to_icechunk(icechunkstore)  # writes directly to Icechunk's on-disk format |
In zarr-developers/VirtualiZarr#251 (comment), we've been discussing the detection of the parquet reference files.
This turns out not to be easy: while the directory usually has a .parquet suffix, the directory structure does not lend itself to quick checks.
Since the format currently does not have version information, I wonder if it would be possible to change fsspec's LazyReferenceMapper to write a small file (e.g. version.json or kerchunk.json) into the root directory of the "zarrquet" file? That would have the advantage of versioning the file format, and would also make the detection easier.
cc @norlandrhagen, @TomNicholas
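To make the proposal concrete, here is a rough sketch of what writing and checking such a marker file could look like. It is an illustration only, not how fsspec's LazyReferenceMapper behaves today. The file name kerchunk.json is one of the two names floated above, while the key kerchunk_parquet_version and the helpers write_marker, read_marker_version and is_reference_dir are hypothetical names made up for this example. It also assumes a local filesystem; a real implementation would presumably go through fsspec.

```python
import json
from pathlib import Path

# Assumption: marker name taken from the suggestion above, not an existing convention.
MARKER_NAME = "kerchunk.json"
# Assumption: hypothetical key used to store the format version.
FORMAT_KEY = "kerchunk_parquet_version"


def write_marker(root, version=1):
    """Write a small JSON marker file into the root of a reference directory."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    marker = root / MARKER_NAME
    marker.write_text(json.dumps({FORMAT_KEY: version}))
    return marker


def read_marker_version(root):
    """Return the stored format version, or None if there is no usable marker."""
    marker = Path(root) / MARKER_NAME
    try:
        content = json.loads(marker.read_text())
    except (OSError, ValueError):
        # Missing directory, missing file, or a file that is not valid JSON.
        return None
    if not isinstance(content, dict):
        return None
    return content.get(FORMAT_KEY)


def is_reference_dir(root):
    """Cheap detection: a single file read instead of walking the directory tree."""
    return read_marker_version(root) is not None
```

With something like this, a reader would only need to look at one small file to both recognise the directory and learn which version of the layout to expect.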