Nested HDF5 Data / HEC-RAS #490
Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think. The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.
This is Python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.
>>> import h5py
>>> import xarray as xr
>>> from kerchunk.hdf import SingleHdf5ToZarr
>>> ras_h5 = h5py.File("/mnt/c/temp/ElkMiddle.p01.hdf", "r")
>>> zmeta = SingleHdf5ToZarr(ras_h5, "file:///mnt/c/temp/ElkMiddle.p01.hdf").translate()
>>> import json
>>> with open("ElkMiddle.p01.hdf.json", "w") as z:
... z.write(json.dumps(zmeta, indent=2))
...
722087
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}})
>>> ds
<xarray.Dataset> Size: 0B
Dimensions: ()
Data variables:
*empty*
Attributes:
File Type: HEC-RAS Results
File Version: HEC-RAS 6.3.1 September 2022
Projection: PROJCS["USA_Contiguous_Albers_Equal_Area_Conic_USGS_versio...
Units System: US Customary
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}}, group="/Results/Unsteady/Output/Output Blocks/Base Output/Unsteady Time Series/2D Flow Areas/ElkMiddle")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/api.py", line 588, in open_dataset
backend_ds = backend.open_dataset(
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1188, in open_dataset
ds = store_entrypoint.open_dataset(
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/store.py", line 58, in open_dataset
ds = Dataset(vars, attrs=attrs)
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 713, in __init__
variables, coord_names, dims, indexes, _ = merge_data_and_coords(
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 427, in merge_data_and_coords
return merge_core(
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/merge.py", line 705, in merge_core
dims = calculate_dimensions(variables)
File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/variable.py", line 3009, in calculate_dimensions
raise ValueError(
ValueError: conflicting sizes for dimension 'phony_dim_1': length 33101 on 'Face Velocity' and length 14606 on {'phony_dim_0': 'Cell Cumulative Precipitation Depth', 'phony_dim_1': 'Cell Cumulative Precipitation Depth'}

The structure is complex. Any developer who has worked with RAS data could rant about it, but ultimately the point is that a helping hand is needed to extract data from RAS HDF5 files into nice xarray objects, hence the rashdf project. We considered using
Combining reference sets is not in the
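For context on the traceback above: the group can't be opened as one dataset because face-based and cell-based arrays share the same axis position but have different lengths, so the auto-generated phony dimension names collide. A minimal way to see this, reusing the file and group path from the failing call (a sketch, not part of the original discussion):

```python
import h5py

# Group path taken from the failing xr.open_dataset call above.
group = ("/Results/Unsteady/Output/Output Blocks/Base Output/"
         "Unsteady Time Series/2D Flow Areas/ElkMiddle")

with h5py.File("/mnt/c/temp/ElkMiddle.p01.hdf", "r") as f:
    for name, obj in f[group].items():
        if isinstance(obj, h5py.Dataset):
            # e.g. 'Face Velocity' -> (..., 33101) vs.
            # 'Cell Cumulative Precipitation Depth' -> (..., 14606):
            # same axis position, different lengths, hence the
            # conflicting 'phony_dim_1'.
            print(name, obj.shape)
```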
Seems like a reasonable assumption. Calling methods with leading underscores feels naughty to me, but if those methods are unlikely to change then maybe we're alright.
Sorry, didn't mean to close this.
That truly is a gnarly data hierarchy. I wonder whether the place kerchunk gets confused is groups which have both child group(s) and array(s). The zarr model does support that, but I wouldn't be too surprised if we have extra hidden assumptions. Writing specialised versions of the file scanner for a well-understood use case is of course fine, and I can try to help however I can.
It might be worthwhile figuring out what exactly is going wrong with pure kerchunk, but this doesn't touch most current users, as there is a lot of netCDF-compliant data, which is far simpler in structure. If you make interesting reference sets with your code, I'd be happy to see them and even better would be a blog post about your process :)
Using VirtualiZarr should make this part simpler to handle :)
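To make that concrete, here is a minimal sketch of the combine step with VirtualiZarr; the per-simulation filenames are hypothetical, and it assumes `open_virtual_dataset` can produce references for these HEC-RAS files (which, per the discussion above, may still require rashdf-specific scanning first):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Hypothetical per-simulation HEC-RAS output files.
paths = ["ElkMiddle.p01.hdf", "ElkMiddle.p02.hdf", "ElkMiddle.p03.hdf"]

# One "virtual" dataset per simulation: variables hold chunk references,
# not data. indexes={} avoids loading coordinate values eagerly.
virtual = [open_virtual_dataset(p, indexes={}) for p in paths]

# Concatenate along a new 'simulation' dimension, as described in the issue.
# coords="minimal"/compat="override" avoid comparing reference contents.
combined = xr.concat(virtual, dim="simulation", coords="minimal", compat="override")
```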
That all makes sense. FYI soon we should have
That's nasty 😆 But if you already have a (hacky) way of generating kerchunk references on a per-variable basis from an xarray dataset, it should be pretty straightforward to convert that into a virtualizarr "virtual" xarray dataset. You're basically just calling the internal function
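For reference, a rough sketch of that conversion using VirtualiZarr's manifest classes as documented at the time (ChunkManifest, ManifestArray, ZArray); the reference entries, shapes, dtype, and variable name below are made-up placeholders:

```python
import numpy as np
import xarray as xr
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray

# Hypothetical chunk references for one variable, in kerchunk's
# {chunk_key: [url, offset, length]} form.
refs = {
    "0.0": ["file:///mnt/c/temp/ElkMiddle.p01.hdf", 40120, 55936],
    "1.0": ["file:///mnt/c/temp/ElkMiddle.p01.hdf", 96056, 55936],
}

manifest = ChunkManifest(
    entries={k: {"path": v[0], "offset": v[1], "length": v[2]} for k, v in refs.items()}
)

# Array-level metadata that would normally come from the .zarray entry.
zarray = ZArray(shape=(2, 14606), chunks=(1, 14606), dtype=np.dtype("float32"),
                compressor=None, filters=None, fill_value=None,
                order="C", zarr_format=2)

marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)

# Wrap it in a plain xarray Dataset: this is the "virtual" dataset that can
# then be concatenated with others or written back out as references.
vds = xr.Dataset({"Cell Cumulative Precipitation Depth": (["time", "cell"], marr)})
```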
@TomNicholas, do you think the time has come to use Vzarr to do all that MultiZarrToZarr can do? Basically, the existing code in MZZ for finding the coordinates (the "mappers") for each input dataset is still useful, but then building the coordinates and outputting the references need not be duplicated. For cases where there is uneven chunking in the concat dimension(s), we can come up with the best way to represent that output, whichever manifest works. kerchunk could simply depend on VZarr.
IMO it's very close.
If you're willing to inline the indexed coordinates, then I think this is possible with VZ right now too, see zarr-developers/VirtualiZarr#18 (comment).
Currently we have the opposite: VirtualiZarr depends on kerchunk (for reading, and for writing to kerchunk json/parquet). I plan to make this dependency optional though - the codebase is already factored in such a way that this is effectively optional; it's only the tests that still mix concerns.
It really depends what you want to do with MZZ. In my head, MZZ is effectively already deprecated in favour of using vz and xarray to achieve the same combining operations. You could follow that and literally deprecate MZZ, leaving the kerchunk readers for use in other packages, especially to be called from
This is an orthogonal issue. Once uneven chunking is supported in the zarr spec, both generalizing vz's
We'd be figuring out the coordinates and putting them somewhere, wherever is most convenient. In many cases, it would be a single value per input data set. As I've said before, I'm indifferent to where the functionality ends up being implemented, so long as it exists! I think that being able to persist kerchunk references (or other manifests) is critical, though, so that a user needs only to open a thing without worrying about further options - I think everything is already in place for this to work.
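As a sketch of that persistence step, assuming the `combined` virtual dataset from the earlier VirtualiZarr example and an arbitrary output filename:

```python
import xarray as xr

# Persist the combined references so end users only need one open_dataset call.
combined.virtualize.to_kerchunk("elk_middle_combined.json", format="json")

# Anyone can then open the aggregate without knowing how it was built.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "elk_middle_combined.json"},
    },
)
```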
I'm working on development of the rashdf library for reading HEC-RAS HDF5 data. A big part of the motivation for development of the library is stochastic hydrologic/hydraulic modeling.
We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that e.g. results for many different stochastic flood simulations from a given RAS model can be opened as a single xarray Dataset. For example, results for 100 different simulations could be concatenated in a new `simulation` dimension, with coordinates being the index number of each simulation.

It took me a little while to figure out how to make that happen because RAS HDF5 data is highly nested and doesn't conform to typical conventions. The way I implemented it is hacky (a rough sketch of the per-variable step is included after the questions below):

1. Start with an `xr.Dataset` pulled from the HDF file and the path of each child `xr.DataArray` within the HDF file.
2. For each child dataset, get `filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)` and `storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)`.
3. Write the `xr.Dataset` to a `zarr.MemoryStore` with `compute=False`, to generate the framework of what's needed for the Zarr metadata.
4. Pull from the `zarr.MemoryStore` and decode, then combine the `zarr.MemoryStore` objects, `filters`, and `storage_info` into a dictionary and finally return.

I suppose my questions are:

- Could the `SingleHdf5ToZarr._decode_filters` and `_storage_info` methods be made public?
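To make the per-variable step above concrete, here is a rough, hypothetical sketch of building kerchunk-style references for a single HDF5 dataset with those private helpers. It is not rashdf's actual implementation, and the `.zarray` fields shown are simplified assumptions:

```python
import json

import h5py
from kerchunk.hdf import SingleHdf5ToZarr


def references_for_dataset(h5_path: str, dataset_path: str) -> dict:
    """Build a minimal kerchunk-style reference dict for one HDF5 dataset."""
    refs = {}
    with h5py.File(h5_path, "r") as f:
        dset = f[dataset_path]

        # Private kerchunk helpers: filters -> list of numcodecs codecs,
        # storage_info -> {chunk_index_tuple: {"offset": ..., "size": ...}}.
        filters = SingleHdf5ToZarr._decode_filters(None, dset)
        storage_info = SingleHdf5ToZarr._storage_info(None, dset)

        key = dataset_path.strip("/")

        # Simplified .zarray metadata; a real implementation also needs
        # fill values, attribute handling, and compressor/filter ordering.
        refs[f"{key}/.zarray"] = json.dumps(
            {
                "shape": list(dset.shape),
                "chunks": list(dset.chunks or dset.shape),
                "dtype": dset.dtype.str,
                "compressor": None,
                "filters": [flt.get_config() for flt in (filters or [])] or None,
                "fill_value": None,
                "order": "C",
                "zarr_format": 2,
            }
        )

        # One [url, offset, length] entry per stored chunk.
        for chunk_index, info in storage_info.items():
            chunk_key = f"{key}/" + ".".join(str(i) for i in chunk_index)
            refs[chunk_key] = [f"file://{h5_path}", info["offset"], info["size"]]

    return refs
```

A per-variable dict like this could then be merged with the group and attribute metadata produced by the `zarr.MemoryStore` step to form the final reference set.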