A Python package for accessing hydrological datasets with a unified API, optimized for deep learning workflows.
- **Unified Interface**: Consistent API across 27 hydrological datasets
- **Fast Access**: NetCDF caching for instant data loading
- **Standardized Variables**: Common naming across all datasets
- **Built on AquaFetch**: Powered by the comprehensive AquaFetch backend
- **ML-Ready**: Optimized for integration with torchhydro
This library has been redesigned to serve as a powerful data-adapting layer on top of the AquaFetch package.
While AquaFetch handles the complexities of downloading and reading numerous public hydrological datasets, hydrodataset takes the next step: it standardizes this data into a clean, consistent NetCDF (.nc) format. This format is specifically optimized for seamless integration with hydrological modeling libraries like torchhydro.
The core workflow is:
- **Fetch**: Use a `hydrodataset` class for a specific dataset (e.g., `CamelsAus`).
- **Standardize**: It uses `AquaFetch` as the primary backend for fetching raw data, while maintaining a consistent, unified interface across all datasets.
- **Cache**: On the first run, `hydrodataset` processes the data into an `xarray.Dataset` and saves it as `.nc` files (time series and attributes stored separately) in a local directory specified in `hydro_setting.yml` in your home directory, as sketched below.
- **Access**: All subsequent data requests are read directly from the fast `.nc` cache, giving you analysis-ready data instantly.
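For reference, here is a minimal `hydro_setting.yml` sketch (the paths are illustrative Linux-style examples; a Windows example appears in the quickstart below):

```yaml
local_data_path:
  root: '/home/user/data/waterism'                              # root data directory
  datasets-origin: '/home/user/data/waterism/datasets-origin'   # raw downloaded datasets
  cache: '/home/user/data/waterism/cache'                       # NetCDF cache location
```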
We strongly recommend using a virtual environment to manage dependencies.
We recommend using uv for fast, reliable package and environment management:
```bash
# Install uv if you haven't already
pip install uv

# Install hydrodataset with uv
uv pip install hydrodataset
```

For more advanced usage or to work on the project locally:
```bash
# Clone the repository
git clone https://github.com/OuyangWenyu/hydrodataset.git
cd hydrodataset

# Create virtual environment and install all dependencies
uv sync --all-extras
```

The `--all-extras` flag installs the base dependencies plus all optional dependencies for development and documentation.
If you prefer traditional pip:
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install hydrodataset
```

The primary goal of hydrodataset is to provide a simple, unified API for accessing various hydrological datasets. Here's a complete example showing the core workflow:
⚠️ **Important Note on First-Time Data Download**

If you haven't pre-downloaded the datasets, the first access will trigger automatic downloads via AquaFetch, which can take considerable time depending on dataset size:
- Small datasets (< 1GB, e.g., CAMELS-CL, CAMELS-COL): ~10-30 minutes
- Medium datasets (1-5GB, e.g., CAMELS-AUS, CAMELS-BR): ~30 minutes to 1 hour
- Large datasets (10-20GB, e.g., CAMELS-US, LamaH-CE): ~1-3 hours
- Very large datasets (> 30GB, e.g., HYSETS): ~3-6 hours or more
Download times vary based on your internet connection speed and server availability.
We strongly recommend downloading datasets manually during off-peak hours if possible.
After the initial download, all subsequent access will be fast thanks to NetCDF caching.
```python
from hydrodataset.camels_us import CamelsUs
from hydrodataset import SETTING
# All datasets are expected to be in the directory defined in your hydro_setting.yml
# An example of hydro_setting.yml on Windows looks like this:
# local_data_path:
# root: 'D:\data\waterism' # Update with your root data directory
# datasets-origin: 'D:\data\waterism\datasets-origin'
# cache: 'D:\data\waterism\cache'
data_path = SETTING["local_data_path"]["datasets-origin"]
# Initialize the dataset class
ds = CamelsUs(data_path)
# 1. Check which features are available
print("Available static features:")
print(ds.available_static_features)
print("Available dynamic features:")
print(ds.available_dynamic_features)
# 2. Get a list of all basin IDs
basin_ids = ds.read_object_ids()
# 3. Read static (attribute) data for a subset of basins
# Note: We use standardized names like 'area' and 'p_mean'
attr_data = ds.read_attr_xrdataset(
gage_id_lst=basin_ids[:2],
var_lst=["area", "p_mean"]
)
print("Static attribute data:")
print(attr_data)
# 4. Read dynamic (time-series) data for the same basins
# Note: We use standardized names like 'streamflow' and 'precipitation'
ts_data = ds.read_ts_xrdataset(
gage_id_lst=basin_ids[:2],
t_range=["1990-01-01", "1995-12-31"],
var_lst=["streamflow", "precipitation"]
)
print("Time-series data:")
print(ts_data)
```

A key feature of the new architecture is the use of standardized variable names. This allows you to use the same variable name to fetch the same type of data across different datasets, without needing to know the specific internal naming scheme of each one.
For example, you can get streamflow from both CAMELS-US and CAMELS-AUS using the same variable name:
```python
# us_ds and aus_ds are CamelsUs and CamelsAus instances created as in the quickstart above

# Get streamflow from CAMELS-US
us_ds.read_ts_xrdataset(gage_id_lst=["01013500"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])

# Get streamflow from CAMELS-AUS
aus_ds.read_ts_xrdataset(gage_id_lst=["A4260522"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])
```

Similarly, you can use `precipitation`, `temperature_max`, etc., across datasets. A comprehensive list of these standardized names and their coverage across all datasets is in progress and will be published soon.
hydrodataset currently provides unified access to 27 hydrological datasets across the globe. Below is a summary of all supported datasets:
| Dataset Name | Paper | Temporal Resolution | Data Version | Region | Basins | Time Span | Release Date | Size |
|---|---|---|---|---|---|---|---|---|
| BULL | Paper / Code | Daily | Version 3 (code) / Version 2 (data) | Spain | 484 | 1951-01-02 to 2021-12-31 | 2024-03-10 | 2.2G |
| CAMELS-AUS | Paper (V1) / Paper (V2) | Daily | Version 1 / Version 2 | Australia | 561 | 1950-01-01 to 2022-03-31 | 2024-12 | 2.1G |
| CAMELS-BR | Paper | Daily | Version 1.2 / Version 1.1 | Brazil | 897 | 1980-01-01 to 2024-10-22 | 2025-03-21 | 1.4G |
| CAMELS-CH | Paper | Daily | Version 0.9 / Version 0.6 | Switzerland | 331 | 1981-01-01 to 2020-12-31 | 2025-03-14 | 793.1M |
| CAMELS-CL | Paper | Daily | Dataset | Chile | 516 | 1913-02-15 to 2018-03-09 | 2018-09-28 | 208M |
| CAMELS-COL | Paper | Daily | Version 2 | Colombia | 347 | 1981-05 to 2022-12 | 2025-05 | 80.9M |
| CAMELS-DE | Paper | Daily | Version 1.1 / Version 0.1 | Germany | 1582 | 1951-01-01 to 2020-12-31 | 2025-08-07 | 2.2G |
| CAMELS-DK | Paper | Daily | Version 6.0 | Denmark | 304 | 1989-01-02 to 2023-12-31 | 2025-02-14 | 1.41G |
| CAMELS-FI | Meeting | Yearly/Daily | Version 1.0.1 | Finland | 320 | 1961-01-01 to 2023-12-31 | 2025-07 | 382M |
| CAMELS-FR | Paper | Daily/Monthly/Yearly | Version 3.2 / Version 3 | France | 654 | 1970-01-01 to 2021-12-31 | 2025-08-12 | 364M |
| CAMELS-GB | Paper | Daily | Dataset | United Kingdom | 671 | 1970-10-01 to 2015-09-30 | 2025-05 (new data link) | 244M |
| CAMELS-IND | Paper | Daily | Version 2.2 | India | 472 (242 sufficient flow) | 1980-01-01 to 2020-12-31 | 2025-03-13 | 529.4M |
| CAMELS-LUX | Paper | Hourly/Daily | Version 1.1 | Luxembourg | 56 | 2004-11-01 to 2021-10-31 | 2024-09-27 | 1.4G |
| CAMELS-NZ | Paper | Hourly/Daily | Version 2 / Version 1 | New Zealand | 369 | 1972-01-01 to 2024-08-02 | 2025-08-05 | 4.81G |
| CAMELS-SE | Paper | Daily | Version 1 | Sweden | 50 | 1961-2020 | 2024-02 | 16.19M |
| CAMELS-US | Paper | Daily | Version 1.2 | United States | 671 | 1980-2014 | 2022-06-24 | 14.6G |
| CAMELSH-KR | - | Hourly | Version 1 | South Korea | 178 | 2000-2019 | 2025-03-23 | 3.1G |
| CAMELSH | Paper | Hourly | Version 6 + 3 + 2 | United States | 9008 | 1980-2024 | 2025-08-14 | 4.2G+3.57G+2.18G |
| Caravan-DK | Paper | Daily | Version 7 / Version 5 | Denmark | 308 | 1981-01-02 to 2020-12-31 | 2025-04-11 | 521.6M |
| Caravan | Paper / Code | Daily | Version 1.6 | Global | 16299 | 1950-2023 | 2025-05 | 24.8G |
| EStream | Paper / Code | Daily (weekly, monthly, yearly available) | Version 1.3 / Version 1.1 | Europe | 17130 | 1950-01-01 to 2023-06-30 | 2025-06-30 | 12.3G |
| GRDC-Caravan | Paper | Daily | Version 0.6 / Version 0.2 | Global | 5357 | 1950-2023 | 2025-05-06 | 16.4G |
| HYPE | Paper (draft) | Daily/Monthly/Yearly | Version 1.1 | Costa Rica | 605 | 1985-01-01 to 2019-12-31 | 2020-09-14 | 616.5M |
| HYSETS | Paper / Code | Daily | Dataset (dynamic attributes) | North America | 14425 | 1950-01-01 to 2023-12-31 | 2024-09 | 41.9G |
| LamaH-CE | Paper | Daily/Hourly | Version 1.0 | Central Europe | 859 | 1981-01-01 to 2019-12-31 | 2021-08-02 | 16.3G |
| LamaH-Ice | Paper | Daily/Hourly | Version 1.5 / old version | Iceland | 111 | 1950-01-01 to 2021-12-31 | 2025-08-12 | 9.6G |
| Simbi | Paper | Daily/Monthly | Version 6.0 | Haiti | 24 | 1920-01-01 to 2005-12-31 | 2024-07-02 | 125M |
Access any dataset using the same method calls:
```python
# Same API works for all datasets
ds.read_object_ids()           # Get basin IDs
ds.read_attr_xrdataset(...)    # Read attributes
ds.read_ts_xrdataset(...)      # Read time series
```

First access processes and caches data as NetCDF files. All subsequent reads are instant:
- Time-series data: `{dataset}_timeseries.nc`
- Attribute data: `{dataset}_attributes.nc`
- Cache location configured via `~/hydro_setting.yml`
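Because the cache is plain NetCDF, you can also open it directly with xarray when needed (a sketch; the exact file name, e.g. `camels_us_timeseries.nc`, is assumed for illustration):

```python
import xarray as xr

# Open a cached time-series file directly (file name assumed for illustration)
ts = xr.open_dataset("camels_us_timeseries.nc")
print(ts)
```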
Use common names across all datasets:
- `streamflow` - River discharge
- `precipitation` - Rainfall
- `temperature_max` / `temperature_min` - Temperature extremes
- `potential_evapotranspiration` - PET
- And many more...
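For example, several standardized names can be combined in a single request (a sketch reusing the `ds` and `basin_ids` objects from the quickstart above):

```python
# Request several standardized variables at once; hydrodataset maps each name
# to the dataset's internal variable behind the scenes
met_data = ds.read_ts_xrdataset(
    gage_id_lst=basin_ids[:2],
    t_range=["1990-01-01", "1995-12-31"],
    var_lst=["precipitation", "temperature_max", "temperature_min", "potential_evapotranspiration"],
)
print(met_data)
```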
All data is returned as `xarray.Dataset` objects:
- Labeled dimensions and coordinates
- Built-in metadata and units
- Easy slicing, selection, and computation
- Compatible with Dask for large datasets
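For instance, standard xarray operations apply directly to the returned objects (a sketch using `ts_data` and `basin_ids` from the quickstart; the dimension names `basin` and `time` are assumptions):

```python
# Select one basin's streamflow and average it over the whole period
# ("basin" is an assumed dimension name)
basin_flow = ts_data["streamflow"].sel(basin=basin_ids[0])
print(basin_flow.mean().item())

# Slice a time window across all selected basins ("time" is an assumed dimension name)
winter = ts_data.sel(time=slice("1990-12-01", "1991-02-28"))
print(winter)
```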
The new, unified API architecture is currently in active development.
- Current Implementation: hydrodataset provides access to 27 hydrological datasets (see the Supported Datasets table above). The new unified architecture based on the `HydroDataset` base class has been fully implemented and tested for the `camels_us` and `camels_aus` datasets, which serve as reference implementations.
- In Progress: We are in the process of migrating all other datasets supported by the library to this new architecture.
- Release Schedule: We plan to release new versions frequently in the short term as more datasets are integrated. Please check back for updates.
This package was created with Cookiecutter and the giswqs/pypackage project template. Data fetching and reading are now powered by AquaFetch.