A Python package for accessing hydrological datasets with a unified API, optimized for deep learning workflows.
- **Unified Interface**: Consistent API across 27 hydrological datasets
- **Fast Access**: NetCDF caching for instant data loading
- **Standardized Variables**: Common naming across all datasets
- **Built on AquaFetch**: Powered by the comprehensive AquaFetch backend
- **ML-Ready**: Optimized for integration with torchhydro
This library has been redesigned to serve as a powerful data-adapting layer on top of the AquaFetch package.
While AquaFetch handles the complexities of downloading and reading numerous public hydrological datasets, hydrodataset takes the next step: it standardizes this data into a clean, consistent NetCDF (.nc) format. This format is specifically optimized for seamless integration with hydrological modeling libraries like torchhydro.
The core workflow is:
- **Fetch**: Use a `hydrodataset` class for a specific dataset (e.g., `CamelsAus`).
- **Standardize**: It uses `AquaFetch` as the primary backend for fetching raw data, while maintaining a consistent, unified interface across all datasets.
- **Cache**: On the first run, `hydrodataset` processes the data into an `xarray.Dataset` and saves it as `.nc` files (time series and attributes stored separately) in a local directory specified in `hydro_setting.yml` in your home directory, as sketched below.
- **Access**: All subsequent data requests are read directly from the fast `.nc` cache, giving you analysis-ready data instantly.
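For reference, here is a minimal `hydro_setting.yml` sketch (the paths are illustrative Linux-style examples; a Windows example appears in the quickstart below):

```yaml
local_data_path:
  root: '/home/user/data/waterism'                              # root data directory
  datasets-origin: '/home/user/data/waterism/datasets-origin'   # raw downloaded datasets
  cache: '/home/user/data/waterism/cache'                       # NetCDF cache location
```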
We strongly recommend using a virtual environment to manage dependencies.
We recommend using uv for fast, reliable package and environment management:
```bash
# Install uv if you haven't already
pip install uv

# Install hydrodataset with uv
uv pip install hydrodataset
```

For more advanced usage or to work on the project locally:
```bash
# Clone the repository
git clone https://github.com/OuyangWenyu/hydrodataset.git
cd hydrodataset

# Create virtual environment and install all dependencies
uv sync --all-extras
```

The `--all-extras` flag installs the base dependencies plus all optional dependencies for development and documentation.
If you prefer traditional pip:
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install hydrodataset
```

The primary goal of hydrodataset is to provide a simple, unified API for accessing various hydrological datasets. Here's a complete example showing the core workflow:
⚠️ **Important Note on First-Time Data Download**

If you haven't pre-downloaded the datasets, the first access will trigger automatic downloads via AquaFetch, which can take considerable time depending on dataset size:
- Small datasets (< 1GB, e.g., CAMELS-CL, CAMELS-COL): ~10-30 minutes
- Medium datasets (1-5GB, e.g., CAMELS-AUS, CAMELS-BR): ~30 minutes to 1 hour
- Large datasets (10-20GB, e.g., CAMELS-US, LamaH-CE): ~1-3 hours
- Very large datasets (> 30GB, e.g., HYSETS): ~3-6 hours or more
Download times vary based on your internet connection speed and server availability.
We strongly recommend downloading datasets manually during off-peak hours if possible.
After the initial download, all subsequent access will be fast thanks to NetCDF caching.
```python
from hydrodataset.camels_us import CamelsUs
from hydrodataset import SETTING
# All datasets are expected to be in the directory defined in your hydro_setting.yml
# An example of hydro_setting.yml on Windows looks like this:
# local_data_path:
# root: 'D:\data\waterism' # Update with your root data directory
# datasets-origin: 'D:\data\waterism\datasets-origin'
# cache: 'D:\data\waterism\cache'
data_path = SETTING["local_data_path"]["datasets-origin"]
# Initialize the dataset class
ds = CamelsUs(data_path)
# 1. Check which features are available
print("Available static features:")
print(ds.available_static_features)
print("Available dynamic features:")
print(ds.available_dynamic_features)
# 2. Get a list of all basin IDs
basin_ids = ds.read_object_ids()
# 3. Read static (attribute) data for a subset of basins
# Note: We use standardized names like 'area' and 'p_mean'
attr_data = ds.read_attr_xrdataset(
gage_id_lst=basin_ids[:2],
var_lst=["area", "p_mean"]
)
print("Static attribute data:")
print(attr_data)
# 4. Read dynamic (time-series) data for the same basins
# Note: We use standardized names like 'streamflow' and 'precipitation'
ts_data = ds.read_ts_xrdataset(
gage_id_lst=basin_ids[:2],
t_range=["1990-01-01", "1995-12-31"],
var_lst=["streamflow", "precipitation"]
)
print("Time-series data:")
print(ts_data)
```

A key feature of the new architecture is the use of standardized variable names. This allows you to use the same variable name to fetch the same type of data across different datasets, without needing to know the specific internal naming scheme of each one.
For example, you can get streamflow from both CAMELS-US and CAMELS-AUS using the same variable name:
```python
# us_ds and aus_ds are CamelsUs and CamelsAus instances created as in the quickstart above

# Get streamflow from CAMELS-US
us_ds.read_ts_xrdataset(gage_id_lst=["01013500"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])

# Get streamflow from CAMELS-AUS
aus_ds.read_ts_xrdataset(gage_id_lst=["A4260522"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])
```

Similarly, you can use `precipitation`, `temperature_max`, etc., across datasets. A comprehensive list of these standardized names and their coverage across all datasets is in progress and will be published soon.
hydrodataset currently provides unified access to 27 hydrological datasets across the globe. Below is a summary of all supported datasets:
| Dataset Name | Paper | Temporal Resolution | Data Version | Region | Basins | Time Span | Release Date | Size |
|---|---|---|---|---|---|---|---|---|
| BULL | Paper / Code | Daily | Version 3 (code) / Version 2 (data) | Spain | 484 | 1951-01-02 to 2021-12-31 | 2024-03-10 | 2.2G |
| CAMELS-AUS | Paper (V1) / Paper (V2) | Daily | Version 1 / Version 2 | Australia | 561 | 1950-01-01 to 2022-03-31 | 2024-12 | 2.1G |
| CAMELS-BR | Paper | Daily | Version 1.2 / Version 1.1 | Brazil | 897 | 1980-01-01 to 2024-10-22 | 2025-03-21 | 1.4G |
| CAMELS-CH | Paper | Daily | Version 0.9 / Version 0.6 | Switzerland | 331 | 1981-01-01 to 2020-12-31 | 2025-03-14 | 793.1M |
| CAMELS-CL | Paper | Daily | Dataset | Chile | 516 | 1913-02-15 to 2018-03-09 | 2018-09-28 | 208M |
| CAMELS-COL | Paper | Daily | Version 2 | Colombia | 347 | 1981-05 to 2022-12 | 2025-05 | 80.9M |
| CAMELS-DE | Paper | Daily | Version 1.1 / Version 0.1 | Germany | 1582 | 1951-01-01 to 2020-12-31 | 2025-08-07 | 2.2G |
| CAMELS-DK | Paper | Daily | Version 6.0 | Denmark | 304 | 1989-01-02 to 2023-12-31 | 2025-02-14 | 1.41G |
| CAMELS-FI | Meeting | Yearly/Daily | Version 1.0.1 | Finland | 320 | 1961-01-01 to 2023-12-31 | 2025-07 | 382M |
| CAMELS-FR | Paper | Daily/Monthly/Yearly | Version 3.2 / Version 3 | France | 654 | 1970-01-01 to 2021-12-31 | 2025-08-12 | 364M |
| CAMELS-GB | Paper | Daily | Dataset | United Kingdom | 671 | 1970-10-01 to 2015-09-30 | 2025-05 (new data link) | 244M |
| CAMELS-IND | Paper | Daily | Version 2.2 | India | 472 (242 sufficient flow) | 1980-01-01 to 2020-12-31 | 2025-03-13 | 529.4M |
| CAMELS-LUX | Paper | Hourly/Daily | Version 1.1 | Luxembourg | 56 | 2004-11-01 to 2021-10-31 | 2024-09-27 | 1.4G |
| CAMELS-NZ | Paper | Hourly/Daily | Version 2 / Version 1 | New Zealand | 369 | 1972-01-01 to 2024-08-02 | 2025-08-05 | 4.81G |
| CAMELS-SE | Paper | Daily | Version 1 | Sweden | 50 | 1961-2020 | 2024-02 | 16.19M |
| CAMELS-US | Paper | Daily | Version 1.2 | United States | 671 | 1980-2014 | 2022-06-24 | 14.6G |
| CAMELSH-KR | - | Hourly | Version 1 | South Korea | 178 | 2000-2019 | 2025-03-23 | 3.1G |
| CAMELSH | Paper | Hourly | Version 6 + 3 + 2 | United States | 9008 | 1980-2024 | 2025-08-14 | 4.2G+3.57G+2.18G |
| Caravan-DK | Paper | Daily | Version 7 / Version 5 | Denmark | 308 | 1981-01-02 to 2020-12-31 | 2025-04-11 | 521.6M |
| Caravan | Paper / Code | Daily | Version 1.6 | Global | 16299 | 1950-2023 | 2025-05 | 24.8G |
| EStream | Paper / Code | Daily (weekly, monthly, yearly available) | Version 1.3 / Version 1.1 | Europe | 17130 | 1950-01-01 to 2023-06-30 | 2025-06-30 | 12.3G |
| GRDC-Caravan | Paper | Daily | Version 0.6 / Version 0.2 | Global | 5357 | 1950-2023 | 2025-05-06 | 16.4G |
| HYPE | Paper (draft) | Daily/Monthly/Yearly | Version 1.1 | Costa Rica | 605 | 1985-01-01 to 2019-12-31 | 2020-09-14 | 616.5M |
| HYSETS | Paper / Code | Daily | Dataset (dynamic attributes) | North America | 14425 | 1950-01-01 to 2023-12-31 | 2024-09 | 41.9G |
| LamaH-CE | Paper | Daily/Hourly | Version 1.0 | Central Europe | 859 | 1981-01-01 to 2019-12-31 | 2021-08-02 | 16.3G |
| LamaH-Ice | Paper | Daily/Hourly | Version 1.5 / old version | Iceland | 111 | 1950-01-01 to 2021-12-31 | 2025-08-12 | 9.6G |
| Simbi | Paper | Daily/Monthly | Version 6.0 | Haiti | 24 | 1920-01-01 to 2005-12-31 | 2024-07-02 | 125M |
Access any dataset using the same method calls:
```python
# Same API works for all datasets
ds.read_object_ids()           # Get basin IDs
ds.read_attr_xrdataset(...)    # Read attributes
ds.read_ts_xrdataset(...)      # Read time series
```

First access processes and caches data as NetCDF files. All subsequent reads are instant:
- Time-series data: `{dataset}_timeseries.nc`
- Attribute data: `{dataset}_attributes.nc`
- Cache location configured via `~/hydro_setting.yml`
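Because the cache is plain NetCDF, you can also open it directly with xarray when needed (a sketch; the exact file name, e.g. `camels_us_timeseries.nc`, is assumed for illustration):

```python
import xarray as xr

# Open a cached time-series file directly (file name assumed for illustration)
ts = xr.open_dataset("camels_us_timeseries.nc")
print(ts)
```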
Use common names across all datasets:
- `streamflow` - River discharge
- `precipitation` - Rainfall
- `temperature_max` / `temperature_min` - Temperature extremes
- `potential_evapotranspiration` - PET
- And many more...
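For example, several standardized names can be combined in a single request (a sketch reusing the `ds` and `basin_ids` objects from the quickstart above):

```python
# Request several standardized variables at once; hydrodataset maps each name
# to the dataset's internal variable behind the scenes
met_data = ds.read_ts_xrdataset(
    gage_id_lst=basin_ids[:2],
    t_range=["1990-01-01", "1995-12-31"],
    var_lst=["precipitation", "temperature_max", "temperature_min", "potential_evapotranspiration"],
)
print(met_data)
```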
All data is returned as `xarray.Dataset` objects:
- Labeled dimensions and coordinates
- Built-in metadata and units
- Easy slicing, selection, and computation
- Compatible with Dask for large datasets
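For instance, standard xarray operations apply directly to the returned objects (a sketch using `ts_data` and `basin_ids` from the quickstart; the dimension names `basin` and `time` are assumptions):

```python
# Select one basin's streamflow and average it over the whole period
# ("basin" is an assumed dimension name)
basin_flow = ts_data["streamflow"].sel(basin=basin_ids[0])
print(basin_flow.mean().item())

# Slice a time window across all selected basins ("time" is an assumed dimension name)
winter = ts_data.sel(time=slice("1990-12-01", "1991-02-28"))
print(winter)
```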
The new, unified API architecture is currently in active development.
- Current Implementation: hydrodataset provides access to 27 hydrological datasets (see the Supported Datasets table above). The new unified architecture based on the `HydroDataset` base class has been fully implemented and tested for the `camels_us` and `camels_aus` datasets, which serve as reference implementations.
- In Progress: We are in the process of migrating all other datasets supported by the library to this new architecture.
- Release Schedule: We plan to release new versions frequently in the short term as more datasets are integrated. Please check back for updates.
This package was created with Cookiecutter and the giswqs/pypackage project template. Data fetching and reading are now powered by AquaFetch.