v007 cloud read parameters and libraries #675

rwegener2 · 2025-04-28T17:39:22Z

rwegener2
Apr 28, 2025
Maintainer

Overview

I looked into ATL03 read times times using a combination of cloud optimized parameters for the v007 data suggested in this whitepaper using both xarray and h5py.

Suggested parameters are given in the whitepaper for the h5py and fsspec libraries, but it appears that cloud reads for these files are fastest using only the fsspec parameters. h5py parameters make a very small change.
xarray and h5py have comparable read times across file sizes

Methods & Results

Four v007 ATL03 files were compared for read speeds using one group of data (gt1l). The h_ph variable as well as relevant coordinate variables (lat_ph, lon_ph, and delta_time) were read. Times shown are the mean time of 6 independent reads. Error bars show the standard deviation of those times. All files were read 7 times. The first read was not included in the averages because it often was much slower than the subsequent reads (often 2x slower), likely due to an optimization technique from s3 that speeds up sequential reads.

"both" indicates that both h5py and fsspec cloud optimized parameters were used (below). "neither" indicates that neither h5py nor fsspec cloud optimized parameters were used, so both libraries were using the default params. "h5py only" and "fsspec only" indicates that just one of the library's cloud optimized parameters were used while the other used the default parameters.

Overall, we see that:

Although h5py is slightly faster, xarray and h5py have similar read times across file sizes
The h5py optimized parameters make very little difference in read times, and in some cases make reads slower. The real speedup lies in the fsspec parameters.

Parameters

The cloud-optimized parameters used were:

h5py_params = {
      "page_buf_size": 16*1024*1024,
      "rdcc_nbytes": 4*1024*1024
 }
fsspec_params = {
      "cache_type": "blockcache", 
      "block_size": 8*1024*1024
}

The high level effect of each parameter is:

page_buf_size (h5py): The size of the pages in the dataset
rdcc_nbytes (h5py): How many bytes of data to keep in the cache
cache_type (fsspec): Download and cache data chunks from the file for caching (not, for example, the whole file)
block_size (fsspec): How much data to request at once for buffering

Data Sizes and Granules

File Name	Total File Size	Amount of Data Read*	Granule ID
small	184 MB	~70MB	`ATL03_20190613013940_11570313_007_02.h5`
medium	752 MB	~430 MB	`ATL03_20190613055526_11600309_007_02.h5`
mediumlarge	3.82 GB	~2.4 GB	`ATL03_20190613070708_11610306_007_02.h5`
large	8.07 GB	~4.1 GB	`ATL03_20190611220139_11400305_007_02.h5`

*"Amount of data read" calculated using xarray's ds.nbytes

Discussion / Some Thoughts

A conversation with @betolink yielded some helpful context for these results. There is some overlap between the services provided by h5py and fsspec, in that both libraries are trying to manage a data cache and data requests. It's not clear to me, personally, how these two interact with each other, but the slow down when using both optimized parameters makes me think there is some squalling happening between the libraries when they both try to dictate cache and request sizes. The h5py parameters may still make a difference when reading local data, however.

fsspec's blocksize is the critical parameter for optimizing reads in this test. cache_type should not actually have had any effect on this test, since caching would speed up read times on multiple successive reads (not the case here). @betolink recommended 8 MB as the blocksize because it matches the page size that the HDF datafiles were created with.

I was happy to see that the read times between h5py and xarray are comparable. The use of a higher level xarray data structure does not result in substantially slower reads, which luckily means that we aren't sacrificing usability of the output data structure for read speeds.

Actionable Takeaway

I think that icepyx should use only the fsspec optimized parameters when reading cloud data. The h5py optimized parameters should not be included for the benefit of the cloud optimized reads. If additional read testing shows that reads are faster on local v007 data with the h5py params included then it may be worth including those parameters in icepyx.

References & Related Work

Evaluating Cloud-Optimized HDF5 for NASA’s ICESat-2 Mission (link)
s3fs Parameter Testing on HDF5 GEDI data (link)
fsspec docs page describing parameters (link)
h5py docs page describing parameters (link)

Raw values from the Figure

h5py (units = seconds)

xarray (units = seconds)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v007 cloud read parameters and libraries #675

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

v007 cloud read parameters and libraries #675

Uh oh!

Uh oh!

rwegener2 Apr 28, 2025 Maintainer

Overview

Methods & Results

Parameters

Data Sizes and Granules

Discussion / Some Thoughts

Actionable Takeaway

References & Related Work

Replies: 0 comments

rwegener2
Apr 28, 2025
Maintainer