v007 cloud read parameters and libraries #675
rwegener2 started this conversation in Show and tell
Overview
I looked into ATL03 read times using a combination of the cloud-optimized parameters for the v007 data suggested in this whitepaper, using both xarray and h5py.
- `xarray` and `h5py` have comparable read times across file sizes

Methods & Results
Four v007 ATL03 files were compared for read speeds using one group of data (`gt1l`). The `h_ph` variable as well as the relevant coordinate variables (`lat_ph`, `lon_ph`, and `delta_time`) were read. Each file was read 7 times; the first read was excluded from the averages because it was often much slower than the subsequent reads (often 2x slower), likely due to an s3 optimization that speeds up sequential reads. Times shown are the mean of the remaining 6 independent reads, and error bars show the standard deviation of those times.

"both" indicates that both the h5py and the fsspec cloud-optimized parameters were used (listed below). "neither" indicates that neither the h5py nor the fsspec cloud-optimized parameters were used, so both libraries ran with their default parameters. "h5py only" and "fsspec only" indicate that just one library's cloud-optimized parameters were used while the other kept its defaults.
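As a rough sketch of that timing protocol (the `read_fn` callable here is a placeholder for any of the h5py or xarray reads of the four variables):

```python
import time

import numpy as np

def time_reads(read_fn, n_reads=7):
    """Time n_reads calls of read_fn, discarding the first (warm-up) read."""
    times = []
    for _ in range(n_reads):
        t0 = time.perf_counter()
        read_fn()
        times.append(time.perf_counter() - t0)
    trials = times[1:]  # the first read is often ~2x slower than the rest
    return float(np.mean(trials)), float(np.std(trials))
```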
Overall, we see that:

- `fsspec`'s `block_size` is the critical parameter for fast cloud reads in this test.
- Using both libraries' cloud-optimized parameters at once was slower than using `fsspec`'s alone.
- Read times for `xarray` and `h5py` are comparable in every configuration.
Parameters
The cloud-optimized parameters used were:
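In sketch form (the S3 path is a placeholder, 8 MB is the block/page size discussed below, and the real benchmark code may differ):

```python
import fsspec
import h5py

MB = 1024 * 1024

# fsspec: cache downloaded chunks of the file (not the whole file),
# requesting 8 MB per range request.
fs = fsspec.filesystem("s3")
f = fs.open("s3://<bucket>/ATL03_20190613013940_11570313_007_02.h5",
            cache_type="blockcache", block_size=8 * MB)

# h5py: page buffer sized to the file's 8 MB pages, plus a larger chunk cache.
h5f = h5py.File(f, mode="r", page_buf_size=8 * MB, rdcc_nbytes=8 * MB)
```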
The high-level effect of each parameter is:

- `page_buf_size` (h5py): the size of the pages in the dataset
- `rdcc_nbytes` (h5py): how many bytes of data to keep in the cache
- `cache_type` (fsspec): download and cache data chunks from the file (not, for example, the whole file)
- `block_size` (fsspec): how much data to request at once for buffering

Data Sizes and Granules
- ATL03_20190613013940_11570313_007_02.h5
- ATL03_20190613055526_11600309_007_02.h5
- ATL03_20190613070708_11610306_007_02.h5
- ATL03_20190611220139_11400305_007_02.h5

*"Amount of data read" calculated using xarray's `ds.nbytes`
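For reference, a sketch of that calculation (assuming a local copy of one granule; `gt1l/heights` is the ATL03 group holding these variables):

```python
import xarray as xr

# Sum the in-memory size of the four variables that were read.
ds = xr.open_dataset("ATL03_20190613013940_11570313_007_02.h5",
                     group="gt1l/heights", engine="h5netcdf",
                     phony_dims="sort")
subset = ds[["h_ph", "lat_ph", "lon_ph", "delta_time"]]
print(f"amount of data read: {subset.nbytes / 1e6:.1f} MB")
```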
Discussion / Some Thoughts
A conversation with @betolink yielded some helpful context for these results.

- There is some overlap between the services provided by `h5py` and `fsspec`, in that both libraries are trying to manage a data cache and data requests. It's not clear to me, personally, how the two interact, but the slowdown when using both sets of optimized parameters makes me think there is some squabbling between the libraries when they both try to dictate cache and request sizes. The h5py parameters may still make a difference when reading local data, however.
- `fsspec`'s `block_size` is the critical parameter for optimizing reads in this test. `cache_type` should not actually have had any effect here, since caching only speeds up multiple successive reads (not the case in this test). @betolink recommended 8 MB as the block size because it matches the page size that the HDF5 data files were created with.
- I was happy to see that the read times between h5py and xarray are comparable. Using the higher-level xarray data structure does not result in substantially slower reads, which luckily means we aren't sacrificing the usability of the output data structure for read speed.
Actionable Takeaway

For cloud reads of v007 ATL03, set `fsspec`'s `block_size` to match the file's 8 MB HDF5 page size and leave h5py's parameters at their defaults.
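A hedged end-to-end example of that configuration (placeholder bucket; the group and variable names match the test above):

```python
import fsspec
import xarray as xr

MB = 1024 * 1024

# fsspec-only optimization: 8 MB blocks to match the HDF5 page size;
# h5py/xarray keep their default parameters.
fs = fsspec.filesystem("s3")
with fs.open("s3://<bucket>/ATL03_20190611220139_11400305_007_02.h5",
             cache_type="blockcache", block_size=8 * MB) as f:
    ds = xr.open_dataset(f, group="gt1l/heights",
                         engine="h5netcdf", phony_dims="sort")
    h_ph = ds["h_ph"].load()
```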
References & Related Work
- `fsspec` docs page describing parameters (link)
- `h5py` docs page describing parameters (link)

Raw values from the Figure
h5py (units = seconds)

xarray (units = seconds)
