VTools 3.0 Development Notes
- Move to pandas/xarray
- Focus on netCDF and CSV as data storage. De-emphasize DSS.
- Reduce functionality where idiomatic pandas does certain things better, but keep things like period_op where the backward-compatible or preferable behavior is different.
- Move to Python 3.
- vtools2 refers to the existing implementation of vtools, including its hand-rolled time series data structures. It is implemented in Python 2.
- vtools3 refers to the implementation that switches entirely to pandas/xarray data structures as the main representation for time series. The goal is a stripped-down package: we will re-implement only our most important capabilities and convenience functions, and one of the goals here is to identify what those are.
- We use pandas Timestamp. For internal uses, the integer constructor Timestamp(2009,3,4) is used.
- In storage we will prefer the near-ISO format: 2009-02-10T00:00
- The initial implementation is pandas dataframe-centric, but ultimately a lot of our helpers will have to be migrated to xarray and dask to work on big data like output from the 3D SCHISM model.
- The helper functions like hours(3) and days(4) have been retained. As much as possible these attempt to produce the data types that pandas considers "exact" so that you can do basic math operations on them (see the sketch below).
- Parsed strings should follow the pd.tseries.offsets standards (15min, 1H, 1D, 1M). No distinct vtools nomenclature.
- The key sticking point here is the concept of a day. pd.tseries.offsets.Day is fine, and it is the basis for the days(nday) factory function. Importantly, a generic DateOffset is not equivalent because of daylight-saving (DST) quirks that we don't care about, and that kind of offset doesn't seem to support math (division of one interval by another). For this reason, legal "intervals" for functions based on regular time, like tidal filtration, will be based on pd.tseries.offsets.Day. Even better would be to avoid days entirely and use minutes and hours.
- xarray and pd.DataFrames get built with pd.tseries.offsets as their "freq".
```python
ts2 = resample(ts1, interval)         # interval will become a DateOffset
ts2 = cosine_lanczos(ts1, hours(40))
```
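A minimal sketch of how the interval helpers are expected to behave, assuming hours(), days() and minutes() are thin wrappers over pandas Tick offsets (the actual vtools3 signatures may differ):

```python
import pandas as pd

# Assumed behavior of the vtools3 helpers: thin wrappers over pandas "exact"
# (Tick) offsets so that interval arithmetic is well defined.
def hours(n):
    return pd.tseries.offsets.Hour(n)

def days(n):
    return pd.tseries.offsets.Day(n)

def minutes(n):
    return pd.tseries.offsets.Minute(n)

stamp = pd.Timestamp(2009, 3, 4)      # integer constructor used internally
later = stamp + hours(40)             # offsets add cleanly to Timestamps

# Tick offsets expose their length in nanoseconds, so division of one
# interval by another is possible:
ratio = hours(40).nanos / minutes(15).nanos   # -> 160.0

# Offset aliases follow pandas conventions (15min, 1H, 1D), no vtools-specific names:
assert pd.tseries.frequencies.to_offset("15min") == minutes(15)
```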
Contenders for the main backend are pandas DataFrame (and Series) and xarray.DataArray.
- pd.DataFrame seems to be the natural analog for CSV tables (data that has been reduced to 2D) and DataArray for multidimensional data.
- The functions API should work interoperably on these, provided the data structure has a "time" dimension, giving good multidimensional results as far as scaling will allow. Currently it doesn't: only pandas works, and even then only partially (sketched below).
- Mostly we should follow CF conventions for things like units. It seems like only xarray has an adequate metadata slot?
- The CF conventions leave a couple of items not well defined for our purposes:
- sampling interval for data that are advertised as "regular"
- the CF way of expressing cell averages with cell_methods is a bit clunky for our main case, which is averages over prescribed regular intervals. In this case it is nice to just say whether the stamp is at the beginning or end of the interval and skip the cell boundaries attribute. We need to figure out how longer averages are typically done; I tend to favor stamping January data on January 1 (sketched below), but we have to test converters carefully because this could cause a lot of issues with DSS data.
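For the stamping question, a hedged example of what a period average built on pandas resample could look like, stamping each average at the beginning of its interval so that January's mean lands on January 1 (illustrative, not the final period_op behavior):

```python
import numpy as np
import pandas as pd

times = pd.date_range("2009-01-01", "2009-03-31 23:00", freq="1H")
ts = pd.Series(np.random.rand(len(times)), index=times)

# Daily averages stamped at the start of each day (period-beginning convention):
daily = ts.resample("1D", closed="left", label="left").mean()

# Monthly averages stamped on the first of the month ("MS" = month start),
# skipping the CF cell boundaries attribute entirely:
monthly = ts.resample("MS").mean()
```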
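And a sketch of what "works on anything with a time dimension" could mean for the functions API mentioned above; the dispatch logic and the time_window helper are illustrative, not part of vtools3:

```python
import numpy as np
import pandas as pd
import xarray as xr

def time_window(data, start, end):
    """Illustrative helper that slices any supported structure on its time axis."""
    if isinstance(data, (pd.Series, pd.DataFrame)):
        return data.loc[start:end]                  # DatetimeIndex assumed
    if isinstance(data, (xr.DataArray, xr.Dataset)):
        return data.sel(time=slice(start, end))     # "time" dimension assumed
    raise TypeError("expected a pandas or xarray time series")

times = pd.date_range("2009-02-10T00:00", periods=96, freq="15min")
df = pd.DataFrame({"flow": np.arange(96.0)}, index=times)
da = xr.DataArray(np.arange(96.0), coords={"time": times}, dims=["time"], name="flow")

sub_df = time_window(df, "2009-02-10T06:00", "2009-02-10T12:00")
sub_da = time_window(da, "2009-02-10T06:00", "2009-02-10T12:00")
```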
Data Sources
Main data sources will be:
- CSV, which often converts nicely to a pandas DataFrame and doesn't carry metadata.
- netCDF, which often converts nicely to an xarray Dataset or DataArray and does have metadata (reading sketches below).
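Reading sketches for the two main sources; the file, column, and variable names below are hypothetical:

```python
import pandas as pd
import xarray as xr

# CSV -> pandas DataFrame (file and column names are hypothetical):
df = pd.read_csv("station.csv", parse_dates=["datetime"], index_col="datetime")

# netCDF -> xarray Dataset/DataArray; metadata (units, etc.) rides along in .attrs:
ds = xr.open_dataset("station.nc")
elev = ds["elev"]                      # a DataArray with its CF attributes
print(elev.attrs.get("units"))
```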
Writing 2D data to CSV with pandas is trivial and we don't have to micromanage this; just make it convenient for some of the more popular cases. We should make sure the time formats are standardized for our own work, but maybe this isn't something we need to dictate. Should we have a way of describing name-value pair or units metadata for CSV, or just agree to drop it? There is a W3C convention on it.
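Writing is equally thin. If we standardize anything it is probably just the time format, e.g. the near-ISO form noted above; the commented units header shown here is just one possible convention, not a decision:

```python
import numpy as np
import pandas as pd

times = pd.date_range("2009-02-10", periods=4, freq="1H")
df = pd.DataFrame({"flow": np.arange(4.0)}, index=times)

# Near-ISO timestamps on output; pandas handles the rest.
df.to_csv("station_out.csv", date_format="%Y-%m-%dT%H:%M", float_format="%.3f")

# One possible (undecided) way to carry units metadata: comment lines before the table.
with open("station_units.csv", "w") as f:
    f.write("# units: flow=cfs\n")
    df.to_csv(f, date_format="%Y-%m-%dT%H:%M")
```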
Use cases:
- DSS data: derived from dumped DSS files. Decide handling of period ops and timestamp conventions.
- SCHISM station output
- All the read_ts() formats (NOAA, USGS, CDEC, etc.); most can be re-handled with pd.read_csv. Should we continue with the sniffers or just expect the reader to know what they are loading? (See the sketch after this list.)
- DSM2 outputs?
- netCDF files:
  a. SCHISM atmospheric
  b. UGRID? Or should we punt to other people's tools?
  c. Univariate series written using templated metadata so that we don't have to write every little detail concerning, say, DWR- or USGS-generated metadata.
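As an example of re-handling a read_ts() format directly with pd.read_csv rather than a sniffer, a minimal per-format reader might look like the sketch below (the function name and column layout are hypothetical; each agency format would get its own small reader):

```python
import pandas as pd

def read_station_csv(path, comment="#"):
    """Minimal reader for a simple station CSV: a datetime column plus value columns.
    The caller is expected to know the format; no sniffing is attempted."""
    df = pd.read_csv(
        path,
        comment=comment,          # skip metadata/header comment lines
        parse_dates=[0],          # first column holds the timestamps
        index_col=0,
    )
    df.index.name = "time"
    return df
```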