
Design of the data layer

DataHandle is provided to a Model at runtime

A data handle hides some details from a model: inputs are inputs, regardless of their source (scenario data or another model's outputs); outer iterations (decision/convergence) are hidden; and the current/previous/base timesteps are exposed as a convenience.

Input data and parameters both deal with the DataArray abstraction over multi-dimensional inputs.

State and interventions deal with the decisions available to decision/optimisation modules.

current_timestep: int
previous_timestep: int
base_timestep: int
timesteps: list[int]

get_data(input_name, timestep=None): DataArray
get_base_timestep_data(input_name): DataArray
get_previous_timestep_data(input_name): DataArray

get_parameters(): dict[str, DataArray]
get_parameter(parameter_name): DataArray

get_state(): list[BuildInstruction]
get_current_interventions(): list[Intervention]

set_results(output_name, data)
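
For illustration, a minimal sketch of a model's simulate step using the handle; the input, output and parameter names are hypothetical, and the .data attribute access on DataArray is an assumption:

def simulate(data_handle):
    # Read this timestep's input, regardless of whether it comes
    # from scenario data or another model's output
    population = data_handle.get_data('population')

    # Convenience access to the base timestep's value of the same input
    base_population = data_handle.get_base_timestep_data('population')

    # Read a parameter as a DataArray
    growth_rate = data_handle.get_parameter('growth_rate')

    # ... model calculation (illustrative only) ...
    demand = population.data * growth_rate.data

    # Write results back through the handle
    data_handle.set_results('water_demand', demand)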

A ResultsHandle is available to decision/optimisation modules, with a different level of access: results from any model/output, from any timestep/decision iteration that has already run, but no write access and no parameter/scenario data access.

get_results(model_name, output_name, timestep, decision_iteration): DataArray
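
For example, a decision module might read results from an earlier iteration to score candidate interventions; the model and output names here are hypothetical:

def get_decision(results_handle):
    # Read-only access to any results that have already been produced
    cost = results_handle.get_results(
        'energy_supply', 'total_cost', timestep=2020, decision_iteration=0)
    # ... compare candidate interventions against the observed cost ...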

A Store interface provides lower-level methods

A Store handles data to configure, set up and execute model runs. Public methods are listed further below. A Store might be composed of a ConfigStore and a DataStore, which might have different implementations to suit storing the various types of data.
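
A minimal sketch of that composition, assuming the split suggested above (only two delegating methods shown):

class Store:
    # Facade over configuration and data storage, which may have
    # independent implementations suited to each kind of data
    def __init__(self, config_store, data_store):
        self.config_store = config_store
        self.data_store = data_store

    def read_model_run(self, model_run_name):
        return self.config_store.read_model_run(model_run_name)

    def read_results(self, modelrun_name, model_name, output_spec,
                     timestep=None, decision_iteration=None):
        return self.data_store.read_results(
            modelrun_name, model_name, output_spec,
            timestep, decision_iteration)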

Config

Uses

  • a modeller writes configuration data to the store, directly or through the app
  • a model may reflect on configuration data to understand its inputs, outputs and parameters and their dimensions
  • smif model runner reads config to set up and run a ModelRun

Qualities

  • config typically reflects smif object structures
  • config objects often refer to others (as children or shared metadata)

Data

Uses

  • a modeller or data owner sets up input data, writing input data to the store
  • a model reads input data, writes results data, and/or reads results data produced by other models

Qualities

  • data typically has several dimensions, sometimes zero or one
  • data can often be represented in 'tidy' columnar format (see the sketch after this list)
  • metadata is sometimes shared between datasets
  • there are often several variants of the same dataset (scenarios, parameterisations)
  • data is sometimes sparse; sometimes it makes sense to use many default values and override targeted portions
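
As a sketch of the 'tidy' representation, a small two-dimensional dataset flattens to one row per combination of dimension values; the column names here are hypothetical:

import pandas as pd

# Each dimension becomes a column; the observed value is the final column
tidy = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south'],
    'year': [2015, 2020, 2015, 2020],
    'population': [1.2, 1.3, 2.1, 2.2],
})

# Pivoting recovers the dense two-dimensional array
dense = tidy.pivot(index='region', columns='year', values='population')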

Metadata

There's some tension between normalising/denormalising metadata definitions:

  • it's convenient and more immediately accessible to have self-describing data - e.g. always storing datasets which have geographical definitions together with their geometries (in shapefiles, NetCDF or other formats as appropriate)
  • where we have many variants of a given dataset, it seems wasteful of disk space to duplicate the same definitions (e.g. geometries) with every variant
  • when reconciling different model inputs and different data sources it's useful to have a clear way to point to shared definitions (or adjacent, differing definitions) of the metadata

Uses

  • a modeller or data owner sets up input data with new dimensions (set of categories, spatial zones, timestep coverage)
  • a modeller or data owner sets up new input data that should match some shared metadata

Qualities

  • typically lists with short identifiers and potentially large descriptions (see the sketch after this list)
    • spatial: vector geometries (with CRS), other attributes
    • temporal: interval definitions
    • categorical: descriptions, short/long ids, cross-references
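
For instance, a categorical dimension definition might look like the following; the structure and contents are illustrative, not a fixed schema:

economic_sector = {
    'name': 'economic_sector',
    'description': 'Economic sectors, aggregated',
    'elements': [
        {'name': 'AG', 'description': 'Agriculture, forestry and fishing'},
        {'name': 'MN', 'description': 'Manufacturing'},
        {'name': 'SV', 'description': 'Services'},
    ],
}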
Store methods

read_model_runs(): list[ModelRun]
read_model_run(model_run_name): ModelRun
write_model_run(model_run)
update_model_run(model_run_name, model_run)
delete_model_run(model_run_name)

read_sos_models(): list[SosModel]
read_sos_model(sos_model_name): SosModel
write_sos_model(sos_model)
update_sos_model(sos_model_name, sos_model)
delete_sos_model(sos_model_name)

read_sector_models(skip_coords=False): list[SectorModel]
read_sector_model(sector_model_name, skip_coords=False): SectorModel
write_sector_model(sector_model)
update_sector_model(sector_model_name, sector_model)
delete_sector_model(sector_model_name)

read_sector_model_parameter(sector_model_name, parameter_name): Spec
read_sector_model_parameter_default(sector_model_name, parameter_name): DataArray
write_sector_model_parameter_default(sector_model_name, parameter_name, data)

read_strategies(modelrun_name): list[Strategy]
write_strategies(modelrun_name, strategies)

read_interventions(sector_model_name): list[Intervention]

read_initial_conditions(sector_model_name): list[BuildInstruction]
read_all_initial_conditions(model_run_name): list[BuildInstruction]

read_state(modelrun_name, timestep, decision_iteration=None): list[Intervention]
write_state(state, modelrun_name, timestep, decision_iteration=None)

read_unit_definitions(): list[PintDefinitionString]

read_dimensions(): list[Coords]
read_dimension(dimension_name): Coords
write_dimension(dimension)
update_dimension(dimension_name, dimension)
delete_dimension(dimension_name)

read_coefficients(source_spec, destination_spec): numpy.ndarray
write_coefficients(source_spec, destination_spec, data)

read_scenarios(skip_coords=False): list[ScenarioModel]
read_scenario(scenario_name, skip_coords=False): ScenarioModel
write_scenario(scenario)
update_scenario(scenario_name, scenario)
delete_scenario(scenario_name)

read_scenario_variants(scenario_name): list[Variant]
read_scenario_variant(scenario_name, variant_name): Variant
write_scenario_variant(scenario_name, variant)
update_scenario_variant(scenario_name, variant_name, variant)
delete_scenario_variant(scenario_name, variant_name)

read_scenario_variant_data(scenario_name, variant_name, variable, timestep=None): DataArray
write_scenario_variant_data(scenario_name, variant_name, data_array, timestep=None)

read_narrative_variant_data(sos_model_name, narrative_name, variant_name, variable, timestep=None): DataArray
write_narrative_variant_data(sos_model_name, narrative_name, variant_name, data_array, timestep=None)

read_results(modelrun_name, model_name, output_spec, timestep=None, decision_iteration=None): DataArray
write_results(data_array, modelrun_name, model_name, timestep=None, decision_iteration=None)
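
A sketch of a round trip through these methods; the names, and the attribute access on ModelRun, are hypothetical:

store = Store(config_store, data_store)

# Configuration: read a model run and the system-of-systems model it uses
model_run = store.read_model_run('energy_central')
sos_model = store.read_sos_model(model_run.sos_model)

# Data: read one variable of a scenario variant for a single timestep...
population = store.read_scenario_variant_data(
    'population', 'high_growth', 'population_count', timestep=2020)

# ...and write a model's results (as a DataArray) for the same timestep
store.write_results(results, 'energy_central', 'water_supply',
                    timestep=2020, decision_iteration=0)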

Database design and interface

A database is one data store option which can be used by smif. Its interface, part of the data layer, provides an access point for managing the stored data.

Implemented store

A PostgreSQL relational database store for configuration data, optionally built locally by the user.

Methods

The interface supports methods for the data layer to write, read and manage data already in the database.

  • writing
    • supports the writing of new data - should be passed as a dictionary
  • reading
    • allows reading of data from the database - returned data is passed back as a dictionary
  • updating
    • allows the updating of data already in the database - pass a dictionary with only the data to be updated
      • to add/discuss - a second option to pass the full object definition, including values which are not to change?
  • deleting
    • deletes existing data from the database
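
A sketch of the dictionary convention, using hypothetical scenario methods:

# Writing: pass the new object as a dictionary
db.write_scenario({'name': 'population', 'description': 'UK population'})

# Reading: data is returned as a dictionary
scenario = db.read_scenario('population')

# Updating: pass a dictionary with only the fields to change
db.update_scenario('population', {'description': 'UK population, ONS 2016'})

# Deleting: remove the existing record by name
db.delete_scenario('population')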

Stories

Decide how to get initial values for between-timestep data

Labels:

  • data-layer
  • smif

Notes from GitHub: https://github.com/nismod/smif/pull/276#pullrequestreview-182210297

There is some code in the data_handle.get_data() method (_resolve_dependency) which seems to do something similar. Is it worth removing that, and ensuring that users explicitly define which data source to use in base timesteps and future timesteps, or do we want it to be resolved automatically in the data_handle?

...

The code in data_handle decides which source to pull from - scenario or previous model result - depending on whether we're in the base year. I think this is okay - it avoids having two differently-named inputs, one of which is only provided with data in the base year, the other only in non-base years.

I thought it was more ambiguous how best to handle a request for data from a timestep previous to the base timestep - always base_timestep - 1? base_timestep - inferred_timestep_stride? just base_timestep?
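
A sketch of the branching described above, as _resolve_dependency might implement it (simplified; the helper functions are placeholders, not smif API):

def _resolve_dependency(data_handle, input_name, timestep):
    # In the base timestep there are no previous model results, so
    # fall back to scenario data; otherwise read the linked output
    if timestep == data_handle.base_timestep:
        return read_scenario_data(input_name, timestep)
    return read_model_output(input_name, timestep)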

Move setup functionality into Store implementations

Labels:

  • data-layer
  • smif

Add an "initialize" method to Store to set up an empty store; a sketch follows this list:

  • a store should be responsible for setting itself up
  • a DataFileInterface should set up the folder structure it requires
  • a DatabaseInterface should set up the tables and connection it needs
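
A sketch of how that responsibility might look in each implementation; the attribute names and the schema constant are placeholders:

import os

import psycopg2  # PostgreSQL driver, as used by the implemented store

class DataFileInterface:
    def initialize(self):
        # Set up the folder structure this store requires
        for folder in ('config', 'data', 'results'):
            os.makedirs(os.path.join(self.base_folder, folder), exist_ok=True)

class DatabaseInterface:
    def initialize(self):
        # Set up the connection and tables this store needs;
        # CREATE_TABLES_SQL stands in for the schema DDL
        self.connection = psycopg2.connect(self.dsn)
        with self.connection, self.connection.cursor() as cursor:
            cursor.execute(CREATE_TABLES_SQL)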

Remove references to file locations from config store

Labels:

  • data-layer
  • smif

DataStore should be responsible for knowing where its files are; ConfigStore shouldn't care about data locations:

  • should simplify the Store implementation
  • suggest a compound key lookup, e.g. (a,b,c,d): file.ext, which could be stored in YAML at the root of the data file folder (sketched below)
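
The compound-key lookup might sketch out as follows; the key fields and file names are hypothetical:

# A mapping like this could live in a YAML file at the root of the
# data file folder; only the DataStore consults it
locations = {
    ('population', 'high_growth', 'population_count', 2020):
        'population_high_2020.csv',
}

def locate(scenario, variant, variable, timestep):
    # ConfigStore never sees these file paths
    return locations[(scenario, variant, variable, timestep)]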

Store to return objects from config calls

Labels:

  • data-layer
  • smif

Ensure validation takes place at the Store layer

Test available_results implementations

Labels:

  • data-layer
  • smif

Currently skipping TestWarmStart in test_data_store_csv.py (line 329 onwards) as the implementation has changed - warm start is implemented in Store, so DataStore implementations should only be concerned with reporting available_results.