praftery edited this page Sep 28, 2015 · 2 revisions

Input file

The program reads an input file as described by the input file format page. At a minimum, this input file contains a list of datetimes with associated target data (typically either energy consumption or power). Most files also contain other information that affects the target data, usually at least the outdoor dry-bulb air temperature.

Input features

Mave pre-processes the input file into a set of input features and target values. The target values are the values that mave is trying to predict. Input features are the values that a model uses to make a prediction. Both are needed to train a model, but only the input features are needed to make a prediction once a model has been trained.

By default, mave splits the datetime 'column' into 5 separate input features representing: the minute; hour of day; day of week; month; and whether or not the datetime was on (or near) a holiday. Mave uses the outdoor dry-bulb air temperature (if present in the file) to generate additional input features that represent the temperature a number of hours immediately prior to the measurement at that datetime. This helps capture the lag effects caused by thermal mass, such as an unusually warm or cool night, or a pre-cooling strategy. By default, mave generates 2 additional outdoor dry-bulb temperature input features from the last 24 hours: the temperature 8 hours and 16 hours prior to the current datetime. Mave uses each other data column in the input file as a single input feature. Examples might include 'total building occupancy' or 'units produced'.
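As a rough illustration of the feature generation described above, the following Python sketch builds the five datetime features and the 8 and 16 hour lagged temperature features. It is not mave's implementation: the function names and the HOLIDAYS set are hypothetical, the data is assumed to be evenly spaced hourly readings, and 'on (or near) a holiday' is simplified to an exact date match.

```python
from datetime import datetime, date, timedelta

# Hypothetical holiday list for illustration; mave's holiday handling may differ.
HOLIDAYS = {date(2014, 1, 1), date(2014, 12, 25)}

def datetime_features(dt):
    """Split one datetime into the five features described above."""
    return [
        dt.minute,
        dt.hour,
        dt.weekday(),                # 0 = Monday ... 6 = Sunday
        dt.month,
        int(dt.date() in HOLIDAYS),  # 1 if the date is a listed holiday
    ]

def lagged_temperatures(temps, index, lags_hours=(8, 16)):
    """Return the temperatures 8 and 16 hours before row `index`.

    Assumes one reading per hour, so a lag of n hours is n rows back.
    A lag that reaches before the start of the data yields None.
    """
    return [temps[index - lag] if index - lag >= 0 else None
            for lag in lags_hours]

# Example: 24 hourly readings starting at midnight on 1 January 2014.
start = datetime(2014, 1, 1, 0, 0)
times = [start + timedelta(hours=h) for h in range(24)]
temps = [10.0 + h for h in range(24)]  # made-up temperatures

row = 20
features = (datetime_features(times[row])
            + [temps[row]]
            + lagged_temperatures(temps, row))
```

Each row of the resulting feature matrix then holds the datetime features, the current temperature, and the two lagged temperatures. Rows too close to the start of the file have no lagged values available and would need to be dropped or filled.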

Standardization

If one input feature has a much bigger range (0-100000) than another input feature (0-1), then some machine learning methods are more likely to add weight to the former feature regardless of the actual correlation with the target data. To avoid this issue, it is best practice to standardize the input features and the target data so that each has zero mean and unit variance (mean value is 0 and standard deviation is 1). Mave performs this step internally.
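The standardization step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code mave runs internally; the function name and the sample values are made up for the example.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance.

    Returns the scaled values plus the mean and standard deviation,
    which are needed later to map predictions back to original units.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values], mu, sigma

# A feature with a large range and a feature with a small range.
occupancy = [0.0, 25000.0, 50000.0, 75000.0, 100000.0]
fraction = [0.0, 0.25, 0.5, 0.75, 1.0]

scaled_occupancy, occ_mu, occ_sigma = standardize(occupancy)
scaled_fraction, frac_mu, frac_sigma = standardize(fraction)
```

After scaling, both features span the same range even though their raw magnitudes differ by five orders of magnitude, so neither dominates purely because of its units.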

Splitting the input data

Mave also needs to know which parts of the input data belong to the pre-retrofit period and which belong to the post-retrofit period. There are several ways to define this, which allows for flexibility:

Define a 'test_size':

This is the simplest method to define the post-retrofit data period. The 'test_size' is the fraction of the file that contains the post-retrofit data. For example, a value of 0.2 means that the first 80% of the file is the pre-retrofit data and the last 20% of the file is post-retrofit data. The default value is 0.25. Users can specify a different value either as a command line argument, or in the configuration file.
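The effect of 'test_size' can be sketched as a simple positional split. This is an illustration only: the function name is hypothetical, and how mave rounds when the fraction does not divide the file evenly is an assumption here.

```python
def split_by_test_size(rows, test_size=0.25):
    """Split rows into (pre_retrofit, post_retrofit) by position.

    The last `test_size` fraction of the rows is treated as post-retrofit.
    Rounding to the nearest row is an assumption made for this sketch.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    n_post = int(round(len(rows) * test_size))
    split = len(rows) - n_post
    return rows[:split], rows[split:]

rows = list(range(100))  # stand-in for 100 rows of input data
pre, post = split_by_test_size(rows, test_size=0.2)
```

With 100 rows and a 'test_size' of 0.2, the first 80 rows form the pre-retrofit data and the last 20 rows form the post-retrofit data.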

Define a changepoint:

This allows the user to define multiple periods within the same input file. By default, all of the input file is assumed to be pre-retrofit data. Adding a changepoint with a post-retrofit tag splits the input file into pre- and post-retrofit data at that point. For example, adding the following changepoint to the configuration file means that the post-retrofit period begins on May 1st, 2013 at 00:00.

2013-05-01T0000 = post

Users can also specify a single changepoint as a command line argument for convenience. This is assumed to represent the start of the post-retrofit period. The following command line example uses the test size argument instead: it builds a model using the data in ex3.csv and assumes that the last 45% of the file represents the post-retrofit period.

mave ex3.csv -ts 0.45

Define a changepoint and discard some data:

Similarly, the changepoint feature allows you to specify some data to be discarded from both the pre-retrofit and post-retrofit periods. This can be useful if there is a known period when the measured energy consumption data (the target data) is not representative and the user wishes to exclude it. Defining it this way still allows the historical input features to be generated from that period. For example, this list of changepoints means that the energy consumption data after January 4th, 2014 at 00:00 will be ignored up until the post-retrofit period begins on January 5th, 2014 at 00:00. It is important to note that the historical outdoor air temperature data (if present) will still be used, which is the advantage of using the changepoint feature instead of simply deleting those rows from the input file entirely.

2014-01-04T0000 = ignore 
2014-01-05T0000 = post

Define multiple changepoints:

Lastly, an unlimited number of changepoints can be defined. This is useful where the 'retrofit' under evaluation is a building controls retrofit that can be switched on and off (perhaps multiple times). For example, the following list of changepoints means the model includes the data before January 1st, 2014 at 00:00 in the pre-retrofit data, discards the next day, includes the following week in the post-retrofit data, discards the next day, includes the following week as pre-retrofit data, discards the next day, and includes all the subsequent data as post-retrofit data.

2014-01-01T0000 = ignore 
2014-01-02T0000 = post
2014-01-09T0000 = ignore
2014-01-10T0000 = pre 
2014-01-17T0000 = ignore
2014-01-18T0000 = post
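
To make the changepoint semantics concrete, the following Python sketch assigns a tag to any datetime given a list of changepoints in the format shown above. It is an illustration rather than mave's own parser: the function names are hypothetical, and it assumes that each changepoint takes effect at its own timestamp (inclusive) and that data before the first changepoint is pre-retrofit.

```python
from bisect import bisect_right
from datetime import datetime

def parse_changepoints(lines):
    """Parse 'YYYY-MM-DDTHHMM = tag' lines into sorted (datetime, tag) pairs."""
    points = []
    for line in lines:
        stamp, tag = (part.strip() for part in line.split("="))
        points.append((datetime.strptime(stamp, "%Y-%m-%dT%H%M"), tag))
    return sorted(points)

def label(dt, points, default="pre"):
    """Return the tag in force at dt.

    The tag comes from the most recent changepoint at or before dt;
    datetimes before the first changepoint get the default tag.
    """
    times = [t for t, _ in points]
    i = bisect_right(times, dt)
    return points[i - 1][1] if i else default

changepoints = parse_changepoints([
    "2014-01-01T0000 = ignore",
    "2014-01-02T0000 = post",
    "2014-01-09T0000 = ignore",
    "2014-01-10T0000 = pre",
    "2014-01-17T0000 = ignore",
    "2014-01-18T0000 = post",
])

tags = [label(d, changepoints) for d in (
    datetime(2013, 12, 31, 23, 0),  # before the first changepoint
    datetime(2014, 1, 1, 12, 0),    # discarded day
    datetime(2014, 1, 5, 0, 0),     # first post-retrofit week
    datetime(2014, 1, 12, 0, 0),    # second pre-retrofit week
    datetime(2014, 1, 18, 0, 0),    # exactly on the final changepoint
)]
```

Rows tagged 'ignore' would be left out of the target data for training and prediction, while their temperature readings remain available for building the lagged input features of later rows.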