The code is entered through main.py. When run, it creates a DataSet_Builder object, which preprocesses the data and builds a DataSet object that runs the machine learning methods. Results are saved to a json file for later viewing.
Run the code using
python3 main.py <dataset-level_params>.json <grid_search_params>.json
-
For example, one could run:
python3 main.py singleparam.json singleannparam.json
-
To get the dataset-level param results, we used:
python3 main.py params1.json singleannparam.json
-
To get the grid search results, use:
python3 main.py params.json ann_params.json
Note that in order to break up the results, we actually ran several different files called
ann_params_*.json
These json files are constructed using parambuild.py (more notes below) and contain a list of dictionaries; each dictionary holds the parameter values used to construct the DataSet and run the models.
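The pattern these json files follow is simple: a list of dictionaries, one per parameter combination. A minimal sketch of how such a file might be generated (the parameter names below are hypothetical placeholders, not necessarily the ones parambuild.py actually uses):

```python
import itertools
import json

# Hypothetical parameter grids; the real names live in parambuild.py.
grid = {
    "hidden_layers": [1, 2, 3],
    "learning_rate": [0.001, 0.01],
    "epochs": [50, 100],
}

# Cartesian product of all values -> one dict per combination.
keys = list(grid)
combos = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]

# A file like this can then be passed to main.py on the command line.
with open("ann_params_example.json", "w") as f:
    json.dump(combos, f, indent=2)
```

Here the 3 x 2 x 2 grid yields 12 dictionaries, which is why splitting the grid search across several `ann_params_*.json` files keeps individual runs manageable.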
Here is a list of the currently relevant files and folders:
.
+-- dataset.py
+-- datasetbuilder.py
+-- main.py
+-- *.json
+-- gridsearch.ps1
+-- scrape.py
+-- parambuild.py
+-- OLS.py
+-- ANN.py
+-- TSNN.py
+-- _data
| +-- site*.pkl
| +-- (data.pkl)
| +-- info.txt
| +-- merged.csv
| +-- sitedict.py
+-- _results
| +-- _dataset-level_params
| +-- _grid_searches
We will not cover a full list of attributes here; more information about each code file can be found in the files themselves.
dataset.py
contains the DataSet class with these and other methods:
- impute_inputs(): takes in a future date and estimates the "X" input matrix values for that date by averaging values from that day and surrounding days in previous years
- run_OLS(): runs the OLS functions and stores results
- run_ANN(): runs the ANN functions and stores results
- run_TSNN(): runs the TSNN functions and stores results
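The impute_inputs() idea, averaging the same day (plus a small window) across previous years, can be sketched roughly as follows. This is a simplified stand-in, not the actual method: the window size, column layout, and DatetimeIndex assumption are ours.

```python
import pandas as pd

def impute_inputs(df, future_date, window_days=3):
    """Estimate feature values for future_date by averaging observations
    on the same day-of-year (within +/- window_days) across past years.

    Assumes df has a DatetimeIndex and numeric feature columns.
    """
    target = pd.Timestamp(future_date).dayofyear
    doy = df.index.dayofyear
    # Circular day-of-year distance so late Dec and early Jan are neighbors.
    dist = (doy - target) % 365
    mask = (dist <= window_days) | (dist >= 365 - window_days)
    return df.loc[mask].mean()

# Tiny example: two years of daily data, then impute a 2020 date
# by averaging the mid-June window across 2018-2019.
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")
df = pd.DataFrame({"temp": range(len(idx))}, index=idx)
estimate = impute_inputs(df, "2020-06-15")
```

The circular distance is the detail worth noting: a naive `abs(doy - target)` would treat 31 Dec and 1 Jan as 364 days apart.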
datasetbuilder.py
: contains the DataSet_Builder class with these and other methods:
- clean_df(): drops rows with NaN or -99.9 values
- format_date(): converts dates to a cylindrical representation
- use_rect_radius(): reduces the number of sites by a rectangular radius
- use_pca(): uses PCA to reduce the number of features
- remove_outliers(): uses IsolationForest to remove outliers
- scale_data(): min-max scales data from 0 to 1
- build_dataset(): builds a DataSet object
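If "cylindrical" here means the usual sine/cosine encoding of day-of-year (an assumption on our part), then format_date() maps each date onto a circle so that 31 Dec and 1 Jan end up numerically close, instead of 364 apart:

```python
import math
from datetime import date

def encode_date(d, period=365.25):
    """Map a date's day-of-year onto the unit circle (sin, cos)."""
    doy = d.timetuple().tm_yday
    angle = 2 * math.pi * doy / period
    return math.sin(angle), math.cos(angle)

# New Year's Eve and New Year's Day land next to each other on the circle:
dec31 = encode_date(date(2019, 12, 31))
jan01 = encode_date(date(2020, 1, 1))
gap = math.dist(dec31, jan01)  # small Euclidean distance
```

A mid-year date, by contrast, ends up on the opposite side of the circle from both of them, which is exactly the behavior a regression model wants from a periodic feature.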
main.py
is the main script to run: it uses a DataSet_Builder object to build a DataSet object, runs the ML methods, and saves the results
*.json
: several files used as inputs to main.py
gridsearch.ps1
: the final code was run on Windows, so this is a Windows PowerShell script that simply runs main.py with different command line arguments
scrape.py
is a standalone script that scrapes the website for data:
- gets each site from SITEDICT
- gets each year from 1980 to 2019
parambuild.py
is a standalone script that uses lists and for loops to build a list of dictionaries of parameter combinations, both for dataset-level parameters and for neural net parameters used by the grid search. The json files it saves can be used as command line arguments for main.py.
OLS.py
contains functions that compute the OLS and make a prediction
ANN.py
contains functions that create a FFNN model, train it, and make a prediction
TSNN.py
contains functions that train a time series recurrent neural net and make predictions for the next four weeks.
_data folder
site*.pkl
: individual pickles for each site (all years)
data.pkl
: overall data pickle
- only appears after constructing a DataSet object
- is over 100 MB, so it's in the gitignore
info.txt
: site info, including number, name, latitude, longitude
merged.csv
: csv file of all data
- currently ordered by date then site
- note that the data is rather sparse
- if you make changes, delete data.pkl (which will regenerate)
sitedict.py
: uses info.txt to build a dictionary SITEDICT of sites
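The data.pkl behavior above (regenerates when missing, so it is safe to delete) is the standard pickle-cache pattern. A generic sketch of that pattern, not the project's actual code:

```python
import os
import pickle

def load_or_build(path, build):
    """Return the cached object at `path`, rebuilding and re-saving it
    if the file is missing -- the same behavior data.pkl follows:
    delete the file and it regenerates on the next construction."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()  # e.g. merge the per-site pickles into one frame
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

Because the cached object is rebuilt only when the file is absent, any change to the underlying data (merged.csv, the site pickles) requires deleting the cache by hand, which is why the note above says to delete data.pkl after changes.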
_results folder
dataset-level_params folder
: contains graphs and a json file of results for various combinations of dataset-level parameters, used to determine the optimal dataset-level parameters. The folder also contains an INFO.txt file with more information.
grid_search_params folder
: contains multiple subfolders with graphs and result json files that make up the grid search. The folder contains an INFO.txt file with more information.
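As a closing illustration of the simplest model in the pipeline: the fit-and-predict step that OLS.py provides can be written generically in a few lines of numpy. This is a textbook sketch with an added intercept column, not the project's actual implementation:

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_new):
    """Least-squares fit with an intercept column, then predict.
    A generic stand-in for what OLS.py's functions do."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    return A_new @ beta

# Recovers y = 2x + 1 exactly from noiseless data:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
pred = ols_fit_predict(X, y, np.array([[4.0]]))
```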