The goal of this project is to develop a Spark pipeline that implements a variety of machine learning algorithms for predicting solar irradiance. The software is designed to run on the Google Cloud Platform and to read its data from a Google Cloud Storage bucket.
This section describes the prerequisites and contains instructions for getting the project up and running.
The project can be set up on the Google Cloud Platform by creating a cluster with Google's Cloud Dataproc service. Doing so creates Compute Engine VM instances, from which you can open the VM instance of the master node.
NOTE: The `conda-dataproc-bootstrap.sh` script in the `scripts` directory needs to be used as an "Initialization Action" while creating the cluster.
The Initialization Action sets up a `conda` environment on all the VM instances, in which Spark, the latest version of Python, and the following libraries are installed: `findspark`, `pytest`, `pvlib`, `siphon`, `tables`, `netCDF4`, `pyAesCrypt`.
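As a rough illustration, cluster creation with the Initialization Action might look like the commands below; the bucket name, cluster name, region, and worker count are placeholders, not project defaults.

```sh
# Stage the bootstrap script in Cloud Storage (Initialization Actions must live in a bucket).
gsutil cp einstein/scripts/conda-dataproc-bootstrap.sh gs://<your-bucket>/

# Create a Dataproc cluster that runs the script on every node at startup.
gcloud dataproc clusters create einstein-cluster \
    --region us-central1 \
    --num-workers 2 \
    --initialization-actions gs://<your-bucket>/conda-dataproc-bootstrap.sh
```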
To copy the `einstein` project to the VM instance, there are two alternatives:
- Google Cloud SDK: when installed on the local machine, the user can copy the contents of this repository to the Dataproc VM instance using the command:
  > gcloud compute scp --recurse /complete/path/to/repository/* <user>@<instance_name-vm>:~/
- The project directory can be compressed using the command:
  > tar -zvcf einstein.tar.gz einstein

  The compressed archive can then be uploaded to the VM instance using the "Upload File" option in the VM, and decompressed using the command:
  > tar -xzf einstein.tar.gz
To set up the bash utilities for the project, run the following command:
> bash einstein/scripts/setup.sh
To run the experiments and analyze solar irradiance predictions, the user can run the following command:
> einstein --options
The user can get a description of the options by using the command:
> einstein --help
- Python
- PySpark - Python API for Apache Spark
- Spark MLlib - Spark's machine learning library (a brief usage sketch follows below)
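For orientation, the sketch below shows the general shape of a Spark MLlib regression job of the kind this pipeline is built around; the Cloud Storage path, column names, and choice of model are illustrative assumptions rather than the project's actual configuration.

```python
# Illustrative sketch only: the GCS path, column names, and model choice below
# are assumptions for demonstration, not the pipeline's actual configuration.
import findspark
findspark.init()  # make the cluster's Spark installation importable

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("irradiance-example").getOrCreate()

# Read weather features from a Cloud Storage bucket (hypothetical path).
df = spark.read.csv("gs://<your-bucket>/weather_features.csv",
                    header=True, inferSchema=True)

# Assemble the (hypothetical) feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["temperature", "humidity", "cloud_cover"],
    outputCol="features")
train_df = assembler.transform(df)

# Fit a simple linear regression against the irradiance label column.
lr = LinearRegression(featuresCol="features", labelCol="irradiance")
model = lr.fit(train_df)
print("Training RMSE:", model.summary.rootMeanSquaredError)

spark.stop()
```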
There are no strict guidelines for contributing, beyond a few general conventions we try to follow:
- Code should follow the PEP 8 style guide as closely as possible.
- Python modules are documented with Google-style docstrings (see the example below).
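For reference, a minimal, made-up function documented in the Google docstring style looks like this:

```python
def clearsky_index(measured: float, clearsky: float) -> float:
    """Compute the ratio of measured to clear-sky irradiance.

    This is a made-up example function, included only to illustrate the
    Google docstring format used throughout the project.

    Args:
        measured: Measured global horizontal irradiance in W/m^2.
        clearsky: Modelled clear-sky irradiance in W/m^2.

    Returns:
        The dimensionless clear-sky index.

    Raises:
        ValueError: If ``clearsky`` is not positive.
    """
    if clearsky <= 0:
        raise ValueError("clearsky must be positive")
    return measured / clearsky
```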
If you see something that could be improved, send a pull request!
We are always happy to look at improvements and new experiments, to ensure that `einstein`, as a project, is the best version of itself.
If you think something should be done differently (or is just-plain-broken), please create an issue.
See the Contributors file for details about individual contributions.