To create a virtual environment for this repo and install the initial package requirements, run:
source initial_config.sh
Next, initialize the database. Start by opening a psql shell:
psql postgres
Once you are in the SQL shell, run
CREATE DATABASE airflow_works;
Then run \l to check that the database was created.
Next, run the following whenever you work in this repo to reset the Airflow constants:
source config.sh
As a validation step, run echo $AIRFLOW_WORKS_DBURL in the terminal window to see whether the Airflow config is linked to the correct database. The database name should match the name that follows postgres://localhost:port_number/ in the URL.
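The database-name check above can also be scripted. The following is a minimal Python sketch, assuming the URL has the postgres://localhost:port_number/name form described above. The helper names database_name and points_at are hypothetical and are not part of this repo.

```python
import os
from urllib.parse import urlparse


def database_name(db_url):
    """Return the database name, i.e. the URL path after host:port/."""
    return urlparse(db_url).path.lstrip("/")


def points_at(db_url, expected_name):
    """Return True when the URL's database name equals expected_name."""
    return database_name(db_url) == expected_name


if __name__ == "__main__":
    # AIRFLOW_WORKS_DBURL is the variable echoed in the validation step.
    # The empty default keeps the script from raising when it is unset.
    url = os.environ.get("AIRFLOW_WORKS_DBURL", "")
    print(database_name(url), points_at(url, "airflow_works"))
```

Run it in the same terminal where config.sh was sourced, so that the environment variable is visible to the Python process.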
Run psql $AIRFLOW_WORKS_DBURL to check that you can connect to the database. If you can, exit the psql shell with \q and you are ready to roll.
- Follow through on the installation requirements set above.
- After the database is correctly set up and linked to Airflow via the config,
open a terminal window with the correct pyenv and run
airflow webserver -p 8080
- Open another terminal window with the correct pyenv (will be automated) and run
airflow scheduler
- Follow the instructions here: http://airflow.apache.org/start.html
- Creating a setup for a sample DAG (follow through all the instructions to install pre-req packages)
- Create a PostgreSQL database (follow steps here) (in my case, I called it airflow_works)
- Create a new user with a new password
- Test run the DAG using localhost:8080, and create a new connection for the database created
- Create tasks in a sample DAG
- Create a game (rock-paper-scissors) that outputs its results
- Run the game on demand and store the results with timestamp in the database using Airflow
- (WIP) Create unit tests for tasks in a sample DAG
- Unit-testing the functionality of the rock-paper-scissors game
- Validation test framework for result logs coming out from the game
- Create API connection for external data crawler
- Read https://dev.socrata.com/consumers/getting-started.html for starters on Socrata API.
- Set up the API for data from https://data.cms.gov/browse?q=Medicare%20Provider%20Utilization%20and%20Payment%20Data%3A%202015%20Part%20D%20Prescriber&sortBy=relevance (relevant to https://github.com/sfbrigade/datasci-open-payments)
- Read up on the definition of ETL
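
As an illustration of the rock-paper-scissors item in the list above, here is a minimal Python sketch of a game that returns a timestamped result record. It is not the repo's actual implementation: the names decide_winner and play_round, and the record keys, are assumptions made for this example. A function like play_round could be called from an Airflow task, and decide_winner is small enough to unit-test directly.

```python
import random
from datetime import datetime, timezone

MOVES = ("rock", "paper", "scissors")
# Each key beats the move it maps to.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def decide_winner(move_one, move_two):
    """Return 'player_one', 'player_two', or 'draw' for a single round."""
    if move_one not in BEATS or move_two not in BEATS:
        raise ValueError("moves must be one of %s" % (MOVES,))
    if move_one == move_two:
        return "draw"
    return "player_one" if BEATS[move_one] == move_two else "player_two"


def play_round(rng=random):
    """Play one random round and return a timestamped result record."""
    move_one, move_two = rng.choice(MOVES), rng.choice(MOVES)
    return {
        "played_at": datetime.now(timezone.utc).isoformat(),
        "player_one": move_one,
        "player_two": move_two,
        "winner": decide_winner(move_one, move_two),
    }


if __name__ == "__main__":
    print(play_round())
```

Passing a seeded random.Random instance as rng makes a round reproducible, which helps when validating result logs.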
Vineet Goel (Robinhood): [Why Robinhood Uses Airflow](https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
[Useful Quora for .bashrc/.bash_profile](https://www.quora.com/What-is-bash_profile-and-what-is-its-use)