# Data Engineering Zoomcamp Project

This folder contains my project for the Data Engineering Zoomcamp by DataTalks.Club.
- Data Engineering Zoomcamp Project
  - Problem
  - Dataset
  - Dashboard
  - Project details and implementation
  - Reproduce the project
    - Prerequisites
    - Create a Google Cloud Project
    - Creating an environment variable for the credentials
    - Install and setup Google Cloud SDK
    - Create a VM instance
    - Set up SSH access to the VM
    - Starting and stopping your instance with gcloud sdk after you shut it down
    - Installing the required software in the VM
    - Upload/download files to/from your instance
    - Clone the repo in the VM
    - Set up project infrastructure with Terraform
    - Set up data ingestion with Airflow
    - Perform the data ingestion
    - Setting up dbt Cloud
    - Deploying models in dbt Cloud with a Production environment
    - Creating a dashboard
## Problem

This is a simple project which takes data from the GitHub Archive and transforms it in order to visualize the top GitHub contributors as well as the amount of commits per day.
## Dataset

The chosen dataset for this project is the GitHub Archive. The dataset contains every single public event made by GitHub users, from basic repo activity such as commits, forks and pull requests to comments, actions and all other events.
## Dashboard

You may access the dashboard with the visualizations in this link.
## Project details and implementation

This project makes use of Google Cloud Platform, particularly Cloud Storage and BigQuery.
Cloud infrastructure is mostly managed with Terraform, except for Airflow and dbt instances (detailed below in the Reproduce the project section).
Data ingestion is carried out by an Airflow DAG. The DAG downloads new data hourly and ingests it into a Cloud Storage bucket which acts as the Data Lake for the project. The source dataset is in JSON format; the DAG transforms it in order to get rid of any payload objects and parquetizes the data before uploading it. The DAG also creates an external table in BigQuery for querying the parquet files.
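For orientation, here is a minimal bash sketch of what a single hourly cycle looks like conceptually; the actual DAG performs the payload stripping and the JSON-to-Parquet conversion in Python, and the bucket name below is a placeholder rather than the name used by the project.

```bash
# Download one hour of GitHub Archive data (file name pattern: YYYY-MM-DD-H.json.gz)
wget https://data.gharchive.org/2022-04-01-15.json.gz

# The DAG would now strip the payload objects and convert the JSON to Parquet;
# this sketch uploads the raw file as-is, purely for illustration.

# Upload the file to the Cloud Storage bucket that acts as the data lake
gsutil cp 2022-04-01-15.json.gz gs://<your-bucket>/raw/

# Finally, the DAG (re)creates an external BigQuery table pointing at the
# Parquet files in the bucket so that they can be queried directly.
```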
The Data Warehouse is defined with dbt. It creates a table with all the info in the parquet files. The table is partitioned by day and clustered on actor IDs.

dbt is also used for creating the transformations needed for the visualizations. A view is created in a staging phase containing only the PushEvents (a PushEvent contains one or more commits), and a final table containing the commit count per user is materialized in the deployment phase.
The visualization dashboard is a simple Google Data Studio report with 2 widgets.
## Reproduce the project

### Prerequisites

The following requirements are needed to reproduce the project:
- A Google Cloud Platform account.
- (Optional) The Google Cloud SDK. Instructions for installing it are below.
  - Most instructions below will assume that you are using the SDK for simplicity.
- (Optional) An SSH client.
  - All the instructions listed below assume that you are using a Terminal and SSH.
- (Optional) VSCode with the Remote-SSH extension.
  - Any other IDE should work, but VSCode makes it very convenient to forward ports in remote VMs.
Development and testing were carried out using a Google Cloud Compute VM instance. I strongly recommend that a VM instance is used for reproducing the project as well. All the instructions below will assume that a VM is used.
### Create a Google Cloud Project

Access the Google Cloud dashboard and create a new project from the dropdown menu on the top left of the screen, to the right of the Google Cloud Platform text.
After you create the project, you will need to create a Service Account with the following roles:
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- Viewer
Download the Service Account credentials file, rename it to `google_credentials.json` and store it in your home folder, in `$HOME/.google/credentials/`.
IMPORTANT: if you're using a VM as recommended, you will have to upload this credentials file to the VM.
You will also need to activate the following APIs:
- https://console.cloud.google.com/apis/library/iam.googleapis.com
- https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com
If all of these steps are unfamiliar to you, please review Lesson 1.3.1 - Introduction to Terraform Concepts & GCP Pre-Requisites (YouTube), or check out my notes here.
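If you prefer the command line over the console, the service account, roles, credentials file and APIs above can also be set up with gcloud. The commands below are a sketch: the service account name is arbitrary and `<your-project-id>` is a placeholder for your own project ID.

```bash
# Create the service account (the name is arbitrary)
gcloud iam service-accounts create zoomcamp-sa --project=<your-project-id>

# Grant the four roles listed above
for role in roles/bigquery.admin roles/storage.admin roles/storage.objectAdmin roles/viewer; do
  gcloud projects add-iam-policy-binding <your-project-id> \
    --member="serviceAccount:zoomcamp-sa@<your-project-id>.iam.gserviceaccount.com" \
    --role="$role"
done

# Download a key file straight to the expected location
mkdir -p $HOME/.google/credentials
gcloud iam service-accounts keys create $HOME/.google/credentials/google_credentials.json \
  --iam-account=zoomcamp-sa@<your-project-id>.iam.gserviceaccount.com

# Enable the required APIs
gcloud services enable iam.googleapis.com iamcredentials.googleapis.com
```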
### Creating an environment variable for the credentials

Create an environment variable called `GOOGLE_APPLICATION_CREDENTIALS` and assign it the path of your json credentials file, which should be `$HOME/.google/credentials/google_credentials.json`. Assuming you're running bash:

- Open `.bashrc` with `nano ~/.bashrc`
- At the end of the file, add the following line:
    ```bash
    export GOOGLE_APPLICATION_CREDENTIALS="<path/to/authkeys>.json"
    ```
- Exit nano with `Ctrl+X`. Follow the on-screen instructions to save the file and exit.
- Log out of your current terminal session and log back in, or run `source ~/.bashrc` to activate the environment variable.
- Refresh the token and verify the authentication with the GCP SDK:
    ```bash
    gcloud auth application-default login
    ```
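If you'd rather not edit `.bashrc` by hand, the following non-interactive equivalent appends the variable and verifies it, assuming you stored the credentials file in the path described above:

```bash
# Append the export line to .bashrc and load it in the current shell
echo 'export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.google/credentials/google_credentials.json"' >> ~/.bashrc
source ~/.bashrc

# Check that the variable is set, then refresh the application-default credentials
echo $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
```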
### Install and setup Google Cloud SDK

- Download the Gcloud SDK from this link and install it according to the instructions for your OS.
- Initialize the SDK following these instructions.
  - Run `gcloud init` from a terminal and follow the instructions.
  - Make sure that your project is selected with the command `gcloud config list`
### Create a VM instance

- From your project's dashboard, go to Cloud Compute > VM instance
- Create a new instance:
  - Any name of your choosing.
  - Pick your favourite region. You can check out the regions in this link.
    - IMPORTANT: make sure that you use the same region for all of your Google Cloud components.
  - Pick an E2 series instance. An e2-standard-4 instance is recommended (4 vCPUs, 16GB RAM).
  - Change the boot disk to Ubuntu. The Ubuntu 20.04 LTS version is recommended. Also pick at least 30GB of storage.
  - Leave all other settings on their default value and click on Create.

Alternatively, the following command will create a VM using the recommended settings from above. Make sure that the region matches your choice:

```bash
gcloud compute instances create <name-of-the-vm> --zone=<google-cloud-zone> --image-family=ubuntu-2004-lts --image-project=ubuntu-os-cloud --machine-type=e2-standard-4 --boot-disk-size=30GB
```
### Set up SSH access to the VM

- Start your instance from the VM instances dashboard in Google Cloud.
- In your local terminal, make sure that the gcloud SDK is configured for your project. Use `gcloud config list` to list your current config's details.
  - If you have multiple google accounts but the current config does not match the account you want:
    - Use `gcloud config configurations list` to see all of the available configs and their associated accounts.
    - Change to the config you want with `gcloud config configurations activate my-project`
  - If the config matches your account but points to a different project:
    - Use `gcloud projects list` to list the projects available to your account (it can take a while to load).
    - Use `gcloud config set project my-project` to change your current config to your project.
- Set up the SSH connection to your VM instances with `gcloud compute config-ssh`
  - Inside `~/.ssh/` a new `config` file should appear with the necessary info to connect.
  - If you did not have an SSH key, a pair of public and private SSH keys will be generated for you.
  - The output of this command will give you the host name of your instance in this format: `instance.zone.project`; write it down.
- You should now be able to open a terminal and SSH to your VM instance like this:
    ```bash
    ssh instance.zone.project
    ```
- In VSCode, with the Remote SSH extension, if you run the command palette and look for Remote-SSH: Connect to Host (or alternatively you click on the Remote SSH icon on the bottom left corner and click on Connect to Host), your instance should now be listed. Select it to connect to it and work remotely.
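If you're not using VSCode, you can still forward ports over plain SSH; this will come in handy later for reaching the Airflow web UI that runs on the VM. The host name is the `instance.zone.project` name generated by `gcloud compute config-ssh`:

```bash
# Plain SSH session
ssh instance.zone.project

# SSH session that also forwards the VM's port 8080 (used later by Airflow)
# to localhost:8080 on your local machine
ssh -L 8080:localhost:8080 instance.zone.project
```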
### Starting and stopping your instance with gcloud sdk after you shut it down

- List your available instances with `gcloud compute instances list`
- Start your instance with `gcloud compute instances start <instance_name>`
- Set up ssh so that you don't have to manually change the IP in your config files: `gcloud compute config-ssh`
- Once you're done working with the VM, you may shut it down to avoid consuming credit: `gcloud compute instances stop <instance_name>`
### Installing the required software in the VM

- Run this first in your SSH session:
    ```bash
    sudo apt update && sudo apt -y upgrade
    ```
  - It's a good idea to run this command often, once per day or every few days, to keep your VM up to date.

Install Docker:

- Run `sudo apt install docker.io` to install it.
- Change your settings so that you can run Docker without `sudo`:
  - Run `sudo groupadd docker`
  - Run `sudo gpasswd -a $USER docker`
  - Log out of your SSH session and log back in.
  - Run `sudo service docker restart`
  - Test that Docker can run successfully with `docker run hello-world`
Install Docker compose:

- Go to https://github.com/docker/compose/releases and copy the URL for the `docker-compose-linux-x86_64` binary for its latest version.
  - At the time of writing, the latest available version is `v2.2.3` and the URL for it is https://github.com/docker/compose/releases/download/v2.2.3/docker-compose-linux-x86_64
- Create a folder for binary files for your Linux user:
  - Create a subfolder `bin` in your home account with `mkdir ~/bin`
  - Go to the folder with `cd ~/bin`
- Download the binary file with `wget <compose_url> -O docker-compose`
  - If you forget to add the `-O` option, you can rename the file with `mv <long_filename> docker-compose`
  - Make sure that the `docker-compose` file is in the folder with `ls`
- Make the binary executable with `chmod +x docker-compose`
  - Check the file with `ls` again; it should now be colored green. You should now be able to run it with `./docker-compose version`
- Go back to the home folder with `cd ~`
- Run `nano .bashrc` to modify your PATH environment variable:
  - Scroll to the end of the file.
  - Add this line at the end: `export PATH="${HOME}/bin:${PATH}"`
  - Press `CTRL` + `o` on your keyboard and press Enter afterwards to save the file.
  - Press `CTRL` + `x` on your keyboard to exit the Nano editor.
- Reload the path environment variable with `source .bashrc`
- You should now be able to run Docker compose from anywhere; test it with `docker-compose version`
Install Terraform:

- Run `curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -`
- Run `sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"`
- Run `sudo apt-get update && sudo apt-get install terraform`
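You can quickly check that the three tools installed in this section are available with their version commands:

```bash
# Print the installed versions of Docker, Docker compose and Terraform
docker --version
docker-compose version
terraform -version
```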
Make sure that you upload the `google_credentials.json` file to `$HOME/.google/credentials/` and that you create the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, as specified in the Creating an environment variable for the credentials section.
### Upload/download files to/from your instance

- Download a file:
    ```bash
    # From your local machine
    scp <instance_name>:path/to/remote/file path/to/local/file
    ```
- Upload a file:
    ```bash
    # From your local machine
    scp path/to/local/file <instance_name>:path/to/remote/file
    ```
- You can also drag & drop files in VSCode with the remote extension.
- If you use a client like Cyberduck, you can connect with SFTP to your instance using the `instance.zone.project` name as server, and adding the generated private ssh key.
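For example, uploading the credentials file from earlier could look like this, assuming it lives in `$HOME/.google/credentials/` on your local machine and `<instance_name>` is your instance's host name:

```bash
# Create the target folder on the VM and copy the credentials file into it
ssh <instance_name> "mkdir -p ~/.google/credentials"
scp $HOME/.google/credentials/google_credentials.json <instance_name>:~/.google/credentials/
```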
### Clone the repo in the VM

Log in to your VM instance and run the following from your `$HOME` folder:

```bash
git clone https://github.com/ziritrion/dataeng-zoomcamp.git
```

The contents of the project can be found in the `dataeng-zoomcamp/7_project` folder.

IMPORTANT: I strongly suggest that you fork my project and clone your copy so that you can easily perform changes on the code, because you will need to customize a few variables in order to make it run with your own infrastructure.
### Set up project infrastructure with Terraform

Make sure that the credentials are updated and the environment variable is set up.

- Go to the `dataeng-zoomcamp/7_project/terraform` folder.
- Open `variables.tf` and edit line 11 under the `variable "region"` block so that it matches your preferred region.
- Initialize Terraform:
    ```bash
    terraform init
    ```
- Plan the infrastructure and make sure that you're creating a bucket in Cloud Storage as well as a dataset in BigQuery:
    ```bash
    terraform plan
    ```
- If the plan details are as expected, apply the changes:
    ```bash
    terraform apply
    ```

You should now have a bucket called `data_lake` and a dataset called `gh-archive-all` in BigQuery.
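If you want to double-check what Terraform created, the following commands list the buckets and BigQuery datasets in your project; the exact names may include a project-specific prefix or differ slightly, so compare against the Terraform output:

```bash
# List the Cloud Storage buckets in your project; the data lake bucket should appear here
gsutil ls

# List the BigQuery datasets in your project; the dataset created by Terraform should appear here
bq ls --project_id=<your-project-id>
```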
### Set up data ingestion with Airflow

- Go to the `dataeng-zoomcamp/7_project/airflow` folder.
- Run the following command and write down the output:
    ```bash
    echo -e "AIRFLOW_UID=$(id -u)"
    ```
- Open the `.env` file and change the value of `AIRFLOW_UID` to the output of the previous command.
- Change the value of `GCP_PROJECT_ID` to your project ID in Google Cloud and the value of `GCP_GCS_BUCKET` to the name of your bucket.
- Build the custom Airflow Docker image:
    ```bash
    docker-compose build
    ```
- Initialize the Airflow configs:
    ```bash
    docker-compose up airflow-init
    ```
- Run Airflow:
    ```bash
    docker-compose up
    ```

You may now access the Airflow GUI by browsing to `localhost:8080`. Username and password are both `airflow`. If you're working on a VM, you will need to forward port 8080 to your local machine first (VSCode can do this for you, or use the `ssh -L` example from the SSH section above).

IMPORTANT: this is NOT a production-ready setup! The username and password for Airflow have not been modified in any way; you can find them by searching for `_AIRFLOW_WWW_USER_USERNAME` and `_AIRFLOW_WWW_USER_PASSWORD` inside the `docker-compose.yaml` file.
### Perform the data ingestion

If you performed all the steps of the previous section, you should now have a web browser with the Airflow dashboard.

The DAG is set up to download all data starting from April 1st 2022. You may change this date by modifying line 202 of `dataeng-zoomcamp/7_project/airflow/dags/data_ingestion.py`. It is not recommended to retrieve data earlier than January 1st 2015, because that data was retrieved with a different API and has not been tested to work with this pipeline. Should you change the DAG start date, you will have to delete the DAG in the Airflow UI and wait a couple of minutes so that Airflow can pick up the changes in the DAG.

To trigger the DAG, simply click on the switch icon next to the DAG name. The DAG will retrieve all data from the starting date to the latest available hour and then perform hourly checks on every 30-minute mark.

After the data ingestion, you may shut down Airflow by pressing `Ctrl+C` on the terminal running Airflow and then running `docker-compose down`, or you may keep Airflow running if you want to update the dataset every hour. If you shut down Airflow, you may also shut down the VM instance because it won't be needed for the following steps.
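Once a few DAG runs have completed, you can sanity-check the ingested data from the command line. The bucket path, dataset and external table names below are placeholders; use the names created by Terraform and the DAG in your own setup:

```bash
# List the Parquet files uploaded by the DAG
gsutil ls gs://<your-bucket>/

# Count the rows visible through the external table created by the DAG
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `<your-project-id>.<your-dataset>.<external_table>`'
```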
### Setting up dbt Cloud

- Create a dbt Cloud account.
- Create a new project.
  - Give it a name (`gh-archive` is recommended), and under Advanced settings, input `7_project/dbt` as the Project subdirectory.
  - Choose BigQuery as a database connection.
  - Choose the following settings:
    - You may leave the default connection name.
    - Upload a Service Account JSON file > choose the `google_credentials.json` we created previously.
    - Under BigQuery Optional Settings, make sure that you put your Google Cloud location under Location.
    - Under Development credentials, choose any name for the dataset. This name will be added as a prefix to the schemas. In this project the name `dbt` was used.
    - Test the connection and click on Continue once the connection is tested successfully.
  - In the Add repository from form, click on Github and choose your fork from your user account. Alternatively, you may provide a URL and clone the repo.
- Once the project has been created, you should now be able to click on the hamburger menu on the top left and click on Develop to load the dbt Cloud IDE.

You may now run the `dbt run` command in the bottom prompt to run all models; this will generate 3 different datasets in BigQuery:

- `<prefix>_dwh` hosts the data warehouse materialized table with all of the ingested data.
- `<prefix>_staging` hosts the staging views for generating the final end-user tables.
- `<prefix>_core` hosts the end-user tables.
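After `dbt run` finishes, you can sanity-check the end-user table from the command line; replace `<prefix>` with the dataset prefix you chose during setup. The table and column names below match the ones used later in the dashboard:

```bash
# Show the ten users with the most commits, straight from the core dataset
bq query --use_legacy_sql=false \
  'SELECT actor_login, commit_count
   FROM `<your-project-id>.<prefix>_core.users`
   ORDER BY commit_count DESC
   LIMIT 10'
```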
### Deploying models in dbt Cloud with a Production environment

- Click on the hamburger menu on the top left and click on Environments.
- Click on the New Environment button on the top right.
- Give the environment a name (`Production` is recommended), make sure that the environment is of type Deployment, and in the Credentials section, you may input a name in the Dataset field; this will add a prefix to the schemas, similarly to what we did when setting up the development environment (`production` is the recommended prefix, but any prefix will do, or you may leave it blank).
- Create a new job with the following settings:
  - Give it any name; `dbt run` is recommended.
  - Choose the environment you created in the previous step.
  - Optionally, you may click on the Generate docs? checkbox.
  - In the Commands section, add the command `dbt run`
  - In the Triggers section, inside the Schedule tab, make sure that the Run on schedule? checkbox is checked. Select custom cron schedule and input the string `40 * * * *`; this will run the models every hour on the 40th minute (the DAG runs on the 30th minute; the 10 minute delay is to make sure that the DAG has run successfully).
- Save the job.

You may now trigger the job manually or wait for the scheduled trigger to run it. The first time you run it, 3 new datasets will be added to BigQuery following the same pattern as in the development environment.
### Creating a dashboard

The dashboard used in this project was generated with Google Data Studio (GDS from now on). Dashboards in GDS are called reports. Reports grab data from data sources. We will need to generate 2 data sources and a report:

- Generate the data sources.
  - Click on the Create button and choose Data source.
  - Click on the BigQuery connector.
  - Choose your Google Cloud project, choose your production core dataset and click on the `users` table. Click on the Connect button at the top.
  - You may rename the data source by clicking on the name at the top left of the screen. The default name will be the name of the chosen table.
  - Click on the GDS icon on the top left to go back to the GDS front page and repeat all of the previous steps, but choose the production staging dataset and the `stg_commits` table.
- Generate the report.
  - Click on the Create button and choose Report.
  - If the Add data to report pop-up appears, choose the My data sources tab and choose one of the data sources. Click on the Add to report button on the confirmation pop-up.
  - Once you're in the report page, click on the Add data button on the top bar and choose the other data source you created.
  - You may delete any default widgets created by GDS.
- Add the Top Github Contributors widget.
  - Click on the Add a chart button on the top bar and select Table.
  - On the left bar, in Data source, choose the data source with the `users` table.
  - In Dimension, choose `actor_login` as the only dimension.
  - In Metric, choose `commit_count` as the only metric.
  - In Sort, choose `commit_count` and click on Descending.
- Add the Commits per day widget.
  - Click on the Add a chart button on the top bar and select Time series chart.
  - On the left bar, in Data source, choose the data source with the `stg_commits` table.
  - In Dimension, choose `created_at` as the only dimension.
  - In Metric, choose `Record Count` as the only metric.
You should now have a functioning dashboard.