dbt Data Modeling

By Ruben Giosa

This repo covers the same data modeling, loading, and normalization exercise as my data-modeling-cr repository, but leverages dbt instead. Data Stack Academy will be incorporating this project into their dbt curriculum going forward. Below is a snapshot of work from this project; each portion is outlined in more detail further down in this README.md:

snapshot of work


Technologies Used

  • Google BigQuery
  • dbt
  • SQL
  • Jinja
  • JSON
  • Git
  • Markdown
  • .gitignore
  • requirements.txt

Description

This repo includes exercises for data modeling, loading, and normalization for the processing of air travel tickets. The main tickets source file, data/air_travel/tickets.json, contains JSON row records; each record represents an issued flight ticket, such as:

{
  "eticket_num": "498-938211-0795",
  "confirmation": "ZVFDC4",
  "ticket_date": "2022-03-23",
  "price": 723.42,
  "seat": "31I",
  "status": "active",

  "airline": {
    "name": "China Eastern Airlines",
    "iata": "MU",
    "icao": "CES",
    "callsign": "CHINA EASTERN",
    "country": "China"
  },

  "origin": {
    "name": "Montreal / Pierre Elliott Trudeau International Airport",
    "city": "Montreal",
    "country": "Canada",
    "iata": "YUL",
    "icao": "CYUL",
    "latitude": 45.47,
    "longitude": -73.74,
    "altitude": 118,
    "tz_timezone": "America/Toronto"
  },

  "destination": {
    "name": "Chicago Midway International Airport",
    "city": "Chicago",
    "country": "United States",
    "iata": "MDW",
    "icao": "KMDW",
    "latitude": 41.79,
    "longitude": -87.75,
    "altitude": 620,
    "tz_timezone": "America/Chicago"
  },

  "passenger": {
    "first_name": "Robert",
    "last_name": "Brown",
    "gender": "M",
    "birth_date": "1969-02-17",
    "email": "[email protected]",
    "street": "5007 Thomas Way",
    "city": "Lake Hollystad",
    "state": "DC",
    "zip": "20027"
  }
}

A ticket contains a few different entities:

  • Ticket info: e-ticket number, confirmation code, price, ...
  • Airline: airline name, iata code, call sign, country, ...
  • Source and destination airports: airport name, iata code, location, ...
  • Passengers: passenger name, birth date, address



In this exercise, the following steps are performed:

  1. The JSON file is loaded into BigQuery
  2. An ELT pipeline is developed that normalizes the source file into our facts and dimensions
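
As a rough sketch of step 2, a dbt staging model can flatten each nested ticket record into one wide row once the raw JSON is loaded into BigQuery. The model and source names below (stg_tickets, air_travel.tickets) are assumptions for illustration; the actual names in this project may differ:

-- models/stg_tickets.sql (hypothetical name)
-- Flattens each nested JSON ticket record into one wide row.
select
    eticket_num,
    confirmation,
    ticket_date,
    price,
    seat,
    status,
    airline.iata     as airline_iata,
    origin.iata      as origin_iata,
    destination.iata as destination_iata,
    passenger.email  as passenger_email
from {{ source('air_travel', 'tickets') }}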



Exercise 1: Data Modeling

Using draw.io, a data model was created (see below and also in the main notebook) that meets the following requirements:

  1. Contains a tickets fact table
  2. Contains the following dimensions: airlines, airports, and passengers
  3. Develops passengers as an SCD Type 2 dimension:
    • Passenger email can be used as the natural key
    • Be sure to add a surrogate key and effective start/end dates
  4. IATA codes can be used as the primary key for both airlines and airports
  5. Use the e-ticket number as the primary key for the tickets fact

Data Model:


data model



Exercise 2: Data Loading and Normalization

An ELT pipeline was created that loads the dimensions and facts from the source file and meets the following requirements:

General:

  • Leverages dbt, SQL, and Jinja to create an ELT pipeline that builds the dims and facts outlined in the data model above. The lineage graph below shows the final ELT pipeline:

lineage graph

Airlines Dim:

  • Identifies unique airlines
  • Uses IATA code as the dimension key
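
A minimal sketch of this model, assuming the raw records are registered as a dbt source named air_travel.tickets (model and source names are hypothetical):

-- models/dim_airlines.sql (hypothetical name)
-- One row per unique airline, keyed by its IATA code.
select distinct
    airline.iata     as airline_iata,
    airline.name     as name,
    airline.icao     as icao,
    airline.callsign as callsign,
    airline.country  as country
from {{ source('air_travel', 'tickets') }}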

Airports Dim:

  • Identifies unique airports from both origin and destination fields
  • Uses IATA code as the dimension key
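
A minimal sketch along the same lines, gathering airports from both nested structs before deduplicating (names again hypothetical):

-- models/dim_airports.sql (hypothetical name)
-- Collects airports from both the origin and destination fields,
-- then deduplicates; IATA code is the dimension key.
with all_airports as (
    select origin.* from {{ source('air_travel', 'tickets') }}
    union distinct
    select destination.* from {{ source('air_travel', 'tickets') }}
)
select
    iata,
    name,
    city,
    country,
    icao,
    latitude,
    longitude,
    altitude,
    tz_timezone
from all_airports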

Passengers Dim:

  • Identifies unique passengers
  • Uses the passenger email as the dimension natural key
  • Uses UUIDs for the dimension surrogate keys
  • Sets the effective start date to a fixed set date in the past
  • Sets the effective end date to NULL
  • Passenger address columns are considered SCD Type 2 columns
  • All other columns are SCD Type 1
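
A minimal sketch of the initial load described above, using BigQuery's GENERATE_UUID() for the surrogate keys (model and source names are hypothetical, as is the fixed start date):

-- models/dim_passengers.sql (hypothetical name)
-- One row per unique passenger; email is the natural key, and a UUID
-- surrogate key plus effective dates support SCD Type 2 handling.
with unique_passengers as (
    select distinct
        passenger.email      as email,
        passenger.first_name as first_name,
        passenger.last_name  as last_name,
        passenger.gender     as gender,
        passenger.birth_date as birth_date,
        passenger.street     as street,
        passenger.city       as city,
        passenger.state      as state,
        passenger.zip        as zip
    from {{ source('air_travel', 'tickets') }}
)
select
    generate_uuid()    as passenger_key,         -- UUID surrogate key
    up.*,
    date('2000-01-01') as effective_start_date,  -- fixed date in the past
    cast(null as date) as effective_end_date     -- NULL marks the current row
from unique_passengers as up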

Tickets Fact:

  • Links to airlines and airports dimensions by their IATA codes.
  • Links to the passengers dimension by its surrogate key, which is found by performing a lookup against the passengers dimension
  • Loads all the tickets. See image below of dimension and fact tables loaded into BigQuery.

tables in Bigquery
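
A minimal sketch of the fact model, reusing the hypothetical stg_tickets and dim_passengers models sketched above; the passenger surrogate key comes from a lookup on email:

-- models/fct_tickets.sql (hypothetical name)
-- One row per ticket; airlines/airports link directly by IATA code,
-- while the passenger surrogate key is looked up by email.
select
    t.eticket_num,
    t.confirmation,
    t.ticket_date,
    t.price,
    t.seat,
    t.status,
    t.airline_iata,
    t.origin_iata,
    t.destination_iata,
    p.passenger_key
from {{ ref('stg_tickets') }} as t
left join {{ ref('dim_passengers') }} as p
    on t.passenger_email = p.email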



Setup/Installation Requirements

  • Go to https://github.com/rgiosa10/dbt-indpnt-proj.git to find the specific repository for this project.
  • Then open your terminal. I recommend going to your Desktop directory:
    cd Desktop
  • Then clone the repository by inputting:
    git clone https://github.com/rgiosa10/dbt-indpnt-proj.git
  • Go to the new directory or open the directory folder on your desktop:
    cd dbt-indpnt-proj
  • Once in the directory you will need to set up a virtual environment in your terminal:
    python3.7 -m venv venv
  • Then activate the environment:
    source venv/bin/activate
  • pip install dbt with the BigQuery plugin:
    pip install dbt-bigquery
  • Check the version to make sure it succeeded:
    dbt --version

In GitHub, create a new repository for this demo. Follow the directions on how to create a new repository on the command line:

# create a README
echo "dbt demo project" >> README.md

# initialize the git project
git init
git add README.md

# create a commit and push to remote
git commit -m "first commit"
git branch -M main
git remote add origin <URL_for_GitHub_repo>
git push -u origin main 
  • Load the data files in the data folder into BigQuery, creating a dataset and table under your project.

Now we'll initialize a dbt project. Let's call it "demo_dbt":

# initialize a dbt project
dbt init demo_dbt
  • You'll see a series of prompts. Choose BigQuery, with 'service_account' as the authentication method. If you don't remember the path to your JSON key, use echo $GOOGLE_APPLICATION_CREDENTIALS. Enter your GCP project ID (gcloud projects list for reference).

  • When you're done with the prompts, you'll see the message

Profile demo_dbt written to /home/<user_name>/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
23:28:12  
Your new dbt project "demo_dbt" was created!
  • Go into the newly created '.dbt' folder and take a look at the profiles.yml file. Then run the usual code . to open VSCode.

  • In your terminal, from within the demo_dbt directory, run dbt debug. This will check that the config files, dependencies, and connection to your BigQuery database are good to go.

  • Now you can run dbt run! (Again, make sure you're in the demo_dbt directory.) This will build the built-in demo models in the demo_dbt/models/example/ folder into the air_travel dataset on BigQuery. If you look at your BigQuery page now, you'll see the dbt tables there.

  • With your virtual environment enabled and the proper requirements installed, open the directory:

    code .
  • If you are interested in accessing or viewing the contents of this project on BigQuery and Looker, please reach out and we can discuss getting access.


Known Bugs

  • No known bugs

License

MIT License

Copyright (c) 2022 Ruben Giosa

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

