Sparkify Data Warehouse

ETL Pipeline in AWS Redshift and S3

Project Summary

In this project, I build an ETL (Extract, Transform, Load) pipeline for a large dataset from a fictitious music streaming service named Sparkify. The pipeline extracts the raw data from Amazon Web Services' (AWS) Simple Storage Service (S3) and loads it into staging tables in Amazon Redshift, AWS's data warehouse service.

I then transform the staged data into analytics tables. These tables help Sparkify's analytics team gain insight into its customer base.

File Descriptions

create_tables.py

Creates the fact and dimension tables for the star schema in Redshift.
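A minimal sketch of how a table-creation script like this is typically structured, assuming the connection parameters live under the [CLUSTER] section of dwh.cfg and that sql_queries.py exposes drop_table_queries and create_table_queries lists (both names are assumptions, not verified against the repository):

```python
import configparser

import psycopg2

# Assumed names: sql_queries.py is taken to expose these two lists.
from sql_queries import create_table_queries, drop_table_queries


def main():
    # Read cluster credentials from dwh.cfg; the keys are assumed to be
    # ordered host, dbname, user, password, port.
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()
        )
    )
    cur = conn.cursor()

    # Drop any existing tables, then create them fresh.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```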

sql_queries.py

Defines the SQL statements, which are then imported by the other two scripts.
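To illustrate the kind of statement this module holds, below is a hedged sketch of a COPY query that loads JSON event data from S3 into a staging table, authorized by the IAM role configured in dwh.cfg. The table name, S3 paths, config key (ARN), and list name are illustrative assumptions:

```python
import configparser

config = configparser.ConfigParser()
config.read('dwh.cfg')

# Hypothetical staging-table load: copies raw JSON log data from S3 into
# Redshift using the IAM role ARN stored under [IAM_ROLE] in dwh.cfg.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{}'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
""".format(config['IAM_ROLE']['ARN'])

# The other scripts import grouped lists of queries rather than
# individual statements (list names are assumptions).
copy_table_queries = [staging_events_copy]
```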

etl.py

Loads data from S3 into the staging tables on Redshift, then processes that data into the analytics tables.
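A hedged sketch of the driver logic, again assuming credentials are read from dwh.cfg and that sql_queries.py exposes copy_table_queries and insert_table_queries lists (assumed names):

```python
import configparser

import psycopg2

# Assumed names for the grouped query lists.
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    """Run the COPY statements that pull raw data from S3 into staging."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Transform staged rows into the fact and dimension tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()
        )
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```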

Design Decisions

Star Schema

A star schema is used: the songplays fact table sits at the center, with the dimension tables at its periphery.

Entity Relationship Diagram

Fact table: songplays -- every occurrence of a song being played is recorded here. A hedged sketch of its table definition follows the list of dimension tables below.

Dimension tables:

  • users -- the users of the Sparkify music streaming app

  • songs -- the songs in Sparkify's music catalog

  • artists -- the artists who record the catalog's songs

  • time -- the timestamps of records in songplays, broken down into specific date and time units (year, day, hour, etc.)
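
In the style of sql_queries.py, the songplays fact table definition might look like the sketch below. The columns follow the star-schema layout described above, but the exact column names, data types, and the choice of distribution and sort keys are assumptions, not the repository's actual definition:

```python
# Hypothetical CREATE TABLE for the songplays fact table. Distributing on
# song_id and sorting on start_time are common Redshift choices for this
# kind of schema, not confirmed details of this project.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT       NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR   DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    )
"""
```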

Run Instructions

  1. Clone this repository, which will place the 3 .py files and the .cfg file into the same directory.
  2. Duplicate the dwh_template.cfg file to create a new file named dwh.cfg. Because this will contain private login credentials, be sure it is added to the .gitignore file.
  3. Fill in the [CLUSTER] and [IAM_ROLE] sections with the details of the Redshift cluster and IAM role you have already created in AWS (see the example layout after these steps). Consult AWS's well-documented instructions as necessary.
  4. Run python create_tables.py to create the warehouse tables in the Redshift cluster.
  5. Run python etl.py. This copies the two large source datasets from S3 into the staging tables and then populates the fact and dimension tables.
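
For reference, dwh.cfg might be laid out as below. The [CLUSTER] and [IAM_ROLE] section names come from step 3; every key name and placeholder value is an illustrative assumption, so defer to dwh_template.cfg for the authoritative layout:

```ini
; Illustrative layout only -- key names are assumptions; see dwh_template.cfg.
[CLUSTER]
HOST=<redshift-cluster-endpoint>
DB_NAME=<database-name>
DB_USER=<database-user>
DB_PASSWORD=<database-password>
DB_PORT=5439

[IAM_ROLE]
ARN=<iam-role-arn>
```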
