Using Apache Airflow, AWS Redshift, and AWS S3
In this project, I build on a previous ETL pipeline project, introducing more automated and better-monitored pipelines orchestrated with Apache Airflow. The data again comes from a fictitious music streaming service named Sparkify.
The pipeline channels data from Amazon Web Services' (AWS) Simple Storage Service (S3) into an AWS Redshift data warehouse, landing it first in staging tables. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.
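As a rough sketch of that staging step (the bucket path, table name, IAM role, and connection id below are illustrative assumptions, not the project's actual values), loading the JSON logs from S3 into a Redshift staging table comes down to issuing a COPY statement through Airflow's Postgres hook:

```python
# Minimal sketch of the S3 -> Redshift staging step.
# Bucket, table, IAM role, and connection id are illustrative assumptions.
from airflow.providers.postgres.hooks.postgres import PostgresHook

COPY_SQL = """
    COPY staging_events
    FROM 's3://example-sparkify-bucket/log_data'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS JSON 's3://example-sparkify-bucket/log_json_path.json'
    REGION 'us-west-2';
"""

def stage_events_to_redshift():
    """Run the COPY against the Redshift cluster registered as an Airflow connection."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # assumed connection id
    redshift.run(COPY_SQL)
```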
The pipelines are dynamic, built from reusable tasks, monitored, and allow for easy backfills. Data quality checks run automatically after the ETL steps execute over the data warehouse, catching any discrepancies in the datasets before they reach analysis.
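A minimal sketch of how such a DAG could be configured for scheduled runs and backfills follows; the DAG id, owner, start date, and schedule are assumptions for illustration, not the project's actual settings:

```python
# Sketch of a DAG configuration enabling retries, hourly scheduling, and backfills.
# DAG id, owner, dates, and schedule are illustrative assumptions.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "sparkify",
    "depends_on_past": False,        # each run is independent of earlier runs
    "retries": 3,                    # retry failed tasks before surfacing an error
    "retry_delay": timedelta(minutes=5),
    "email_on_retry": False,
}

dag = DAG(
    "sparkify_etl",
    default_args=default_args,
    description="Load and transform Sparkify data in Redshift with Airflow",
    start_date=datetime(2019, 1, 12),
    schedule_interval="@hourly",     # one run per hour of log data
    catchup=True,                    # re-running past intervals provides easy backfills
)
```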
In Airflow, I create custom operators that stage the data in Redshift, load the data warehouse tables, and run data quality checks.
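To illustrate the shape of one such custom operator, here is a hedged sketch of a data quality operator that fails its task when a monitored table is empty; the class name, constructor arguments, and the specific check are assumptions, not the project's exact code:

```python
# Minimal sketch of a custom data-quality operator.
# Class name, arguments, and the emptiness check are illustrative assumptions.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if any monitored table returns no rows."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed for %s", table)
```

Raising an exception inside `execute` marks the task as failed in Airflow, so discrepancies surface directly in the DAG's monitoring view rather than silently propagating downstream.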