markplotlib/airflow

Sparkify Data Pipeline

using Apache Airflow, AWS Redshift and AWS S3

Project Summary

In this project, I build on a previous ETL pipeline project, making the pipelines more automated and better monitored, primarily through Apache Airflow. As before, the data comes from a fictitious music streaming service named Sparkify.

The pipeline loads data from Amazon Web Services (AWS) Simple Storage Service (S3) into staging tables in an AWS Redshift data warehouse. The source datasets consist of JSON logs that record user activity in the application and JSON metadata about the songs users listen to.
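As a rough illustration of the staging step, the sketch below builds a Redshift COPY statement that loads JSON files from S3 into a staging table. The function name, table name, bucket path, IAM role ARN, and region are all placeholders chosen for this example; they are not taken from the project.

```python
def build_copy_sql(table, s3_path, iam_role, json_option="auto", region="us-west-2"):
    """Return a Redshift COPY statement that loads JSON from S3 into `table`.

    `json_option` is either 'auto' or the S3 path of a JSONPaths file.
    All argument values used below are hypothetical.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Example with placeholder values (not the project's real bucket or role).
copy_sql = build_copy_sql(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(copy_sql)
```

Keeping the statement in a small function means the same template can serve both the log data and the song metadata, with only the table and S3 path changing.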

The pipelines are dynamic and built from reusable tasks. They are monitored and allow for easy backfills. Automated data quality checks run against the data warehouse to catch any discrepancies in the datasets.
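A data quality check of this kind can be as simple as asserting that a table is not empty and that a key column has no NULLs. The sketch below shows the idea, using an in-memory SQLite database as a stand-in for Redshift so that it runs anywhere. The function names, the `users` table, and its columns are assumptions made for the example, not the project's actual checks.

```python
import sqlite3


def check_has_rows(conn, table):
    """Raise if `table` is empty; otherwise return its row count."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if count == 0:
        raise ValueError(f"Data quality check failed: {table} has no rows")
    return count


def check_no_nulls(conn, table, column):
    """Raise if `column` contains any NULL values in `table`."""
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    if nulls > 0:
        raise ValueError(
            f"Data quality check failed: {table}.{column} has {nulls} NULLs"
        )
    return nulls


# Stand-in warehouse: SQLite replaces Redshift for this sketch only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, level TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "paid")])

row_count = check_has_rows(conn, "users")
null_count = check_no_nulls(conn, "users", "user_id")
```

Raising an exception on failure matters here: a scheduler such as Airflow marks a task as failed when its code raises, which is what surfaces the discrepancy.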

Apache Airflow

In Airflow, I create custom operators to perform tasks that stage the data, fill the data warehouse, and run data quality checks.
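To give a sense of what a custom operator looks like, here is a minimal sketch of one that fills a warehouse table. It subclasses Airflow's `BaseOperator` and puts its work in `execute`, which is the method Airflow calls when the task runs. The class name `LoadTableOperator` and its parameters are hypothetical, and SQL execution is injected as a plain callable (`run_sql`) so the sketch can run without a database. A real operator would more likely query Redshift through an Airflow hook. If Airflow is not installed, a tiny stand-in base class is used instead.

```python
try:
    from airflow.models import BaseOperator
except ImportError:
    # Minimal stand-in so this sketch runs without Airflow installed.
    class BaseOperator:
        def __init__(self, *, task_id, **kwargs):
            self.task_id = task_id


class LoadTableOperator(BaseOperator):
    """Hypothetical operator: fills `table` from the result of `select_sql`."""

    def __init__(self, *, table, select_sql, run_sql, truncate=False, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.select_sql = select_sql
        self.run_sql = run_sql      # callable that executes one SQL string
        self.truncate = truncate    # empty the table before loading

    def execute(self, context):
        if self.truncate:
            self.run_sql(f"TRUNCATE TABLE {self.table}")
        self.run_sql(f"INSERT INTO {self.table} {self.select_sql}")


# Usage: collect the SQL in a list instead of sending it to a database.
executed = []
load_songs = LoadTableOperator(
    task_id="load_songs_table",
    table="songs",
    select_sql="SELECT song_id, title FROM staging_songs",
    run_sql=executed.append,
    truncate=True,
)
load_songs.execute(context={})
```

Because the table name and the SELECT statement are parameters, one operator class can be reused for every table that needs loading.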

Figure: flow of tasks

About

Data Pipeline and task scheduler using Apache Airflow
