Using Apache Airflow, AWS Redshift, and AWS S3
In this project, I build on a previous ETL pipeline project, introducing more automated and better-monitored pipelines orchestrated with Apache Airflow. The data again comes from a fictitious music streaming service named Sparkify.
The pipeline channels data from Amazon Web Services' (AWS) Simple Storage Service (S3) into an AWS Redshift data warehouse, landing it first in staging tables. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.
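As a rough sketch of that staging step (the bucket path, table name, IAM role, and connection id below are illustrative assumptions, not the project's actual values), loading the JSON logs from S3 into a Redshift staging table comes down to issuing a COPY statement through Airflow's Postgres hook:

```python
# Minimal sketch of the S3 -> Redshift staging step.
# Bucket, table, IAM role, and connection id are illustrative assumptions.
from airflow.providers.postgres.hooks.postgres import PostgresHook

COPY_SQL = """
    COPY staging_events
    FROM 's3://example-sparkify-bucket/log_data'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS JSON 's3://example-sparkify-bucket/log_json_path.json'
    REGION 'us-west-2';
"""

def stage_events_to_redshift():
    """Run the COPY against the Redshift cluster registered as an Airflow connection."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # assumed connection id
    redshift.run(COPY_SQL)
```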
The pipelines are dynamic, built from reusable tasks, monitored, and allow for easy backfills. Data quality checks run automatically after the ETL steps execute over the data warehouse, catching any discrepancies in the datasets before they reach analysis.
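A minimal sketch of how such a DAG could be configured for scheduled runs and backfills follows; the DAG id, owner, start date, and schedule are assumptions for illustration, not the project's actual settings:

```python
# Sketch of a DAG configuration enabling retries, hourly scheduling, and backfills.
# DAG id, owner, dates, and schedule are illustrative assumptions.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "sparkify",
    "depends_on_past": False,        # each run is independent of earlier runs
    "retries": 3,                    # retry failed tasks before surfacing an error
    "retry_delay": timedelta(minutes=5),
    "email_on_retry": False,
}

dag = DAG(
    "sparkify_etl",
    default_args=default_args,
    description="Load and transform Sparkify data in Redshift with Airflow",
    start_date=datetime(2019, 1, 12),
    schedule_interval="@hourly",     # one run per hour of log data
    catchup=True,                    # re-running past intervals provides easy backfills
)
```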
In Airflow, I create custom operators that stage the data in Redshift, load the data warehouse tables, and run data quality checks.
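To illustrate the shape of one such custom operator, here is a hedged sketch of a data quality operator that fails its task when a monitored table is empty; the class name, constructor arguments, and the specific check are assumptions, not the project's exact code:

```python
# Minimal sketch of a custom data-quality operator.
# Class name, arguments, and the emptiness check are illustrative assumptions.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if any monitored table returns no rows."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed for %s", table)
```

Raising an exception inside `execute` marks the task as failed in Airflow, so discrepancies surface directly in the DAG's monitoring view rather than silently propagating downstream.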