This project is an overview of the Quality Movies Data Analysis Pipeline, which takes IMDb movies data from S3 and runs data quality checks on it. Records that fail the checks are stored in a dedicated S3 folder for review, and records that pass are loaded into a Redshift data warehouse for further analysis. The tech stack includes S3, Glue Crawler, Glue Catalog, Glue Catalog Data Quality, Glue Low Code ETL, Redshift, EventBridge, SNS, and Step Functions, among others.
-
We will create an S3 bucket "movies-data-yb" with multiple folders for the input data, bad records, quality check outcomes, etc.
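
A minimal sketch of this step in Python, assuming boto3 is used to talk to S3 (the README does not say how the bucket is created, and the console works equally well). The helper name `create_folders` is hypothetical; the folder names are the ones mentioned in this README. S3 has no real folders, so the sketch writes zero-byte objects whose keys end in `/`.

```python
from typing import Iterable, List

BUCKET = "movies-data-yb"
# Folder names taken from this README; add more as your layout requires.
FOLDERS = ["input_data", "bad_records", "rule_outcome_from_etl"]


def create_folders(s3_client, bucket: str, folders: Iterable[str]) -> List[str]:
    """Create zero-byte placeholder objects that act as folders.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the keys that were written.
    """
    keys = []
    for folder in folders:
        key = folder.strip("/") + "/"  # a trailing slash makes the console show a folder
        s3_client.put_object(Bucket=bucket, Key=key)
        keys.append(key)
    return keys


# Example usage (requires AWS credentials, so it is left commented out):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.create_bucket(Bucket=BUCKET)  # outside us-east-1, also pass CreateBucketConfiguration
#   create_folders(s3, BUCKET, FOLDERS)
#   s3.upload_file("imbd_movies_ratings.csv", BUCKET, "input_data/imbd_movies_ratings.csv")
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.
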
-
Upload the movies data file "imbd_movies_ratings.csv" to the input_data folder.
-
We will create a Glue ETL job "Movies-Data-Analysis" to perform the tasks below:
- Run data quality checks on the data against the defined rules
- Load failed records into the "bad_records" S3 folder for review
- Load the data quality rule outcomes into S3
- Load the records that passed into a Redshift table
- We will create an EventBridge rule that is triggered by the Data Quality check execution and sends the output to an SNS topic.
NOTE: Before running the Glue job, you should have the S3, Glue, and CloudWatch monitoring endpoints created in your VPC.
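
The EventBridge rule for the Data Quality results could be expressed roughly as the event pattern below. This is a sketch, not the exact rule used in this project: the `source` and `detail-type` values are the ones AWS documents for Glue Data Quality events as far as I know, the `detail.context.jobName` filter is an assumption worth verifying against a sample event, and the helper name `build_dq_event_pattern` is hypothetical.

```python
import json


def build_dq_event_pattern(job_name: str) -> dict:
    """Build an EventBridge pattern matching Glue Data Quality results for one job."""
    return {
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
        # Assumption: the event detail carries the Glue job name under context.jobName.
        "detail": {"context": {"jobName": [job_name]}},
    }


pattern_json = json.dumps(build_dq_event_pattern("Movies-Data-Analysis"))

# Example usage with boto3 (rule name and topic ARN are placeholders):
#   import boto3
#   events = boto3.client("events")
#   events.put_rule(Name="movies-dq-rule", EventPattern=pattern_json)
#   events.put_targets(
#       Rule="movies-dq-rule",
#       Targets=[{"Id": "sns-target", "Arn": "<your SNS topic ARN>"}],
#   )
```

The SNS topic's access policy also has to allow EventBridge to publish to it, otherwise the rule matches but no notification arrives.
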
-
Failed data is loaded into the "bad_records" S3 bucket folder.
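
To illustrate the idea of routing rows by rule outcome, here is a plain-Python sketch. It is not how Glue Data Quality is implemented, and the rules and column names (`title`, `rating`) are hypothetical, since this README does not list the actual rules. Rows that break any rule go to the failed list, tagged with the names of the rules they broke; the rest go to the passed list.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Rule = Callable[[Row], bool]


def _is_valid_rating(value: str) -> bool:
    """True when the value parses as a number between 0 and 10."""
    try:
        return 0.0 <= float(value) <= 10.0
    except ValueError:
        return False


# Hypothetical rules for illustration only.
RULES: Dict[str, Rule] = {
    "title_present": lambda row: bool(row.get("title", "").strip()),
    "rating_in_range": lambda row: _is_valid_rating(row.get("rating", "")),
}


def split_rows(rows: List[Row], rules: Dict[str, Rule]) -> Tuple[List[Row], List[Row]]:
    """Return (passed, failed); failed rows gain a 'failed_rules' column."""
    passed: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        broken = [name for name, rule in rules.items() if not rule(row)]
        if broken:
            failed.append({**row, "failed_rules": ",".join(broken)})
        else:
            passed.append(row)
    return passed, failed
```

In the actual pipeline the failed list corresponds to what lands in "bad_records" and the passed list to what is loaded into Redshift.
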
-
Rule outcomes are loaded into the "rule_outcome_from_etl" S3 bucket folder.
-
The final passed data is loaded into the Redshift "imdb_movies_rating" table.
- We will create a state machine "Movies-Data-Pipeline" using the Step Functions service.
This state machine will run the crawler and execute the Glue job, sending an SNS notification on success and on failure.
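
One possible shape for such a state machine, written as a Python function that builds the Amazon States Language definition. Treat it as a sketch under assumptions: the crawler name and topic ARN are placeholders not given in this README, the state names are made up, and the 30-second polling loop is just one way to wait for the crawler, since starting a crawler returns immediately.

```python
import json


def build_definition(crawler_name: str, job_name: str, topic_arn: str) -> dict:
    """Build an ASL definition: crawl, wait, run the Glue job, notify via SNS."""

    def notify(message: str, next_state: str = "") -> dict:
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": topic_arn, "Message": message},
        }
        if next_state:
            state["Next"] = next_state
        else:
            state["End"] = True
        return state

    return {
        "Comment": "Run the crawler, then the Glue job, then notify",
        "StartAt": "StartCrawler",
        "States": {
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "WaitForCrawler",
            },
            "WaitForCrawler": {"Type": "Wait", "Seconds": 30, "Next": "GetCrawler"},
            "GetCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "CrawlerReady",
            },
            # The crawler reports READY once it is idle again.
            "CrawlerReady": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.Crawler.State",
                        "StringEquals": "READY",
                        "Next": "RunGlueJob",
                    }
                ],
                "Default": "WaitForCrawler",
            },
            "RunGlueJob": {
                "Type": "Task",
                # .sync makes the state wait until the job run finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "Next": "NotifySuccess",
            },
            "NotifySuccess": notify(f"Glue job {job_name} succeeded"),
            "NotifyFailure": notify(f"Glue job {job_name} failed", "PipelineFailed"),
            "PipelineFailed": {"Type": "Fail"},
        },
    }


definition = build_definition(
    crawler_name="movies-data-crawler",  # placeholder crawler name
    job_name="Movies-Data-Analysis",
    topic_arn="arn:aws:sns:us-east-1:123456789012:movies-topic",  # placeholder ARN
)
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON can be pasted into the Step Functions console, or passed as the `definition` argument of the boto3 `stepfunctions` client's `create_state_machine` call together with a `name` and a `roleArn`. Only the Glue job step has a failure handler here; a `Catch` on the crawler states could be added the same way.
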