Quality-Movie-Data-Analysis-Project

Project Overview

This project builds a Quality Movies Data Analysis Pipeline that reads the IMDb movies data from S3 and runs data quality checks on it. Records that fail the checks are stored in a dedicated S3 folder for review, and records that pass are loaded into a Redshift data warehouse for further analysis. The tech stack includes S3, Glue Crawler, Glue Catalog, Glue Catalog Data Quality, Glue low-code ETL, Redshift, EventBridge, SNS, and Step Functions.

Architectural Diagram

Key Steps

1. Create an S3 bucket

We will create an S3 bucket "movies-data-yb" with multiple folders for the input data, bad data, quality check outcomes, etc.
Upload the movies data file "imbd_movies_ratings.csv" to the input_data folder.
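
The bucket layout described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's own code: the folder names `bad_records` and `rule_outcome_from_etl` are taken from the test step later in this README, the helper names are made up for illustration, and the bucket is assumed to exist already. The S3 client is passed in so the helper can be tried without AWS credentials.

```python
# Folder names come from this README; the helper names are hypothetical.
BUCKET = "movies-data-yb"
FOLDERS = ("input_data", "bad_records", "rule_outcome_from_etl")


def folder_keys(folders=FOLDERS):
    """Return the zero-byte 'folder marker' keys, e.g. 'input_data/'."""
    return [f"{name}/" for name in folders]


def create_layout(s3_client, bucket=BUCKET):
    """Write one empty object per folder and return the keys written.

    s3_client is expected to behave like boto3.client("s3").
    Assumption: the bucket itself has already been created.
    """
    keys = folder_keys()
    for key in keys:
        s3_client.put_object(Bucket=bucket, Key=key)
    return keys
```

With real credentials this would be called as `create_layout(boto3.client("s3"))`. S3 has no true folders, so the trailing-slash keys only make the prefixes visible in the console.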

2. Create a Glue Crawler

We will create a Glue Crawler "crawl-movies-data-s3" to crawl the input data schema from S3.
Run the crawler and check the result.
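
For readers who prefer scripting over the console, the crawler setup might look like the sketch below. The crawler name, bucket, and folder come from this README; the IAM role and the Glue Catalog database name are not stated here, so they are left as parameters the caller must supply.

```python
def crawler_request(role_arn, database):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": "crawl-movies-data-s3",
        "Role": role_arn,          # assumption: an IAM role that Glue can assume
        "DatabaseName": database,  # assumption: an existing Glue Catalog database
        "Targets": {"S3Targets": [{"Path": "s3://movies-data-yb/input_data/"}]},
    }


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request(role_arn, database))
#   glue.start_crawler(Name="crawl-movies-data-s3")
```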

3. Create a Data Quality Check

We will create a Glue Data Quality check by defining multiple rules on top of the crawled table.
Run the data quality rules and check the outcome.
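
Glue Data Quality rules are written in DQDL (Data Quality Definition Language) as a single `Rules = [ ... ]` block. The sketch below shows how such a ruleset string could be assembled. The rules themselves and the column name `imdb_rating` are placeholders chosen for illustration; this README does not list the project's actual rules.

```python
def build_ruleset(rules):
    """Join individual DQDL rules into a 'Rules = [ ... ]' ruleset string."""
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical rules; "imdb_rating" is a placeholder column name.
EXAMPLE_RULES = [
    'IsComplete "imdb_rating"',
    'ColumnValues "imdb_rating" <= 10',
    "RowCount > 0",
]

if __name__ == "__main__":
    print(build_ruleset(EXAMPLE_RULES))
```

The resulting string can be pasted into the ruleset editor in the Glue console.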

4. Create Redshift output table & Glue Crawler

We will create an output table "imdb_movies_rating" in Redshift.
Create a Glue Crawler for the output table.

5. Create a Glue Job

We will create a Glue ETL Job "Movies-Data-Analysis" to perform the following tasks:
- Run the data quality check on the data with the defined rules
- Load the failed records into the bad-records S3 folder for review
- Load the data quality rule outcomes into S3
- Load the records that passed into the Redshift table
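
The routing step in the list above can be illustrated in plain Python. This is a stand-in for what the Glue job does with Spark, not the code in Glue_Job_Code.py. It assumes the row-level outcomes carry a `DataQualityEvaluationResult` column with the value `Passed` or `Failed`, which is how Glue's data quality transform labels rows; the sample rows are invented.

```python
RESULT_COLUMN = "DataQualityEvaluationResult"


def split_by_outcome(rows):
    """Split rows into (passed, failed) based on the row-level DQ result."""
    passed, failed = [], []
    for row in rows:
        target = passed if row.get(RESULT_COLUMN) == "Passed" else failed
        target.append(row)
    return passed, failed


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = [
        {"title": "Movie A", RESULT_COLUMN: "Passed"},
        {"title": "Movie B", RESULT_COLUMN: "Failed"},
    ]
    good, bad = split_by_outcome(sample)
    print(len(good), "row(s) for Redshift;", len(bad), "row(s) for bad_records")
```

Rows without a result are treated as failed here, so nothing unchecked reaches the warehouse; that choice is part of the sketch, not something this README specifies.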

6. Create an EventBridge Rule

We will create an EventBridge rule that is triggered by the Data Quality check execution and sends the output to an SNS topic.
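
An event pattern for such a rule might look like the sketch below. The `source` and `detail-type` values are the ones Glue Data Quality uses for its result events, to the best of my knowledge; the optional `state` filter is an assumption about the event's detail payload and should be checked against a real event before relying on it.

```python
import json


def data_quality_event_pattern(states=None):
    """Build an EventBridge pattern for Glue Data Quality result events.

    states: optional iterable such as ("FAILED",) to filter on the
    evaluation state (assumed field name: detail.state).
    """
    pattern = {
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }
    if states:
        pattern["detail"] = {"state": list(states)}
    return pattern


if __name__ == "__main__":
    print(json.dumps(data_quality_event_pattern(), indent=2))
```

The printed JSON can be pasted into the rule's event pattern editor, with the SNS topic set as the rule target.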

7. Test your Glue ETL Job

NOTE: Before running the Glue job, make sure the S3, Glue, and CloudWatch monitoring endpoints have been created in your VPC.
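
As a quick reference for the note above, the endpoint service names follow a regional naming scheme. The sketch below builds them for a given region; the region value is an assumption, since this README does not say where the pipeline runs.

```python
def endpoint_service_names(region):
    """Return the VPC endpoint service names for the three services in the note."""
    return {
        "s3": f"com.amazonaws.{region}.s3",
        "glue": f"com.amazonaws.{region}.glue",
        "cloudwatch_monitoring": f"com.amazonaws.{region}.monitoring",
    }


if __name__ == "__main__":
    # "us-east-1" is only an example region.
    for name, service in endpoint_service_names("us-east-1").items():
        print(name, "->", service)
```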

The Glue job ran successfully.
Failed data was loaded into the "bad_records" S3 folder.
Rule outcomes were loaded into the "rule_outcome_from_etl" S3 folder.
The final passed data was loaded into the Redshift "imdb_movies_rating" table.

8. Create a State Machine

We will create a State Machine "Movies-Data-Pipeline" using the Step Functions service. It runs the crawler and then executes the Glue job, sending an SNS notification on success or failure.
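
The shape of such a state machine can be sketched as an Amazon States Language definition built in Python. This is a simplified illustration, not the contents of State_Machine_code.json: the SNS topic ARN is a parameter because it is not given here, the messages are placeholders, and the crawler is only started, whereas a fuller workflow would also wait for it to finish before running the job.

```python
import json

CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]


def notify_state(topic_arn, message):
    """Build a terminal SNS publish state (message text is a placeholder)."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": topic_arn, "Message": message},
        "End": True,
    }


def state_machine_definition(topic_arn):
    """Return a simplified definition: crawler, then Glue job, then SNS."""
    return {
        "Comment": "Simplified sketch of Movies-Data-Pipeline",
        "StartAt": "StartCrawler",
        "States": {
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": "crawl-movies-data-s3"},
                "Catch": CATCH_ALL,
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job run to end.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "Movies-Data-Analysis"},
                "Catch": CATCH_ALL,
                "Next": "NotifySuccess",
            },
            "NotifySuccess": notify_state(topic_arn, "Pipeline succeeded"),
            "NotifyFailure": notify_state(topic_arn, "Pipeline failed"),
        },
    }


if __name__ == "__main__":
    # Hypothetical topic ARN, for illustration only.
    arn = "arn:aws:sns:us-east-1:123456789012:example-topic"
    print(json.dumps(state_machine_definition(arn), indent=2))
```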

9. Create an EventBridge Rule

We will create an EventBridge rule "movies-data-pipeline-trigger" that triggers the State Machine when a CSV file is created in the S3 bucket.
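
A pattern for this trigger could be sketched as follows. It matches S3 "Object Created" events for keys ending in `.csv` in the project bucket. Restricting the match to a particular folder is left out because this README does not say whether the rule does so, and the bucket needs EventBridge notifications turned on for these events to be delivered at all.

```python
import json


def s3_csv_created_pattern(bucket="movies-data-yb"):
    """Build an EventBridge pattern for new .csv objects in the bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"suffix": ".csv"}]},
        },
    }


if __name__ == "__main__":
    print(json.dumps(s3_csv_created_pattern(), indent=2))
```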

10. Run Final End to End Pipeline

We will upload the IMDb data CSV file to the input folder of the S3 bucket, which should trigger the pipeline.
The State Machine run completes successfully.
A success notification arrives at the subscribed email ID.
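
The upload that kicks off the run can also be scripted. The helper below is a hypothetical convenience wrapper, not part of the repository; the bucket, folder, and file name are the ones used earlier in this README, and the S3 client is passed in so the function can be exercised without AWS access.

```python
def upload_input_file(s3_client, local_path, bucket="movies-data-yb",
                      key="input_data/imbd_movies_ratings.csv"):
    """Upload the CSV to the input folder and return its S3 URI.

    s3_client is expected to behave like boto3.client("s3").
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   upload_input_file(boto3.client("s3"), "imbd_movies_ratings.csv")
```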
