This Capstone Project provides comprehensive hands-on experience with concepts such as data modeling and building a data warehouse.
The project uses two data sources (Netflix movies data and IMDb movies data). The primary purpose of the analysis is to answer the following questions:
- Ratio of movies to TV shows on Netflix
- Top 10 movie categories on Netflix
- Top 10 countries with the most movies/TV shows on Netflix
- Year-wise movie releases on Netflix
- Top 20 actors with the most movies/TV shows
- Top 20 directors with the most movies/TV shows
- Top 10 highest-rated comedies on Netflix
- etc.
The data is modeled as a star schema so that these analyses can be performed efficiently on the Netflix movies data.
A data warehouse is created in Amazon Redshift using Python. The Netflix movies data was downloaded from Kaggle, and the IMDb data comes from the official IMDb datasets. The raw data is stored on S3 and copied into Redshift. The database includes 7 dimension tables, 5 fact tables, and 3 staging tables.
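
To illustrate the S3-to-Redshift load, below is a minimal sketch of copying one staging table with a Redshift COPY command. The bucket path, IAM role ARN, and dwh.cfg section/key names are placeholders and assumptions, not the project's actual values.

```python
# Minimal sketch: copy the Netflix staging data from S3 into Redshift.
# The bucket path, IAM role ARN, and dwh.cfg section/key names below are
# placeholders/assumptions, not the project's actual values.
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

copy_staging_netflix = """
    COPY stagging_netflix_movies_data
    FROM 's3://<your-bucket>/netflix_titles.csv'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    CSV
    IGNOREHEADER 1;
"""

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],          # assumed section/key names
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
with conn, conn.cursor() as cur:
    cur.execute(copy_staging_netflix)
conn.close()
```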
This project contains the following files:
File | Description |
---|---|
create_tables.py | Drops and recreates the tables defined in sql_queries.py |
sql_queries.py | DDL scripts for all tables, plus the insert and select statements |
etl.py | Contains all the ETL operations that populate the dimension and fact tables |
Analytical_Queries.ipynb | Contains sample analytical queries with graph plots |
create_aws_resources.py | Creates the required AWS resources (e.g. IAM role, Redshift cluster, security group) |
delete_aws_resources.py | Deletes the AWS resources to avoid unnecessary billing |
dwh.cfg | Contains all configuration values (AWS key/secret, Redshift settings, S3 bucket paths, etc.) |
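
For reference, the following is a rough sketch of the create_tables.py pattern, assuming sql_queries.py exposes lists of drop and create statements; the list names and config section/key names shown here are assumptions, not necessarily the real ones.

```python
# Rough sketch of the create_tables.py pattern: drop and recreate every table
# defined in sql_queries.py. List names and config keys are assumptions.
import configparser

import psycopg2

from sql_queries import drop_table_queries, create_table_queries  # assumed list names


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    conn = psycopg2.connect(
        host=config["CLUSTER"]["HOST"],          # assumed section/key names
        dbname=config["CLUSTER"]["DB_NAME"],
        user=config["CLUSTER"]["DB_USER"],
        password=config["CLUSTER"]["DB_PASSWORD"],
        port=config["CLUSTER"]["DB_PORT"],
    )
    cur = conn.cursor()

    # Drop everything first, then create, so the script is rerunnable
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```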
The database is designed so that the questions above can be answered efficiently.
Data is gathered from two sources, Netflix data and IMDb data, to better analyze the movies available on Netflix; the IMDb data is incorporated so that movies can also be analyzed with respect to their IMDb ratings.
The data warehouse is maintained in AWS Redshift, which is optimized for fast read operations and handles heavy workloads well.
Currently a fixed data source is supplied from S3 and the ETL process runs against it; whenever the data source changes, the ETL process must be rerun so that updated data is available for analysis.
The database on AWS Redshift can be scaled up to multiple nodes if 100+ people need to access it concurrently.
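
Scaling is handled at cluster-creation time. Below is a hedged sketch of how this could be done with boto3; the cluster identifier, node type, region, IAM role ARN, and config section/key names are placeholders, and create_aws_resources.py may differ in detail.

```python
# Hedged sketch: create a multi-node Redshift cluster with boto3.
# Identifiers, region, node type, and config keys are placeholders/assumptions.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                        # assumed region
    aws_access_key_id=config["AWS"]["KEY"],         # assumed section/key names
    aws_secret_access_key=config["AWS"]["SECRET"],
)

redshift.create_cluster(
    ClusterType="multi-node",                  # "single-node" would skip NumberOfNodes
    NodeType="dc2.large",                      # assumed node type
    NumberOfNodes=4,                           # scale this up for heavier concurrency
    ClusterIdentifier="netflix-dwh-cluster",   # placeholder identifier
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
    IamRoles=["<your-redshift-iam-role-arn>"],  # placeholder ARN
)
```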
Because the IMDb data is much larger than the Netflix data, and matching the two datasets by movie title produces many duplicates, the scope of the project is narrowed: only titles released after 2000 are kept in the staging table, and only movie and tvSeries records are considered for the analysis.
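
The following is a sketch of the kind of filter this implies when moving IMDb rows out of staging. The IMDb column names (tconst, titleType, primaryTitle, startYear) follow the official title.basics dataset; the target table's columns are illustrative assumptions, and startYear is assumed to be stored as an integer in staging.

```python
# Sketch of the scope-narrowing filter applied when loading from the IMDb
# staging table: keep only titles released after 2000 and only 'movie' /
# 'tvSeries' records. Target columns are illustrative assumptions.
filter_imdb_staging = """
    INSERT INTO dim_movie_title (title_id, title, release_year)
    SELECT DISTINCT tconst, primaryTitle, startYear
    FROM stagging_imdb_movies_data
    WHERE startYear > 2000
      AND titleType IN ('movie', 'tvSeries');
"""
```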
By design, the database uses a star schema with the following tables (an example query against this schema is shown after the lists below):
7 dimension tables:
- dim_type
- dim_country
- dim_date
- dim_director
- dim_actor
- dim_movie_title
- dim_category
5 fact tables:
- fact_movies_added_on_netflix
- fact_movies_direction_on_netflix
- fact_movies_actor_on_netflix
- fact_movies_country_on_netflix
- fact_movies_category_on_netflix
3 staging tables:
- stagging_netflix_movies_data
- stagging_imdb_movies_data
- stagging_imdb_movies_rating_data
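
As an example of how this schema supports the questions listed above, here is a hedged sketch of one analytical query (top 10 categories by number of titles). The join keys, column names, and config keys are assumptions; the actual queries live in Analytical_Queries.ipynb.

```python
# Hedged sketch: top 10 categories by number of titles, joining a fact table
# to its dimension. Column names and config keys are assumptions.
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")
conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],          # assumed section/key names
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)

top_categories = """
    SELECT c.category, COUNT(*) AS titles
    FROM fact_movies_category_on_netflix f
    JOIN dim_category c ON f.category_id = c.category_id
    GROUP BY c.category
    ORDER BY titles DESC
    LIMIT 10;
"""
with conn, conn.cursor() as cur:
    cur.execute(top_categories)
    for category, titles in cur.fetchall():
        print(category, titles)
conn.close()
```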
Install the dependencies and create the AWS resources:
pip install -r requirements.txt
# First set the AWS configurations in the dwh.cfg file
python create_aws_resources.py
To run the ETL pipeline, run the following commands:
# Set the Redshift-related configuration in the dwh.cfg file
python create_tables.py
python etl.py
# Delete all the AWS resources
python delete_aws_resources.py
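
For reference, here is a minimal sketch of the kind of cleanup delete_aws_resources.py performs, with placeholder cluster and role identifiers; the actual script may differ.

```python
# Minimal cleanup sketch (placeholder identifiers): delete the Redshift
# cluster without a final snapshot, then detach the S3 read policy and
# delete the IAM role.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # assumed region
iam = boto3.client("iam")

redshift.delete_cluster(
    ClusterIdentifier="netflix-dwh-cluster",   # placeholder identifier
    SkipFinalClusterSnapshot=True,
)

iam.detach_role_policy(
    RoleName="netflix-dwh-role",               # placeholder role name
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
iam.delete_role(RoleName="netflix-dwh-role")
```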
Example analytical queries are listed in Analytical_Queries.ipynb.
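
A notebook cell in that style might look like the following hedged sketch, which computes the movies vs. TV shows split and plots it; the table, column, and config key names are assumptions.

```python
# Hedged notebook-style sketch: query the movies vs. TV shows split and plot
# it. Table, column, and dwh.cfg key names are assumptions.
import configparser

import matplotlib.pyplot as plt
import pandas as pd
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")
conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],          # assumed section/key names
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)

query = """
    SELECT t.type, COUNT(*) AS titles
    FROM fact_movies_added_on_netflix f
    JOIN dim_type t ON f.type_id = t.type_id
    GROUP BY t.type;
"""
df = pd.read_sql(query, conn)
df.plot(kind="bar", x="type", y="titles", legend=False,
        title="Movies vs TV shows on Netflix")
plt.tight_layout()
plt.show()
conn.close()
```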