
Movies-ETL

Create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. Refactor the code into one function that takes in the three files (Wikipedia data, Kaggle metadata, and the MovieLens ratings data) and performs the ETL process, adding the data to a PostgreSQL database.

Tools: Python, Pandas, ETL, PostgreSQL, SQL, regex

Overview

Scrape movie data from the internet and extract, transform, and load the datasets. Clean the files, merge the dataframes, and publish to a SQL database. The code creates functions, uses list comprehensions, and applies regular expressions to clean the web data. After cleaning the data step by step, the code was refactored into a single function that takes in the three files, automating the pipeline. The beginning of the function loads the three files and builds the first DataFrame using the following code:

import json
import pandas as pd

def data_function(wiki_file, kaggle_file, ratings_file):
    # Read the Kaggle metadata and MovieLens ratings CSVs into DataFrames.
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)

    # Open and read the Wikipedia data JSON file.
    with open(wiki_file, mode='r') as file:
        wiki_movies_raw = json.load(file)

    # Use a list comprehension to filter out TV shows.
    wiki_movies = [movie for movie in wiki_movies_raw if 'No. of episodes' not in movie]

    # Iterate through the filtered wiki movies list and call the
    # clean_movie function on each movie.
    clean_movies = [clean_movie(movie) for movie in wiki_movies]

    # Read the cleaned movies list into a DataFrame.
    wiki_movies_df = pd.DataFrame(clean_movies)
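
The clean_movie helper is not shown in this excerpt. A minimal sketch of the idea, assuming its job is to consolidate Wikipedia's many alternate-title keys into a single alt_titles field (the key list below is abbreviated for illustration):

def clean_movie(movie):
    # Work on a copy so the raw record is left untouched.
    movie = dict(movie)
    alt_titles = {}
    # Abbreviated list of alternate-title keys; the full version would
    # enumerate every such key found in the raw Wikipedia data.
    for key in ['Also known as', 'Japanese', 'French', 'Hangul']:
        if key in movie:
            alt_titles[key] = movie.pop(key)
    if alt_titles:
        movie['alt_titles'] = alt_titles
    return movie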

The function continues with a try/except block to catch errors while extracting the IMDb ID, the key on which the datasets are merged further down the code block. Regular expressions are used to parse and clean the messy box office, budget, and release-date strings; redundant columns are dropped, and for each field the best data is kept from either the Wikipedia files or the Kaggle data. The code then merges the dataframes. Because the ratings dataset is so large, it is sent to PostgreSQL in chunks.
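
The exact extraction code is not reproduced here; a sketch of the try/except ID step, assuming the Wikipedia DataFrame has an imdb_link column holding full IMDb URLs:

    try:
        # Pull the seven-digit IMDb ID (e.g. tt1234567) out of each link
        # and drop any rows that end up sharing the same ID.
        wiki_movies_df['imdb_id'] = wiki_movies_df['imdb_link'].str.extract(r'(tt\d{7})')
        wiki_movies_df.drop_duplicates(subset='imdb_id', inplace=True)
    except Exception as e:
        print(e)

Finally, the data is loaded into a SQL database with two tables using the following code: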

 db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/movie_data"
    engine = create_engine(db_string)
    movies_df.to_sql(name='movies', con=engine, if_exists='replace')
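
Only the movies table load appears above. A sketch of the chunked ratings import described earlier, assuming the same engine and a chunk size of one million rows:

    rows_imported = 0
    # Read the large ratings file in chunks so it fits in memory,
    # appending each chunk to the 'ratings' table.
    for data in pd.read_csv(ratings_file, chunksize=1000000):
        data.to_sql(name='ratings', con=engine, if_exists='append')
        rows_imported += len(data)
        print(f'{rows_imported} rows imported')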

The ETL process is important for ensuring that good data is preserved as much as possible while bad data is eliminated. During cleaning, the analyst must choose which data to keep and which data does not add to the analysis. The data needs to be transformed into a more usable format, and the three dataframes are merged at the end for comparison. Refactoring the code and creating functions improves the process.
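
With the refactor in place, the whole pipeline runs with a single call; the file paths below are placeholders for the actual Resources files:

data_function('Resources/wikipedia-movies.json',
              'Resources/movies_metadata.csv',
              'Resources/ratings.csv')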

Contact

Contact Emily
