
Movies-ETL

Create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. Refactor the code into one function that takes in the three files (Wikipedia data, Kaggle metadata, and the MovieLens ratings data) and performs the ETL process, adding the data to a PostgreSQL database.

Tools: Python, Pandas, ETL, PostgreSQL, SQL, regex

Overview

Scrape movie data from the internet and extract, transform, and load the datasets. Clean the files, merge the dataframes, and publish to a SQL database. The code creates functions, uses list comprehensions, and applies regular expressions to clean the web data. After cleaning the data step by step, the code was refactored into a single function that takes in the three files, automating the pipeline. The beginning of the function loads the three files and builds the first DataFrame using the following code:

import json
import pandas as pd

def data_function(wiki_file, kaggle_file, ratings_file):
    # Read the Kaggle metadata and MovieLens ratings CSVs into DataFrames.
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)

    # Open and read the Wikipedia data JSON file.
    with open(wiki_file, mode='r') as file:
        wiki_movies_raw = json.load(file)

    # Use a list comprehension to filter out TV shows.
    wiki_movies = [movie for movie in wiki_movies_raw if 'No. of episodes' not in movie]

    # Iterate through the filtered wiki movies list and call the
    # clean_movie function on each movie.
    clean_movies = [clean_movie(movie) for movie in wiki_movies]

    # Read the cleaned movies list into a DataFrame.
    wiki_movies_df = pd.DataFrame(clean_movies)
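
The clean_movie helper is not shown in this excerpt. A minimal sketch of the idea, assuming its job is to consolidate Wikipedia's many alternate-title keys into a single alt_titles field (the key list below is abbreviated for illustration):

def clean_movie(movie):
    # Work on a copy so the raw record is left untouched.
    movie = dict(movie)
    alt_titles = {}
    # Abbreviated list of alternate-title keys; the full version would
    # enumerate every such key found in the raw Wikipedia data.
    for key in ['Also known as', 'Japanese', 'French', 'Hangul']:
        if key in movie:
            alt_titles[key] = movie.pop(key)
    if alt_titles:
        movie['alt_titles'] = alt_titles
    return movie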

The function continues with a try/except block to catch errors while extracting the IMDb ID, the key on which the datasets are merged further down the code block. Regular expressions are used to parse and clean the messy box office, budget, and release-date strings; redundant columns are dropped, and for each field the best data is kept from either the Wikipedia files or the Kaggle data. The code then merges the dataframes. Because the ratings dataset is so large, it is sent to PostgreSQL in chunks.
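
The exact extraction code is not reproduced here; a sketch of the try/except ID step, assuming the Wikipedia DataFrame has an imdb_link column holding full IMDb URLs:

    try:
        # Pull the seven-digit IMDb ID (e.g. tt1234567) out of each link
        # and drop any rows that end up sharing the same ID.
        wiki_movies_df['imdb_id'] = wiki_movies_df['imdb_link'].str.extract(r'(tt\d{7})')
        wiki_movies_df.drop_duplicates(subset='imdb_id', inplace=True)
    except Exception as e:
        print(e)

Finally, the data is loaded into a SQL database with two tables using the following code: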

 db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/movie_data"
    engine = create_engine(db_string)
    movies_df.to_sql(name='movies', con=engine, if_exists='replace')
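
Only the movies table load appears above. A sketch of the chunked ratings import described earlier, assuming the same engine and a chunk size of one million rows:

    rows_imported = 0
    # Read the large ratings file in chunks so it fits in memory,
    # appending each chunk to the 'ratings' table.
    for data in pd.read_csv(ratings_file, chunksize=1000000):
        data.to_sql(name='ratings', con=engine, if_exists='append')
        rows_imported += len(data)
        print(f'{rows_imported} rows imported')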

The ETL process is important for ensuring that good data is preserved as much as possible while bad data is eliminated. During cleaning, the analyst must choose which data to keep and which data does not add to the analysis. The data needs to be transformed into a more usable format, and the three dataframes are merged at the end for comparison. Refactoring the code and creating functions improves the process.
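
With the refactor in place, the whole pipeline runs with a single call; the file paths below are placeholders for the actual Resources files:

data_function('Resources/wikipedia-movies.json',
              'Resources/movies_metadata.csv',
              'Resources/ratings.csv')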

Contact

Contact Emily
