Author: Erin James Wills - [email protected]
Photo by Mika Baumeister on Unsplash
Using Spark and Amazon RDS to clean and summarize amazon reviews to determine usefulness of product feedback.
This repository does the following:
- Performs an ETL process and loads data to an AWS Postgres database. Detailed instructions are found in the
amazon_ETL.ipynb
notebook. - Shows how to query the database
- Extracts data from the 30+ Amazon datasets make a table of all the data. This process may take up to 4.5 hours.
- Analyzes the vine reviewer difference compared to other reviewers.
Looking only at the Health and Personal Product dataset, it appears that Vine reviewer star ratings are no different than other members. The only significant difference is the effect of splitting the data based on the product purchase verfication status. This is a small sample size so this result is not being considered significant.
More analysis will be performed within the amazon_advance_anlaysis.ipynb
. Instead of looking at the macro scale results; I am curious if I can find several groupings of data that may make up similar reviewers or determine if I can standardize the data such that one users ratings are standardized to be more like other members rankings.
- Google Colaboratory
- pySpark
- Amazon RDS
- indendent t-test
- Wilcoxon Rank test
- Histograms
All datasets can be found at https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
- Clone the repo to your local machine
- Load
amazon_ETL
to Google Colaboratory - Login to Amazon Console and create an AWS RDS PostgreSQL.
- Make the database public
- Name the database
amazon_rds
- Record the db endpoint, username and password.
- Make sure the database is running and if not, start the database.
- Add the recorded information to
config.py
- Open pgAdman and create a server connection. Use the recorded information to create the database connection.
- Open a query tool and run the
schema.sql
file to create the database tables. - Verify that the tables were created in pgAdmin.
- Run the notebook
amazon_ETL
in Google Colab. Select theconfig.py
file when prompted to load a file. - No other changes are probably needed and should take less than 30 minutes to run.
- In the cloned repo is also several files used to analyze the data.
- The
amazon_vine_users
is a notebook that extracts review data from all the amazon datasets. This notebook includes information about the total reviews, the number of vine reviews, and the vine reviewers. - The
amazon_query
notebook shows how to connect to the database and perform queries. - The
amazon_vine_program_analysis
does a simple analysis of whether vine members review content different than non-vine members. - Additional analysis is being performed in the
amazon_advance_analysis
notebook. This is ongoing work where I look at specific reviewer behavior.
Note: Remember to shutdown and delete any services so not to incur any service fees.
Example of the review types for all datasets: