"Nice laser pointer. Bought it almost a year ago now and it still shines just as brightly as it did the first day. My cat loves it!"
Rating: 5.0/5.0
"This shredder is extremely loud. Good for about 4-5 papers at a time. You may want to look for something better, or let me know if you want to buy mine :("
Rating: 2.0/5.0
- This project demonstrates a complete text classification ML pipeline using Amazon reviews from the Office Products category. The primary goal is to predict the sentiment (positive or negative) of a given product review. Below is a high-level overview of each step, from data acquisition to model comparison.
- Overview
- Installation
- Data Preparation
- Data Exploration & Visualization
- Data Preprocessing
- Modeling
- Results Comparison
- How to Run
- Acknowledgments
- Data Source: Amazon Reviews (Office Products category)
- Techniques: Data cleaning, TF-IDF feature extraction, and multiple machine learning classifiers (Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes)
- Goal: Predict whether a review is positive (rating > 3) or negative (rating < 3)
- The Amazon Office Products Reviews are downloaded and stored locally.
- The file is then read into a DataFrame, keeping only the `review_body` and `star_rating` columns.
- Convert ratings to integers and remove invalid or missing entries.
- Results in a cleaned DataFrame with only valid reviews (ratings between 1 and 5).
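The loading and cleaning steps above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `load_reviews` is hypothetical, and the tab-separated layout is an assumption about the downloaded file. A tiny in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read a reviews file and keep only rows with a valid 1-5 star rating.

    Assumption: the file is tab-separated and has review_body and
    star_rating columns.
    """
    df = pd.read_csv(
        source,
        sep="\t",
        usecols=["review_body", "star_rating"],
        dtype=str,  # read everything as text first, then convert explicitly
    )
    # Unparseable ratings become NaN instead of raising an error.
    df["star_rating"] = pd.to_numeric(df["star_rating"], errors="coerce")
    # Drop rows with a missing review or a missing/invalid rating.
    df = df.dropna(subset=["review_body", "star_rating"])
    # Keep only whole-star ratings from 1 to 5.
    df = df[df["star_rating"].isin([1, 2, 3, 4, 5])].copy()
    df["star_rating"] = df["star_rating"].astype(int)
    return df.reset_index(drop=True)


# Hypothetical sample data: two valid rows and three invalid ones.
sample = (
    "review_body\tstar_rating\n"
    "Great pens\t5\n"
    "Broke quickly\t1\n"
    "No rating\t\n"
    "Odd value\tabc\n"
    "Out of range\t9\n"
)
reviews = load_reviews(io.StringIO(sample))
print(reviews)
```

Passing a file path instead of the `StringIO` object would read from disk in the same way.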
- Randomly sample and print a few reviews.
- Calculate and display:
- Value counts of each rating
- Percentages, mean, median, and standard deviation of ratings
- Plot the distribution of star ratings.
This step provides an understanding of the dataset’s rating distribution and typical review content.
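As a rough sketch of the statistics listed above, the snippet below computes them with pandas on a small made-up list of ratings. The values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up ratings used purely for illustration.
ratings = pd.Series([5, 5, 4, 1, 3, 5, 2, 4], name="star_rating")

# Number of reviews per star rating, ordered from 1 to 5.
counts = ratings.value_counts().sort_index()

# Share of each rating as a percentage of all reviews.
percentages = (counts / counts.sum() * 100).round(1)

# Central tendency and spread of the ratings.
summary = {
    "mean": ratings.mean(),
    "median": ratings.median(),
    "std": ratings.std(),
}

print(counts)
print(percentages)
print(summary)
```

With matplotlib installed, `counts.plot(kind="bar")` would draw the rating distribution from the same `counts` Series.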
- Ratings > 3 → 1 (positive)
- Ratings < 3 → 0 (negative)
- Ratings = 3 → 2 (neutral, which is then removed for binary classification)
- Sample 100,000 positive and 100,000 negative reviews to ensure class balance.
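The labeling and balancing rules above might look like this in pandas. The function name `label_and_balance`, the `label` column, and the fixed seed are assumptions made for the example; the toy data uses 2 reviews per class where the project uses 100,000.

```python
import pandas as pd


def label_and_balance(df, n_per_class, seed=42):
    """Map star ratings to sentiment labels and draw a balanced sample."""
    # > 3 -> 1 (positive), < 3 -> 0 (negative), == 3 -> 2 (neutral).
    labels = df["star_rating"].map(lambda r: 1 if r > 3 else (0 if r < 3 else 2))
    df = df.assign(label=labels)
    # Neutral reviews are dropped for binary classification.
    df = df[df["label"] != 2]
    # sample() raises ValueError if a class has fewer than n_per_class rows.
    positive = df[df["label"] == 1].sample(n=n_per_class, random_state=seed)
    negative = df[df["label"] == 0].sample(n=n_per_class, random_state=seed)
    # Shuffle so the two classes are interleaved.
    combined = pd.concat([positive, negative])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


toy = pd.DataFrame(
    {
        "review_body": [f"review {i}" for i in range(10)],
        "star_rating": [5, 4, 5, 3, 1, 2, 1, 3, 4, 2],
    }
)
balanced = label_and_balance(toy, n_per_class=2)
print(balanced)
```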
- Convert text to lowercase.
- Expand contractions.
- Remove HTML tags, URLs, non-alphabetical characters, and extra spaces.
- Remove stopwords (e.g., “the”, “and”, “a”).
- Lemmatize words (e.g., “builds” → “build”).
The resulting cleaned text is stored in a new column.
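The cleaning steps can be illustrated with the standard library alone. This is only a sketch: `CONTRACTIONS`, `STOPWORDS`, and `LEMMAS` are tiny hypothetical lookup tables standing in for whatever contraction list, stopword list, and lemmatizer the project actually uses, and `clean_review` is an illustrative name.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"the", "and", "a", "is", "it", "this", "to", "of"}
LEMMAS = {"builds": "build", "papers": "paper", "loves": "love"}


def clean_review(text):
    """Apply the cleaning steps in the order listed above."""
    text = text.lower()
    # Expand contractions before punctuation is stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # non-alphabetical characters
    # split() also collapses runs of whitespace.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Toy lemmatization: dictionary lookup, unknown words pass through unchanged.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return " ".join(tokens)


cleaned = clean_review("It's GREAT!<br/>My cat loves it, see https://example.com/x")
print(cleaned)
```

A real lemmatizer relies on a lexicon rather than a hand-written table, so the output on actual reviews would differ from this toy version.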
- Stratified 80/20 split to maintain balanced class proportions.
- Convert cleaned text into a matrix of TF-IDF features.
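A minimal sketch of the split and feature extraction with scikit-learn follows. The ten toy reviews and the default `TfidfVectorizer` settings are assumptions. Fitting the vectorizer on the training split only is a common choice that keeps test vocabulary out of training, but the source does not say how the project does it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy cleaned reviews: five positive (1) and five negative (0).
texts = [
    "great pen smooth ink", "love sturdy stapler", "excellent paper quality",
    "perfect desk organizer", "bright laser pointer cat love",
    "shredder extremely loud", "ink dried fast bad", "stapler jam broke",
    "paper thin tear", "poor quality waste money",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# stratify=labels keeps the class ratio the same in both splits.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer()
# Learn vocabulary and IDF weights from the training reviews only.
X_train = vectorizer.fit_transform(X_train_text)
# Reuse the fitted vocabulary for the test reviews.
X_test = vectorizer.transform(X_test_text)

print(X_train.shape, X_test.shape)
```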
Multiple machine learning classifiers are trained and compared:
- Perceptron
- A linear classifier that updates weights on misclassifications.
- Support Vector Machine (SVM)
- Uses a linear kernel to find the best separating hyperplane with maximum margin.
- Logistic Regression
- Fits a sigmoid (logistic) function to predict probabilities for each class.
- Multinomial Naive Bayes
- Probabilistic approach that assumes features are conditionally independent given the class.
- Evaluation functions calculate and print classification metrics.
- Visualization functions display a comparison chart of metrics for all models on both training and testing sets.
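Training and scoring the four classifiers could be sketched as below. The toy data, the `evaluate` helper, and the hyperparameters are assumptions for illustration. `LinearSVC` is used here as one way to get a linear SVM; the project may use a different class such as `SVC(kernel="linear")`. A printed table stands in for the comparison chart.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def evaluate(model, X, y):
    """Hypothetical helper: return the four metrics for one model and split."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred, zero_division=0),
        "recall": recall_score(y, pred, zero_division=0),
        "f1": f1_score(y, pred, zero_division=0),
    }


# Toy cleaned reviews (1 = positive, 0 = negative).
train_texts = [
    "great pen love", "excellent stapler sturdy", "love paper quality",
    "great organizer perfect", "bad shredder loud", "terrible ink dried",
    "broke stapler bad", "poor quality waste",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["love great stapler", "excellent quality pen",
              "bad loud waste", "terrible poor ink"]
y_test = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Perceptron": Perceptron(random_state=42),
    "SVM": LinearSVC(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train": evaluate(model, X_train, y_train),
        "test": evaluate(model, X_test, y_test),
    }

# Plain-text comparison of every model on both splits.
for name, scores in results.items():
    for split, metrics in scores.items():
        row = ", ".join(f"{k}={v:.2f}" for k, v in metrics.items())
        print(f"{name:<24} {split:<5} {row}")
```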
After training, each model’s performance is evaluated on both training and testing data. The following metrics are computed and compared:
Definition: Accuracy measures the proportion of correctly predicted instances (both positive and negative) out of the total instances.
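Formula: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ are the counts of true positives, true negatives, false positives, and false negatives.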
Use Case: Useful when the dataset is balanced (similar numbers of positive and negative instances).
Definition: Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive (i.e., of all the positive predictions made by the model, what percentage are truly positive?).
Formula: $\text{Precision} = \frac{TP}{TP + FP}$, where $TP$ is the number of true positives and $FP$ the number of false positives.
Use Case: Important in scenarios where minimizing false positives is critical (e.g., spam detection).
Definition: Recall (or Sensitivity) measures the proportion of correctly predicted positive instances out of all actual positive instances (i.e., of all the truly positive instances the model was tested on, what percentage did it correctly identify as positive?).
Formula: $\text{Recall} = \frac{TP}{TP + FN}$, where $TP$ is the number of true positives and $FN$ the number of false negatives.
Use Case: Crucial in scenarios where minimizing false negatives is important (e.g., disease detection).
Definition: F1-Score is the harmonic mean of Precision and Recall, providing a single metric that balances both.
Formula: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Use Case: Useful when there is an imbalance between classes and you want a trade-off between Precision and Recall.
Together, these metrics allow a quick comparison of how well each classifier predicts sentiment on unseen reviews.
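To make the four definitions concrete, here is a small worked example that computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Of everything predicted positive, how much was truly positive?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Of everything truly positive, how much did the model find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 80 true positives, 70 true negatives,
# 30 false positives, 20 false negatives.
metrics = classification_metrics(tp=80, tn=70, fp=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 150/200 = 0.75, precision is 80/110 ≈ 0.727, recall is 80/100 = 0.8, and F1 is 160/210 ≈ 0.762.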