Benchmarking Various Algorithms for Review Rating Prediction and Sentiment Analysis

This is the readme file for the bachelor's thesis written by Sebastian Radu Herman, 01404790.

Structure

There are four main folders in this repository:

Benchmark Results: this folder contains Excel sheets for all the benchmarks we performed. Inside the folder you will find two Excel sheets, called 2 ratings total and 3 ratings total, which contain the classification reports from all classifiers combined into a single sheet. The two subfolders contain the individual results, grouped by the type of text engineering method we used. The files are named according to their content.
Dataset: this folder contains the two datasets we used for our project. One contains the data as it was extracted from the database, the other contains the processed text.
Notebooks:

Scikit: this folder contains two subfolders, one with the notebooks for 2 rating classification and one with the notebooks for 3 rating classification. Each of these contains 6 subfolders, and each subfolder holds the notebook that was used for classifying the reviews according to the subfolder's name. For example, simple_tfidf stands for TFIDF + unigram.
Text Preprocessing: this folder contains the Jupyter Notebook which was used for text pre-processing.

Sql Queries: this folder contains the SQL script that was used for extracting the data out of the database provided by Prof. Bing Liu.

Prerequisites

Python version 3.5 or higher, preferably installed using Anaconda
Jupyter Notebook or Lab
Python Packages: string and re (both part of the Python standard library), pandas, contractions, NLTK, scikit-learn, matplotlib, seaborn, wordcloud
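
Before opening the notebooks, it can help to check that the packages listed above are available in the active environment. The following is a minimal sketch, not part of the repository: the import names in REQUIRED_MODULES (for example sklearn for scikit-learn and nltk for NLTK) and the helper missing_modules are assumptions made for illustration.

```python
from importlib.util import find_spec

# Assumed import names for the third-party packages listed above.
# string and re ship with Python itself, so they are not checked here.
REQUIRED_MODULES = [
    "pandas",
    "contractions",
    "nltk",        # NLTK
    "sklearn",     # scikit-learn
    "matplotlib",
    "seaborn",
    "wordcloud",
]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Any package reported as missing can then be installed with Anaconda or pip before running the notebooks.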

Running the Notebooks

Running the notebooks is straightforward. For both the text preprocessing notebook and the classifier notebooks, the only thing that has to be changed is the path to the csv file, which is set in the first few lines of code. After changing it to the correct path on your machine, you should be able to run the notebook without issues. However, if you want to run a notebook, I recommend duplicating the file and running the copy. The classifier notebooks take quite a long time to run, and the ones provided here contain my results, so re-running them in place would overwrite those outputs.
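
Duplicating a notebook before running it can also be done from Python. The sketch below is only an illustration: the helper duplicate_notebook, the _copy suffix, and the file name classifier_notebook.ipynb are hypothetical and do not come from the repository.

```python
import shutil
from pathlib import Path


def duplicate_notebook(notebook_path, suffix="_copy"):
    """Copy a notebook next to the original and return the path of the copy."""
    source = Path(notebook_path)
    target = source.with_name(source.stem + suffix + source.suffix)
    if target.exists():
        # Refuse to overwrite an earlier copy that may hold its own results.
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Hypothetical notebook name; replace it with a real path from the Notebooks folder.
    example = Path("classifier_notebook.ipynb")
    if example.exists():
        print("Work on:", duplicate_notebook(example))
```

Running the copy leaves the outputs stored in the original notebook untouched.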
