Skip to content

Project for STATS 503 at University of Michigan, where we were tasked to analyze a public dataset using statistical modeling techniques in R.

Notifications You must be signed in to change notification settings

victle/online-shopper-behavior

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 

Repository files navigation

Online Shopping Behavior That Leads to Revenue

Project for STATS 503 (Multivariate Regression) at University of Michigan, where we were tasked to analyze a public dataset using statistical modeling techniques in R.

Project Goal

Determine the features most likely to lead to a sale from the Online Shoppers Purchasing Intention dataset from UCI's ML archive. This README summarizes the major steps and insights, but the finer details will be in models.md. An Rpubs version can be found here.

Libraries

Many of the modeling techniques are taken from the following packages:

  • e1071 (for Naive Bayes and SVM)
  • class (for KNN)

Loading and Preprocessing Data

A lot of the columns have to be changed to represent a categorical variable.

image

It's important that, when doing a 70-30 split on the data, there is a balance in the the training and testing sets.

Exploratory Data Analysis

Modeling

KNN Cross-validation

image

Decision Trees

unnamed-chunk-5-1

Feature Importance

unnamed-chunk-6-1

Classification Models Used

  • Logistic Regression
    • Reduced Logistic Regression
  • Naive Bayes
  • LDA
  • SVM
  • Random Forest
  • Adaboost

Table of Model Performance

Training Error Testing Error
Naive Bayes 0.1887601 0.1837838
Logistic Regression 0.1165701 0.1132739
KNN 0.1093859 0.1091892
SVM 0.0936269 0.1027027
Random Forest 0.0000000 0.0918919
Adaboost 0.0085747 0.1035135

Comments on PageValue as a feature

PageValues is an incredibly important feature in this dataset for predicting revenue. The biggest problem here is that PageValues is computed by dividing the revenue of a shopping instance by the number of views a page gets. This inherently encodes revenue into the problem, and does not really reflect the volitional behavior of the shopper. In fact, PageValue by itself does a great job at classifying Revenue on its own (achieving about 10% error on the testing set).

About

Project for STATS 503 at University of Michigan, where we were tasked to analyze a public dataset using statistical modeling techniques in R.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published