
Predictive models using Reddit comment data from October 2018 to January 2019 to identify whether or not a post is `stickied` or contains a `no_follow` link using SparkML.


aquevedo93/Reddit_SparkML


Predictive Analysis using Reddit Comment Data and SparkML

Big Data project using Reddit comment data from October 2018 to January 2019 (15 features and 433,521,422 observations) to develop predictive models that identify whether a post is stickied or contains a no_follow link.

  • Stickied posts are highlighted posts that stay at the top of the subreddit (usually one or two posts).
  • Nofollow links are links marked so that search engine bots do not follow them.

Because both targets are classification problems, Logistic Regression, Random Forest, and Gradient Boosted Tree models were used, with predictors reflecting post relevance, user involvement, and post characteristics.
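As a rough illustration of how the three model families could be fit for one target with SparkML, here is a minimal sketch. It is not the notebook's actual code: the predictor column names in `FEATURE_COLS`, the helper `fit_and_score`, the 80/20 split, and the choice of AUC as the metric are all assumptions made for this example, and it assumes Spark 3.x with an active SparkSession.

```python
# Hypothetical column names; the notebook's actual predictors may differ.
FEATURE_COLS = ["score", "gilded", "controversiality", "body_length"]
LABEL_COLS = ["stickied", "no_follow"]


def fit_and_score(df, label_col, feature_cols=FEATURE_COLS, seed=42):
    """Fit three classifiers for one binary target; return {model name: test AUC}.

    `df` is a Spark DataFrame, so an active SparkSession must already exist.
    """
    # Imported here so the constants above can be read without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (
        GBTClassifier, LogisticRegression, RandomForestClassifier)
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import functions as F

    # Spark ML classifiers expect a numeric label, so cast the boolean flag
    # and drop rows where the label is missing.
    data = (df.withColumn("label", F.col(label_col).cast("double"))
              .na.drop(subset=["label"]))
    # Assumed 80/20 train/test split.
    train, test = data.randomSplit([0.8, 0.2], seed=seed)

    # Combine the predictor columns into a single vector column;
    # rows with null or NaN predictors are skipped.
    assembler = VectorAssembler(
        inputCols=list(feature_cols), outputCol="features", handleInvalid="skip")
    classifiers = {
        "logistic_regression": LogisticRegression(labelCol="label"),
        "random_forest": RandomForestClassifier(labelCol="label"),
        "gradient_boosted_trees": GBTClassifier(labelCol="label"),
    }
    evaluator = BinaryClassificationEvaluator(
        labelCol="label", metricName="areaUnderROC")

    scores = {}
    for name, clf in classifiers.items():
        model = Pipeline(stages=[assembler, clf]).fit(train)
        scores[name] = evaluator.evaluate(model.transform(test))
    return scores
```

With a DataFrame of comments loaded as, say, `comments`, calling `fit_and_score(comments, "stickied")` and `fit_and_score(comments, "no_follow")` would give one score per model for each target.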

Spark was used throughout the project for two reasons: it performs well in both running time and developer productivity (important since we are dealing with 500 GB of data), and it supports iteration, which is essential for the machine learning tasks required in this assignment.

Files Included

  • project_aq38.ipynb:

    • Data Cleaning
    • Exploratory Data Analysis
    • Data Visualization
    • Machine Learning Models
  • Project.txt: Writeup including analysis and future work
