Tweets classification - EPFL CS-433

This repo contains the code and instructions needed to classify tweets as containing ':)' or ':('. The corresponding Kaggle competition was part of the CS-433 Machine Learning course at EPFL. Our team is Martian Jaggirnauts.

Directory Tree description

data folder which should be populated as described below

slang_dict_parsing contains code that scraped the noslang website for slang words; it did not improve accuracy, so it is not used

src folder containing the main code, run.py, and the models

templates_course contains the default code provided with the project

Design decisions

How to run the project and TRAIN the models

*nix-friendly guide; for other platforms, some steps might differ.

Running time: the current model took around 12 hours to train on an 8-core CPU with 60 GB of RAM and a Tesla K80 GPU. A GPU is highly recommended.

  1. Clone this repo
$ git clone https://github.com/m-doru/tweets-sentiment-analysis.git
$ cd tweets-sentiment-analysis
  2. Install fastText v0.1.0 with the Python build. After this step, the following should work:
$ python3
>>> import fasttext
>>>
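
For reference, here is a minimal sketch of training and querying a supervised fastText classifier from Python, assuming the old fasttext Python wrapper's API (fasttext.supervised / classifier.predict) and a hypothetical training file in which each tweet is prefixed with __label__pos or __label__neg; the project's actual training pipeline lives in src/:

import fasttext

# Hypothetical file: one tweet per line, prefixed with __label__pos or __label__neg
classifier = fasttext.supervised('data/train_labeled.txt', 'fasttext_model',
                                 label_prefix='__label__')
print(classifier.predict(['so happy about the weekend', 'this is terrible']))
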
  3. Clone sent2vec into the root directory of the project and follow its Setup & Requirements section to compile it. Then download the sent2vec_twitter_bigrams v1 embeddings (23 GB, 700-dimensional, trained on English tweets) and place them in data/
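
A hedged sketch of loading the downloaded embeddings from Python, assuming the sent2vec wrapper exposes Sent2vecModel with load_model and embed_sentences (check the sent2vec README for the exact API) and that the model file name below matches what you placed in data/:

import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('data/sent2vec_twitter_bigrams.bin')  # file name assumed; use the one you downloaded
embeddings = model.embed_sentences(['i love this movie', 'worst day ever'])
print(embeddings.shape)  # expected (2, 700) for the 700-dimensional Twitter bigram model
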

  4. Download the GloVe Twitter pretrained word vectors glove.twitter.27B.zip. Unzip the file and place glove.twitter.27B.200d.txt in data/glove/
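
Loading these vectors is plain text parsing; a minimal sketch, assuming the file was placed as above:

import numpy as np

glove = {}
with open('data/glove/glove.twitter.27B.200d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')  # 200-dimensional vector

print(len(glove), glove['happy'].shape)  # vocabulary size, (200,)
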

  5. Download the data from the Kaggle competition and place the .txt files in data/twitter-datasets/.
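
A short sketch of loading this data; the file names below are an assumption about the competition archive, so adjust them to what you actually downloaded:

def load_tweets(path):
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f]

pos = load_tweets('data/twitter-datasets/train_pos.txt')  # tweets that contained ':)'
neg = load_tweets('data/twitter-datasets/train_neg.txt')  # tweets that contained ':('
tweets = pos + neg
labels = [1] * len(pos) + [-1] * len(neg)
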

  6. Install the following Python requirements:

  • scikit-learn
  • keras with the TensorFlow backend (a minimal check is sketched below)
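
To check that keras runs on the TensorFlow backend, here is a minimal, illustrative binary classifier on bag-of-words features; it is a toy sketch, not the project's model (that lives in src/):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.layers import Dense

# Toy data standing in for tweets; the real data comes from data/twitter-datasets/
texts = ['love this so much', 'what a great day', 'this is awful', 'worst day ever']
y = np.array([1, 1, 0, 0])
X = CountVectorizer().fit_transform(texts).toarray()

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=X.shape[1]))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=10, verbose=0)
print(model.predict(X))  # probabilities of the positive class
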

How to run the project with the pretrained model to get the Kaggle submission

  1. Clone this repo
$ git clone https://github.com/m-doru/tweets-sentiment-analysis.git
$ cd tweets-sentiment-analysis
  2. Download the data from the Kaggle competition and place the .txt files in data/twitter-datasets/.

  3. Install the following Python 3 requirements:

  • scikit-learn
  4. Run run_pretrained.py to generate the Kaggle submission.
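
For reference, a submission file in the usual Id,Prediction format can be written as in the sketch below; the column names and the {-1, 1} labels are assumptions about the competition format, and run_pretrained.py already takes care of this step:

import csv

def write_submission(ids, predictions, path='submission.csv'):
    # Assumed format: header Id,Prediction with labels in {-1, 1}
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Prediction'])
        for i, p in zip(ids, predictions):
            writer.writerow([i, p])

write_submission([1, 2, 3], [1, -1, 1])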