This repository contains a Naive Bayes classifier for document classification, developed for CSCI 8360, Data Science Practicum at the University of Georgia, Spring 2018.
This project uses the Reuters Corpus, a set of news stories split into a hierarchy of categories, but considers only the following four categories:
- CCAT: Corporate / Industrial
- ECAT: Economics
- GCAT: Government / Social
- MCAT: Markets
For documents with more than one category (which is common), we treat the document as if it were observed once for each category. For instance, a document labeled with both CCAT and MCAT is duplicated, with one copy paired with CCAT and the other with MCAT, as illustrated below.
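As a toy sketch (not taken from p1.py), a record labeled with several categories expands into one (label, document) pair per relevant category:

record = ("doc_0", "Stocks fell sharply ...", "C152,CCAT,MCAT")
expanded = [(label, record[1]) for label in record[2].split(",") if "CAT" in label]
# expanded == [("CCAT", "Stocks fell sharply ..."), ("MCAT", "Stocks fell sharply ...")]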
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
- Python 3.6
- Apache Spark 2.2.1
- Pyspark 2.2.1 - Python API for Apache Spark
- Google Cloud Platform
- Anaconda - package manager for nltk, string
Anaconda is a complete Python distribution that bundles the most common packages and makes installing new packages easy.
Download and install Anaconda (https://www.continuum.io/downloads).
The environment.yml file for conda is placed in Extra for ease of installation.
Download the latest, pre-built for Hadoop 2.6, version of Spark.
- Go to http://spark.apache.org/downloads.html
- Choose a release (take the latest one)
- Choose a package type: Pre-built for Hadoop 2.6 and later
- Choose a download type: Direct Download
- Click on the link in Step 4
- Once downloaded, unzip the file and place it in a directory of your choice
Go to the WIKI tab for more details on running an IDE for Pyspark (IDE Setting for Pyspark).
To download NLTK stopwords on GCP, first install nltk:
$ pip install nltk
Once installed, you will have to download the stopwords file:
$ python
Once the Python prompt has started:
>>> import nltk
>>> nltk.download()
This will start either a UI or command-prompt-based instructions.
Type d, press enter, and then type stopwords; this will initiate the download.
The best way would be python -m nltk.downloader -d /usr/local/share/nltk_data popular
In the UI, search for the stopwords entry and press download.
After this, please make sure that all nltk files are in /usr/local/share.
You can run p1.py
via regular Python or via spark-submit. If you use spark-submit, specify the path to your spark-submit executable.
$ python p1.py [training-directory] [testing-directory] [optional args]
$ /usr/bin/spark-submit p1.py [training-directory] [testing-directory] [optional args]
The output file pred_test_<size>.txt
is named according to the size you selected and saved to the directory you specified. The required and optional arguments are as follows:
Required Arguments
- ptrain : Directory containing the input training files
- ptest : Directory containing the input testing files

Optional Arguments
- -s : Size of the selected dataset (Default: vsmall). vsmall for the very small dataset, small for the small dataset, and large for the large dataset. The output file name includes the selected size, e.g. pred_test_vsmall.txt.
- -o : Path to the output directory where outputs will be written (Default: root directory).
- -a : Report the accuracy of the testing prediction (Default: True). This option gives you the accuracy of the prediction. If the testing label file does not exist, the classifier will still output the prediction file but print Accuracy is not available!.
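A minimal sketch of how these arguments could be parsed (the names follow the list above; the exact argparse setup in p1.py may differ):

import argparse

parser = argparse.ArgumentParser(description="Naive Bayes document classification")
parser.add_argument("ptrain", help="directory containing the input training files")
parser.add_argument("ptest", help="directory containing the input testing files")
parser.add_argument("-s", default="vsmall", choices=["vsmall", "small", "large"],
                    help="dataset size, used in the output file name, e.g. pred_test_vsmall.txt")
parser.add_argument("-o", default=".", help="output directory (default: root directory)")
parser.add_argument("-a", default="True", help="whether to print the prediction accuracy")
args = parser.parse_args()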
After splitting the document content, we implement punctuation stripping and word stemming with several Python APIs. Brief explanations of the packages are given below, with more details in the WIKI tab.
import string
PUNC = string.punctuation
Import the string package; string.punctuation then gives you a string of all punctuation characters.
We remove all punctuation before and after a word, but keep punctuation that appears between two words, e.g. happy--newyear, super.....bowl.
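For a quick illustration of the intended behavior (the project itself uses the custom check_punctuation function shown later, not str.strip):

import string
"...super.....bowl!!".strip(string.punctuation)   # -> 'super.....bowl'
"happy--newyear".strip(string.punctuation)        # -> 'happy--newyear' (inner punctuation kept)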
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
SW = stopwords.words('english')
Import stopwords from nltk.corpus; stopwords.words('english') then gives you the list of English stopwords. Notice that you should import nltk first and download stopwords before importing it from the corpus.
We remove stopwords that might distract our classifier from the important words, e.g. the, is, you.
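For example, filtering a tokenized document against SW (illustrative):

words = ["the", "market", "is", "volatile"]
content_words = [w for w in words if w not in SW]
# content_words == ["market", "volatile"]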
import nltk

# Lancaster stemmer
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
word = lancaster_stemmer.stem(word)

# Porter stemmer
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
word = ps.stem(word)

# WordNet lemmatizer (requires the wordnet corpus)
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()
word = wnl.lemmatize(word)
We tried three stemming packages from nltk.stem in this project. Examples of each stemming package (Lemmatizer, Lancaster, Porter) are introduced in the WIKI tab. Notice that you have to download wordnet before importing it.
After word stemming, all words are reduced to their stems, e.g. cars, car's, and car all become car.
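Continuing the example with the objects created in the snippet above (illustrative outputs):

wnl.lemmatize("cars")            # -> 'car'
ps.stem("cars")                  # -> 'car'
lancaster_stemmer.stem("cars")   # -> 'car'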
This project mainly uses a Naive Bayes classifier with several preprocessing methods. Here is a brief flow of what we did:
- Word splitting by white space (optional: replace double hyphens -- with white space)
- Word tokenizing
- Punctuation removal - remove punctuation before and after each word
- Word stemming - Lancaster, Porter, or Lemmatizer stemming methods
- Punctuation removal again
- Stopword removal
- Naive Bayes Classifier
- Prediction
See more details of each section in WIKI tab.
We express the data structures inside the RDDs at each stage as follows:
- Input data with label
RDD([(doc_id_0, document_0, label_0),
(doc_id_1, document_1, label_1), ...])
- Training set after preprocessing
RDD([((label_0, word_0), (word_count_0, word_total_count_in_label_0)),
((label_0, word_1), (word_count_1, word_total_count_in_label_0)), ...])
- Training set after NB classifier
rdd_train = [rdd_train_labword_cp, rdd_train_lab_cp0, rdd_train_lab_pp]
rdd_train_labword_cp
= RDD([((label_0, word_0), cond_prob_of_word_0),
((label_0, word_1), cond_prob_of_word_1), ...])
rdd_train_lab_cp0
= RDD([(label_0, cond_prob_of_count0_in_label_0),
(label_1, cond_prob_of_count0_in_label_1), ...])
rdd_train_lab_pp
= RDD([(label_0, prior_prob_in_label_0),
(label_1, prior_prob_in_label_1), ...])
- Testing set before prediction
RDD([((label_0, word_0), doc_id_0),
((label_0, word_1), doc_id_0), ...])
- Training and Testing set during prediction
RDD([((doc_id_0, label_0), (cond_prob_0, prior_prob_0)),
((doc_id_0, label_0), (cond_prob_1, prior_prob_0)), ...])
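A minimal sketch of how the first two stages above could be built (the toy documents and variable names here are hypothetical; p1.py differs in detail):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stage 1: input data with label (toy documents)
rdd_input = sc.parallelize([
    (0, "stocks rally on strong earnings", "MCAT"),
    (1, "parliament passes budget vote", "GCAT")])

# Count each word within its label: ((label, word), word_count)
word_counts = rdd_input \
    .flatMap(lambda x: [((x[2], w), 1) for w in x[1].split(" ")]) \
    .reduceByKey(lambda a, b: a + b)

# Total word count per label: (label, word_total_count_in_label)
label_totals = word_counts \
    .map(lambda x: (x[0][0], x[1])) \
    .reduceByKey(lambda a, b: a + b)

# Stage 2: ((label, word), (word_count, word_total_count_in_label))
rdd_train = word_counts \
    .map(lambda x: (x[0][0], (x[0][1], x[1]))) \
    .join(label_totals) \
    .map(lambda x: ((x[0], x[1][0][0]), (x[1][0][1], x[1][1])))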
You can read the script p1.py and the comments we left in the code for more details on how these formats are used by the NB classifier.
- Join Label and Content
rdd = rdd_train_data.join(rdd_train_label).map(lambda x: x[1])
Join two rdds that have been zipped with index to link labels with documents
- Duplicate Documents with multiple labels
rdd = rdd.map(lambda x: (x[0], x[1].split(',')))
rdd = rdd.flatMapValues(lambda x: x).filter(lambda x: 'CAT' in x[1]).map(lambda x: (x[1],x[0]))
- Tokenize Words and Remove &quot;
def tokenize_words(no_quot_words):
    no_quot_words = no_quot_words.split("&quot;")  #.replace(".", " ").replace("--", " ")
    new = []
    for item in no_quot_words:
        new.extend(item.split(" "))
    return new
We remove &quot; by splitting on it, so words can then be easily split by .split(" "). We also experimented with removing "." and/or "--".
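For example (illustrative; empty strings produced by consecutive delimiters are assumed to be filtered out in a later step):

tokenize_words("stocks rose &quot;sharply&quot; today")
# -> ['stocks', 'rose', '', 'sharply', '', 'today']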
- Remove Punctuation
def remove_punctuation_from_end(word):
    punctuation = PUNC.value
    if len(word)>0 and word[0] in punctuation:
        word = word[1:]
    if len(word)>0 and word[-1] in punctuation:
        word = word[:-1]
    return word

def check_punctuation(word):
    punctuation = PUNC.value
    while len(word)>0 and (word[0] in punctuation or word[-1] in punctuation):
        word = remove_punctuation_from_end(word)
    return word
This part only removes punctuation at the beginning or the end of a word. It will not remove punctuation inside a word, which prevents words like "we're" from being split.
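Illustrative usage, assuming PUNC is a Spark broadcast of string.punctuation (e.g. PUNC = sc.broadcast(string.punctuation)):

check_punctuation('...happy!!')    # -> 'happy'
check_punctuation('"we\'re...')    # -> "we're"  (inner apostrophe preserved)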
- Stemming
def cleanup_word(word):
    w = check_punctuation(word)
    lancaster_stemmer = LancasterStemmer()
    # ps = PorterStemmer()
    wnl = WordNetLemmatizer()
    # w = wnl.lemmatize(w.lower())
    # w = ps.stem(w.lower())
    # w = ps.stem(wnl.lemmatize(w.lower()))
    w = lancaster_stemmer.stem(wnl.lemmatize(w.lower()))
    w = check_punctuation(w)
    return w
This part is mainly for stemming, but you'll notice we check and/or remove punctuation twice. The first pass cleans the word for the stemmer: if a word like cars. still has punctuation attached, the stemmer will not stem it. Once the word is cleaned to cars, the stemmer reduces it to car, which is what we want.
The experiments we did here used different stemmers and the lemmatizer, alone and in combination.
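For example, with the same PUNC broadcast and NLTK objects as above (illustrative):

cleanup_word("Cars.")   # -> 'car'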
For each label, we keep words that do not occur in that label but occur in other labels, with count 0. The conditional probability of word i given label k is the count of the word in label k divided by the total word count in label k, with Laplace smoothing. For example, the conditional probability of the word happy given the category CCAT is calculated by the following equation:
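(written here in the standard Laplace-smoothed form; the exact smoothing constant in p1.py may differ slightly)

P(\text{happy} \mid \text{CCAT}) = \frac{\text{count}(\text{happy}, \text{CCAT}) + 1}{\sum_{w} \text{count}(w, \text{CCAT}) + V}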
where V is the number of distinct words in the training data, regardless of label. More details about naive Bayes theory and Laplace smoothing are in the WIKI tab.
The prediction for each document in the testing data is the category with the largest sum of the log-transformed conditional probabilities and prior probability. For example, for document 1 of the vsmall data, we calculate the following value for each category:
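(in the notation above, with score a shorthand introduced here for the per-category log value)

\text{score}(c) = \log P(c) + \sum_{w \in \text{doc}_1} \log P(w \mid c), \qquad \hat{c} = \arg\max_{c} \text{score}(c)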
Since we got -436.92 for MCAT, -429.91 for CCAT, -447.68 for GCAT, and -441.24 for ECAT, we assigned CCAT to document 1. If the predicted category is one of the categories in the document's category list, we regard it as a successful prediction. The classifier automatically outputs a prediction list file (pred_test_vsmall.txt) and the prediction accuracy if the testing label file exists.
We tried several different settings in the preprocessing section, and the results are as follows:
Tokenizing | Stemming | Accuracy |
---|---|---|
Remove double hyphens | Lemmatizer | 94.51% |
Remove double hyphens | Lemmatizer + Porter | 94.21% |
 | Lemmatizer + Lancaster | 94.04% |
 | Porter | 94.19% |
 | Lemmatizer | 94.52% |
Remove dots | Lemmatizer | 94.28% |
Therefore, we recommend using only the Lemmatizer for word stemming.
Since the Naive Bayes classifier uses raw counts to calculate probabilities, it is tricky to incorporate TF-IDF into it. However, TF-IDF (Term Frequency-Inverse Document Frequency) is a reasonable way to scale the important words in each category. To improve on this classifier, we plan to extend the project by applying TF-IDF with a Logistic Regression classifier or a K-Nearest-Neighbors classifier.
You might encounter issues when running this classifier on a local machine or Google Cloud Platform for the first time. See the following list of issues; solutions and additional issues are documented in the ISSUES tab:
Local Machine (Windows)
- [Pycharm] Error: too many values to unpack
- [Pyspark] Error: cannot find the file specified
- Worker and Driver have different versions
Google Cloud Platform
- ImportError: No module named nltk
- ImportError: No module named nltk.stem.wordnet
- Resource wordnet not found
- Name node is in safe mode
Some issues were resolved with the help of the almighty Stack Overflow.
[1] Naive Bayes Text Classification - Stanford University
[2] A Review of Machine Learning Algorithms for Text-Documents Classification
- Weiwen Xu - WeiwenXu21
- I-Huei Ho - melanieihuei
- Nihal Soans - nihalsoans91
See the CONTRIBUTORS file for details.
This project is licensed under the MIT License - see the LICENSE.md file for details