The goal of this project is to learn how to make a simple article recommendation engine using a semi-recent advance in natural language processing called word2vec (or just word vectors). In particular, I have used a "database" of word vectors from Stanford's GloVe project, trained on a dump of Wikipedia. The project involves reading in a database of word vectors and a corpus of text articles, then organizing them into a handy table (a list of records) for processing.

Around the recommendation engine, you are going to build a web server that displays a list of BBC articles at URL http://localhost:5000 (testing) or at whatever IP address your Amazon server has (deployment).

Clicking on one of those articles takes you to an article page that shows the text of the article as well as a list of five recommended articles.

Article word-vector centroids

In a nutshell, each word has a vector of, say, 300 floating-point numbers that somehow capture the meaning of the word, at least as it relates to other words within a corpus. These vectors are learned by a model (a shallow neural network in word2vec, a regression on word co-occurrence counts in GloVe) that maps a word to an output vector such that words appearing near each other in some large corpus end up close in 300-space. ("The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning." See the GloVe project.)

Two words are related if their word vectors are close in 300-space. Similarly, if we compute the centroid of a document's cloud of word vectors, related articles should have centroids that are close in 300-space. Words that appear frequently in a document push the centroid in the direction of that word's vector. The centroid is just the sum of the vectors divided by the number of words in the article. Given an article, we can compute the distance from its centroid to every other article's centroid. The articles whose centroids are closest to the centroid of the article of interest are the most similar articles. Surprisingly, this simple technique works well, as you can see from the examples above.
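
To make the centroid idea concrete, here is a minimal sketch using tiny 3-dimensional toy vectors instead of real 300-dimensional GloVe vectors. The vector values are made up, the helper names centroid and distance are hypothetical, and Euclidean distance is an assumption (the text above only says "distance"). In practice you would likely use numpy arrays rather than plain lists.

```python
import math

def centroid(vectors):
    """Mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]

def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional "word vectors"; made-up values, not real GloVe data
toy = {
    "bank":  [1.0, 0.0, 0.0],
    "money": [0.8, 0.2, 0.0],
    "goal":  [0.0, 0.0, 1.0],
}

# A repeated word contributes its vector once per occurrence,
# which is how frequent words pull the centroid toward themselves.
doc_words = ["bank", "money", "bank"]
doc_centroid = centroid([toy[w] for w in doc_words])

near = distance(doc_centroid, toy["bank"])   # small: the document is about banking
far = distance(doc_centroid, toy["goal"])    # large: unrelated direction
```

Here the document's centroid sits much closer to the vector for "bank" than to the one for "goal", which is the property the recommendation engine relies on.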

Given a word vector filename, such as glove.6B.300d.txt, and the root directory of the BBC article corpus, the main file first reads both from the command line and then loads them into memory:

# get commandline arguments
i = sys.argv.index('server:app')
glove_filename = sys.argv[i+1]
articles_dirname = sys.argv[i+2]
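
The lookup of 'server:app' is presumably there because the server is started through gunicorn, so sys.argv also contains gunicorn's own options and the two arguments cannot be found at fixed positions. As a rough sketch of the same idea with a little error checking, assuming a hypothetical helper named parse_args and an illustrative argument list:

```python
def parse_args(argv):
    """Return (glove_filename, articles_dirname) found right after 'server:app'."""
    i = argv.index('server:app')   # raises ValueError if the app spec is absent
    if len(argv) < i + 3:
        raise ValueError("expected: server:app <glove file> <articles dir>")
    return argv[i + 1], argv[i + 2]

# Hypothetical argv as it might look under gunicorn; the option values
# here are illustrative only, not the project's actual launch command.
example_argv = ['gunicorn', '-b', '0.0.0.0:5000', 'server:app',
                'glove.6B.300d.txt', 'bbc']
glove_filename, articles_dirname = parse_args(example_argv)
```

In the real server you would pass sys.argv instead of example_argv.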

The gloves variable is the dictionary mapping a word to its 300-dimensional vector. The articles variable is a list of records, one for each article. An article record is just a list containing the fully-qualified file name, the article title, the text without the title, and the word-vector centroid computed from the text without the title.
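
For reference, a GloVe text file such as glove.6B.300d.txt has one word per line followed by its space-separated vector components. A minimal sketch of building the gloves dictionary from such a file might look like the following; the function name load_glove_sketch is hypothetical and is not the starter kit's function, and it stores plain lists of floats where you would probably want numpy arrays.

```python
def load_glove_sketch(filename):
    """Map each word in a GloVe text file to its vector (a list of floats)."""
    gloves = {}
    with open(filename, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue   # skip blank or malformed lines
            word = parts[0]
            gloves[word] = [float(x) for x in parts[1:]]
    return gloves
```

Looking up a word's vector is then just gloves[word]; words in an article that are missing from the dictionary have no vector to contribute.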

Then, to get the list of the five most relevant articles, we'll do this:

seealso = recommended(doc, articles, 5)

The description of those functions is in the starter kit, but it's worth summarizing them here:

def load_articles(articles_dirname, gloves):
    """
    Load all .txt files under articles_dirname and return a table (list of tuples)
    where each record is a tuple of:

      (filename, title, article-text-minus-title, wordvec-centroid-for-article-text)

    We use the gloves parameter to compute the word vectors and centroid.

    The filename is stripped of the prefix of the articles_dirname pulled in as
    script parameter sys.argv[2]. E.g., filename will be "business/223.txt"
    """

def recommended(article, articles, n):
    """
    Return a list of the n articles (records with filename, title, etc...)
    closest to article's word vector centroid. The article is one of the elements
    (tuple) from the articles list.
    """
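
As an illustration of what recommended has to do, here is one possible sketch. It assumes Euclidean distance, assumes the centroid is the fourth field of each record, and excludes the article itself from its own recommendations; none of that is dictated by the starter kit. The name recommended_sketch and the toy records are made up for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommended_sketch(article, articles, n):
    """Return the n records whose centroids are nearest to article's centroid."""
    target = article[3]   # centroid is assumed to be the 4th field of a record
    others = [a for a in articles if a is not article]   # skip the article itself
    others.sort(key=lambda a: distance(target, a[3]))    # nearest first
    return others[:n]

# Toy records with 2-dimensional centroids; all values are made up
toy_articles = [
    ("business/001.txt", "Bank profits rise", "text one", [0.9, 0.1]),
    ("business/002.txt", "Markets rally", "text two", [0.8, 0.2]),
    ("sport/001.txt", "Late goal wins cup", "text three", [0.1, 0.9]),
    ("tech/001.txt", "Phone payments grow", "text four", [0.5, 0.5]),
]

seealso = recommended_sketch(toy_articles[0], toy_articles, 2)
```

With these toy centroids, the two recommendations for the first business article are the other business article followed by the tech article.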

Web server

Besides those core functions, I also built a web server using Flask. See the video on how to launch a Flask web server at Amazon, which uses the simple built-in Flask web server, not gunicorn. We need to use gunicorn because the "... Flask’s built-in server is not suitable for production as it doesn’t scale well and by default serves only one request at a time." (from the doc). See Standalone WSGI Containers for more on using Flask with gunicorn. The server should respond to two different URLs: the list of articles is at / and each article is at something like /article/business/353.txt. The BBC corpus in directory bbc is organized with topic subdirectories, each containing a list of articles as text files, so an article has a path like business/353.txt under bbc.

So, if you are testing from your laptop, you would go to the following URL in your browser to get the list of articles:

http://localhost:5000/

And to get to a specific article, you would go to:

http://localhost:5000/article/business/353.txt

The localhost:5000 will be replaced with an IP address plus ":5000" or some machine name given to you by Amazon when you deploy your server.
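
The article URL maps naturally onto the filename stored in each record: /article/business/353.txt corresponds to the record whose filename is "business/353.txt". A small sketch of that lookup follows. The helper name find_article and the toy records are hypothetical, and the Flask wiring appears only as a comment because the exact route functions are up to you.

```python
def find_article(articles, topic, filename):
    """Return the record whose stored filename is 'topic/filename', or None."""
    wanted = f"{topic}/{filename}"
    for record in articles:
        if record[0] == wanted:   # filename is the first field of a record
            return record
    return None

# Toy records; titles, text, and centroids are made up
toy_articles = [
    ("business/353.txt", "Example business story", "body text", [0.9, 0.1]),
    ("sport/010.txt", "Example sport story", "body text", [0.1, 0.9]),
]

doc = find_article(toy_articles, "business", "353.txt")

# One way this could be wired into Flask (not executed here):
#
#   @app.route("/article/<topic>/<filename>")
#   def article(topic, filename):
#       doc = find_article(articles, topic, filename)
#       seealso = recommended(doc, articles, 5)
#       ... render the article text plus the seealso links as HTML ...
#
# A real route should also respond with a 404 when find_article returns None.
```

A linear scan is fine for a corpus of this size; a dictionary keyed by filename would be the obvious alternative if lookups ever became a bottleneck.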


