nlp-journalism-workshop

NLP in Journalism Workshop at PyDays

Setup

Clone this repo with [email protected]:shangyian/nlp-journalism-workshop.git.
Create a virtualenv environment with virtualenv env.
Activate it with source env/bin/activate
Install Python library requirements with pip install -r requirements.txt.
Install Redis. Start the server with redis-server.

Loading dataset

We are using the Vox articles dataset, which contains all articles published on Vox.com before March 2017. You can download the dataset (in TSV format) from https://data.world/elenadata/vox-articles). Copy this into the data/ directory, so that we have data/vox_Articles.tsv.

Then we'll want to load and clean the data. In general, this involves:

removing HTML tags
removing stop words
tokenizing
stemming

In order for the Flask API to work, we'll need to build a SQLite database with our articles. To do this, run python main.py --load_from ./data/vox_Articles.tsv.

Once you’ve loaded the data into SQLite and set up Redis, we can run the API, which lets us see the data in a more organized fashion: python api.py. The API should be running on http://0.0.0.0:8000/.

We can test that it’s up with http://0.0.0.0:8000/articles, which should return a list of article ids from the database that you can query. You can also pick one of the article ids and try http://0.0.0.0:8000/articles/<article_id>, which will output specific data about that article.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
complete		complete
data		data
exercise		exercise
.gitignore		.gitignore
NLP in Journalism.pdf		NLP in Journalism.pdf
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nlp-journalism-workshop

Setup

Loading dataset

Similarity

TF-IDF

Word2Vec/Article2Vec

About

Releases

Packages

Languages

shangyian/nlp-journalism-workshop

Folders and files

Latest commit

History

Repository files navigation

nlp-journalism-workshop

Setup

Loading dataset

Similarity

TF-IDF

Word2Vec/Article2Vec

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages