NLP in Journalism Workshop at PyDays
- Clone this repo with
[email protected]:shangyian/nlp-journalism-workshop.git
- Create a virtualenv environment with
virtualenv env
- Activate it with
source env/bin/activate
- Install Python library requirements with
pip install -r requirements.txt
- Install Redis. Start the server with
redis-server
We are using the Vox articles dataset, which contains all articles published on Vox.com before March 2017.
You can download the dataset (in TSV format) from https://data.world/elenadata/vox-articles. Copy it into the data/ directory, so that we have data/vox_Articles.tsv.
Then we'll want to load and clean the data. In general, this involves:
- removing HTML tags
- removing stop words
- tokenizing
- stemming
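As a rough illustration of the cleaning steps above, here is a minimal standard-library sketch. It is not the workshop's actual code: the stop-word list is a tiny hand-picked sample, `naive_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and all function names are hypothetical. It tokenizes before dropping stop words, since stop-word filtering works on tokens.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on", "for"}

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def strip_html(text):
    """Replace HTML tags with spaces, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", text))


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Strip HTML, tokenize, drop stop words, then stem what is left."""
    tokens = tokenize(strip_html(text))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `clean("<p>The senators are voting on the bills</p>")` gives `['senator', 'vot', 'bill']`. The crude stem "vot" shows why a proper stemming library is preferable for real work.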
In order for the Flask API to work, we'll need to build a SQLite database with our articles. To do this, run
python main.py --load_from ./data/vox_Articles.tsv
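To give a feel for what loading a TSV file into SQLite involves, here is a minimal sketch using only the standard library. It is not the code in main.py, whose schema is not shown here: the `load_tsv` function and the `articles` table name are assumptions, and every column is simply stored as TEXT under the name found in the TSV header.

```python
import csv
import sqlite3


def load_tsv(conn, tsv_file, table="articles"):
    """Create `table` from the TSV header, insert every row, return the row count."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    # Quote column names so headers containing spaces still work.
    columns = ", ".join(
        '"{}" TEXT'.format(name.replace('"', '""')) for name in header
    )
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    # Skip malformed rows whose field count does not match the header.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
    )
    conn.commit()
    return len(rows)
```

With a real file this could be called as `load_tsv(sqlite3.connect("articles.db"), open("data/vox_Articles.tsv", newline="", encoding="utf-8"))`, where the database filename is again only a placeholder.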
Once you’ve loaded the data into SQLite and set up Redis, we can run the API, which lets us see the data in a more organized fashion:
python api.py
The API should be running on http://0.0.0.0:8000/.
We can test that it’s up with http://0.0.0.0:8000/articles, which should return a list of article ids from the database that you can query. You can also pick one of the article ids and try http://0.0.0.0:8000/articles/<article_id>, which will output specific data about that article.
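The same two endpoints can also be queried from Python. The sketch below uses only the standard library and assumes that the API responds with JSON, which is not stated above; the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://0.0.0.0:8000"  # address the API is expected to run on


def articles_url(base=BASE_URL):
    """URL of the endpoint that lists article ids."""
    return base.rstrip("/") + "/articles"


def article_url(article_id, base=BASE_URL):
    """URL of the endpoint for a single article."""
    return "{}/{}".format(articles_url(base), quote(str(article_id), safe=""))


def fetch_json(url, timeout=5):
    """GET `url` and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_json(articles_url())` should return the list of ids, and `fetch_json(article_url(some_id))` the data for one of them. If 0.0.0.0 is not reachable from your machine, passing `base="http://localhost:8000"` may work instead.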