A corpus-builder for Argos.

This is very simple: it just collects article data from a set of sources specified in `sources.json` at regular intervals. Later this data can be processed, used for training, or whatever else you need.
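Just as an illustration — the names and fields below are made up, so check the actual `sources.json` in this repo for the real structure — the file is a JSON listing of the sources to collect from, e.g. feed URLs:

```json
[
    {"name": "Example News (hypothetical)", "url": "http://example.com/rss/news.xml"},
    {"name": "Another Feed (hypothetical)", "url": "http://example.org/feed"}
]
```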
This project can also digest WikiNews `pages-articles` XML dumps to build out evaluation Event clusters (see below).
- Set up `config.py`
- Run `setup.sh`
- Set up the crontab (a sample entry is sketched after this list)
- Activate the virtualenv and run `python main.py load_sources` to load the sources (from `sources.json`) into the database.
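For example, a crontab entry might look something like the following. The paths and the `main.py` collection command here are assumptions, not something this README specifies, so adapt them to whatever `setup.sh` actually configures:

```
# Hypothetical crontab entry: collect article data from the sources once an hour.
# The repo path, virtualenv path, and main.py subcommand are placeholders.
0 * * * * cd /path/to/argos_corpora && /path/to/venv/bin/python main.py collect
```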
At some point you will probably want to move the data elsewhere for processing.
If you ssh into the machine running the database, you can get an export:
$ mongodump -d argos_corpora -o /tmp
$ tar -cvzf /tmp/dump.tar.gz /tmp/argos_corpora
From your local machine, you can grab it with `scp` and then import it into a local MongoDB instance.
$ scp remoteuser@remotemachine:/tmp/dump.tar.gz .
$ tar -zxvf dump.tar.gz
$ cd tmp
$ mongorestore argos_corpora
It's likely, though, that you want to export only the training fields (`title` and `text`) to a JSON file for training:
$ mongoexport -d argos_corpora -c article -f title,text --jsonArray -o articles.json
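The exported file is a single JSON array (because of `--jsonArray`), so a minimal sketch of loading it for downstream processing looks like:

```python
# Minimal sketch: load the exported training fields for processing.
# Assumes articles.json was produced by the mongoexport command above.
import json

with open('articles.json') as f:
    articles = json.load(f)  # one big JSON list of article documents

for article in articles[:5]:
    print(article['title'], len(article.get('text', '')))
```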
The `sampler` package can digest WikiNews `pages-articles` XML dumps for the purpose of assembling evaluation data.

It takes any WikiNews page with at least two cited sources, assumes that the page constitutes an Event, and treats its cited sources as the Event's member articles. This data is saved to MongoDB and can later be used to evaluate the performance of the main Argos project's clustering.
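Roughly, each saved Event pairs the WikiNews page with the articles fetched from its citations. The field names in this sketch are my own illustration, not the sampler's actual schema:

```python
# Purely illustrative shape of one Event (field names are hypothetical).
event = {
    'title': 'Some WikiNews headline',  # the WikiNews page itself
    'articles': [                       # one entry per cited source
        {'title': 'First cited article', 'text': '...'},
        {'title': 'Second cited article', 'text': '...'},
    ],
}
```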
You can download the latest `pages-articles` dump at http://dumps.wikimedia.org/enwikinews/latest/.

I strongly suggest you pare this dump file down to maybe just the last 100 pages, so you're not fetching a ton of articles; a rough script for that is sketched below.
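Here's a minimal sketch of one way to pare it down, assuming the (uncompressed) dump fits in memory; the script name and output filename are made up and not part of this repo:

```python
# pare_dump.py -- hypothetical helper, not part of this repo.
# Keeps only the last N <page> elements of a pages-articles dump,
# preserving the header and the closing </mediawiki> tag so the XML stays valid.
import re
import sys
from collections import deque

N = 100
path = sys.argv[1]

with open(path, encoding='utf-8') as f:
    dump = f.read()

# Everything before the first <page> is the header (<mediawiki>, <siteinfo>, ...).
header = dump.split('<page>', 1)[0]

# A deque with maxlen keeps only the last N pages found.
pages = deque(re.findall(r'<page>.*?</page>', dump, flags=re.DOTALL), maxlen=N)

out_path = path + '.pared.xml'
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(header)
    f.write('\n'.join(pages))
    f.write('\n</mediawiki>\n')

print('Wrote %d pages to %s' % (len(pages), out_path))
```

Then point the `sample_preview`/`sample` commands below at the pared-down file instead of the full dump.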
To use it, run:
# Start mongodb:
$ mongod --dbpath db
# Preview how many events and articles will be created/downloaded:
# useful if you don't want to process tens of thousands of things.
$ python main.py sample_preview /path/to/the/wikinews/dump.xml
# Process the dump for reals
$ python main.py sample /path/to/the/wikinews/dump.xml
That will parse the pages, and for any page that has at least two cited sources, it will fetch the article data for those sources and save everything to MongoDB.
Then you can export that data:
$ mongoexport -d argos_corpora -c sample_event --jsonArray -o ~/Desktop/sample_events.json
This can then be used in the main Argos project for evaluation.