A corpus-builder for Argos.

This is very simple: it just collects article data from a set of sources specified in `sources.json` at regular intervals. Later this data can be processed, used for training, or whatever else you need.
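Just as an illustration — the names and fields below are made up, so check the actual `sources.json` in this repo for the real structure — the file is a JSON listing of the sources to collect from, e.g. feed URLs:

```json
[
    {"name": "Example News (hypothetical)", "url": "http://example.com/rss/news.xml"},
    {"name": "Another Feed (hypothetical)", "url": "http://example.org/feed"}
]
```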
This project can also digest WikiNews `pages-articles` XML dumps to build out evaluation Event clusters (see below).
- Set up `config.py`
- Run `setup.sh`
- Set up the crontab (a sample entry is sketched after this list)
- Activate the virtualenv and run `python main.py load_sources` to load the sources (from `sources.json`) into the database.
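For example, a crontab entry might look something like the following. The paths and the `main.py` collection command here are assumptions, not something this README specifies, so adapt them to whatever `setup.sh` actually configures:

```
# Hypothetical crontab entry: collect article data from the sources once an hour.
# The repo path, virtualenv path, and main.py subcommand are placeholders.
0 * * * * cd /path/to/argos_corpora && /path/to/venv/bin/python main.py collect
```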
At some point you will probably want to move the data elsewhere for processing.
If you ssh into the machine running the database, you can get an export:
$ mongodump -d argos_corpora -o /tmp
$ tar -cvzf /tmp/dump.tar.gz /tmp/argos_corpora
From your local machine, you can grab it with `scp` and then import it into a local MongoDB instance.
$ scp remoteuser@remotemachine:/tmp/dump.tar.gz .
$ tar -zxvf dump.tar.gz
$ cd tmp
$ mongorestore argos_corpora
It's likely, though, that you want to export only the training fields (`title` and `text`) to a JSON file for training:
$ mongoexport -d argos_corpora -c article -f title,text --jsonArray -o articles.json
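The exported file is a single JSON array (because of `--jsonArray`), so a minimal sketch of loading it for downstream processing looks like:

```python
# Minimal sketch: load the exported training fields for processing.
# Assumes articles.json was produced by the mongoexport command above.
import json

with open('articles.json') as f:
    articles = json.load(f)  # one big JSON list of article documents

for article in articles[:5]:
    print(article['title'], len(article.get('text', '')))
```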
The `sampler` package can digest WikiNews `pages-articles` XML dumps for the purpose of assembling evaluation data.

It takes any WikiNews page with at least two cited sources, assumes that the page constitutes an Event, and treats its cited sources as the Event's member articles. This data is saved to MongoDB and can later be used to evaluate the performance of the main Argos project's clustering.
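Roughly, each saved Event pairs the WikiNews page with the articles fetched from its citations. The field names in this sketch are my own illustration, not the sampler's actual schema:

```python
# Purely illustrative shape of one Event (field names are hypothetical).
event = {
    'title': 'Some WikiNews headline',  # the WikiNews page itself
    'articles': [                       # one entry per cited source
        {'title': 'First cited article', 'text': '...'},
        {'title': 'Second cited article', 'text': '...'},
    ],
}
```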
You can download the latest `pages-articles` dump at http://dumps.wikimedia.org/enwikinews/latest/.

I strongly suggest you pare this dump file down to maybe just the last 100 pages, so you're not fetching a ton of articles; a rough script for that is sketched below.
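Here's a minimal sketch of one way to pare it down, assuming the (uncompressed) dump fits in memory; the script name and output filename are made up and not part of this repo:

```python
# pare_dump.py -- hypothetical helper, not part of this repo.
# Keeps only the last N <page> elements of a pages-articles dump,
# preserving the header and the closing </mediawiki> tag so the XML stays valid.
import re
import sys
from collections import deque

N = 100
path = sys.argv[1]

with open(path, encoding='utf-8') as f:
    dump = f.read()

# Everything before the first <page> is the header (<mediawiki>, <siteinfo>, ...).
header = dump.split('<page>', 1)[0]

# A deque with maxlen keeps only the last N pages found.
pages = deque(re.findall(r'<page>.*?</page>', dump, flags=re.DOTALL), maxlen=N)

out_path = path + '.pared.xml'
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(header)
    f.write('\n'.join(pages))
    f.write('\n</mediawiki>\n')

print('Wrote %d pages to %s' % (len(pages), out_path))
```

Then point the `sample_preview`/`sample` commands below at the pared-down file instead of the full dump.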
To use it, run:
# Start mongodb:
$ mongod --dbpath db
# Preview how many events and articles will be created/downloaded:
# useful if you don't want to process tens of thousands of things.
$ python main.py sample_preview /path/to/the/wikinews/dump.xml
# Process the dump for reals
$ python main.py sample /path/to/the/wikinews/dump.xml
That will parse the pages, and for any page that has at least two cited sources, it will fetch the article data for those sources and save everything to MongoDB.
Then you can export that data:
$ mongoexport -d argos_corpora -c sample_event --jsonArray -o ~/Desktop/sample_events.json
This can then be used in the main Argos project for evaluation.