Event data in a box, basically.
##About
The main goal of EL:DIABLO is to overcome dependency hell and make it much easier for the end user to get up and running with event data coding. Using software designed for development operations allows us to easily share the setup we use to develop the tools and software for creating and working with event data. In short, no matter what hardware or operating system a user chooses, it is possible to replicate our exact event-data coding platform on that specific configuration.

This goal is important to us for two primary reasons. First, we are striving to make the generation of event data more open than it has historically been. Copyright and licensing agreements make it difficult to share source texts for coded event data, but we can make the process as transparent as possible. This is especially important since a multitude of seemingly minor choices go into event data coding, and they can have a significant impact on the final product.

The second reason we are pursuing EL:DIABLO as a project is to enable collaboration. It will no longer be unclear what steps are taken to generate event data, or what the various moving pieces within the system are. If someone wishes, for example, to develop a new event coder, they can simply drop it into the existing pipeline. The same holds for the various dictionaries, geocoders, web scrapers, etc. It's like Legos. But for event data.
##Components
On the technical side of things, EL:DIABLO provides the information and scripts
necessary to set up a virtual machine on a user's computer. For those not
familiar, this can be thought of as a computer within a computer. EL:DIABLO
relies on Vagrant, and by extension VirtualBox, to set up this virtual
environment. These two pieces of software allow for the easy setup and use of a
virtual machine. Thus, two of the files contained within EL:DIABLO are a
`Vagrantfile`, which gives instructions to Vagrant on how to set up the virtual
machine, and `bootstrap.sh`, which is a shell script that installs the
necessary software within the virtual machine.

The EL:DIABLO event coding platform consists of two primary applications: a web
scraper and a processing pipeline (`scraper` and `phoenix_pipeline`,
specifically). The scraper is a simple web scraper that makes use of a
whitelist of RSS feeds to pull stories from popular news outlets. The pipeline
moves the news stories from storage in a database to the event coder, such as
TABARI or PETRARCH, and outputs event data. More information about the details
of these projects can be found in each project's own documentation. If you use
the standard `bootstrap.sh` script provided with EL:DIABLO, the web scraper
will run once an hour, and the pipeline will run once a day at 01:00.
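
To make the schedule concrete, here is a minimal sketch of what it would look
like as crontab entries. It assumes the schedule is implemented with cron and
that the scraper fires at minute 0 of each hour; neither detail is stated
above. The function name `print_schedule` and the commands passed to it are
placeholders, not what `bootstrap.sh` actually installs.

```shell
# Hypothetical helper: print crontab lines for the schedule described above.
# $1 = command that runs the scraper, $2 = command that runs the pipeline.
print_schedule() {
  # Field order: minute hour day-of-month month day-of-week command
  printf '0 * * * * %s\n' "$1"   # once an hour, at minute 0 (assumed)
  printf '0 1 * * * %s\n' "$2"   # once a day at 01:00
}

# Example with placeholder commands:
#   print_schedule run_scraper run_pipeline
```

The output could be reviewed and then loaded with `crontab -` if you wanted to
reproduce the schedule on another machine.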
##Setting up
As mentioned above, EL:DIABLO relies on Vagrant and VirtualBox for most of the
heavy lifting. This means that the only things a user needs to install on their
local machine are these two pieces of software. The creators of this software
describe the install process better than we can, so a user should consult the
official installation guides for Vagrant and VirtualBox. Once that software is
installed, EL:DIABLO needs to be downloaded from the GitHub repository. For
those familiar with `git`, a `git clone` should work fine. For those unfamiliar
with `git`, it is possible to download the repository as a zip file, as shown
in the picture below.
Note: We've tested this setup on Vagrant 1.6.5.

Once this file is downloaded and unzipped, use the command line to `cd` into
the directory and run `vagrant up`. This will take a while: Vagrant downloads
the operating system image (only the first time) and then installs the relevant
software within the virtual machine. Seriously, this is going to take time; the
process hasn't stalled out. Then run `vagrant ssh` to get into the box. You're
now in the virtual machine.
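
Put together, the steps above might look like the following sketch. It wraps
the commands in a shell function so that nothing runs until you call it. The
function name `start_eldiablo` and the directory in the example call are
illustrative assumptions, not part of EL:DIABLO.

```shell
# Hypothetical wrapper around the startup steps.
# $1 = path to the cloned or unzipped EL:DIABLO directory.
start_eldiablo() {
  cd "$1" || return 1      # stop if the directory does not exist
  vagrant up || return 1   # first run downloads the OS image and installs software
  vagrant ssh              # opens a shell inside the virtual machine
}

# Example call, with an assumed directory name:
#   start_eldiablo ~/eldiablo
```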
As a note, all of this will create a folder somewhere on your local machine
that contains the operating system images. On OS X it's in the home directory
and is named `VirtualBox VMs`.

To get out of the virtual machine, type `exit`, which will bring you back to
your local machine.

There are three methods for ending the Vagrant box: `vagrant suspend`,
`vagrant halt`, and `vagrant destroy`. The main difference between these three
is the amount of system resources used while in the "down" state. If you are
completely done with the virtual machine, and do not wish to keep any of the
data, use `vagrant destroy`. Be aware that this will remove all of the data
within the virtual machine, and all software will have to be reinstalled. If
you wish to just temporarily bring down the virtual machine, see the Vagrant
documentation for the other two commands.
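
As a quick reference, the three commands can be summarized in a small helper.
This is only a sketch: the function name `stop_eldiablo` and its mode names are
made up for illustration, and the comments paraphrase Vagrant's general
behavior rather than anything specific to EL:DIABLO.

```shell
# Hypothetical helper mapping an intent to the matching Vagrant command.
stop_eldiablo() {
  case "$1" in
    pause)    vagrant suspend ;;   # save the running state to disk; quick to resume
    shutdown) vagrant halt ;;      # power off the guest; its disk contents are kept
    remove)   vagrant destroy ;;   # delete the machine and all data inside it
    *)
      echo "usage: stop_eldiablo pause|shutdown|remove" >&2
      return 2
      ;;
  esac
}

# Example: stop_eldiablo shutdown
```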
##Other Information
Due to the way Vagrant sets up the virtual machine, it is necessary to prepend
nearly every command with `sudo`.

The file paths in the config file for the `stanford_pipeline` need to be
changed to use absolute paths. For example:

    cd ~/stanford_pipeline
    sudo vim default_config.ini

Once in the config, change the `~/` characters to `/home/vagrant/`.
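
If you would rather not edit the file by hand, the same substitution can be
sketched with `sed`. This is a minimal sketch under two assumptions: every `~/`
in the file marks the start of a path, and the helper name `expand_home_paths`
is made up for illustration. A `.bak` copy of the original file is kept.

```shell
# Hypothetical helper: replace every "~/" in a config file with /home/vagrant/.
# $1 = path to the config file; the original is saved alongside it as <file>.bak.
expand_home_paths() {
  sed -i.bak 's|~/|/home/vagrant/|g' "$1"
}

# Inside the virtual machine this could be run as, for example:
#   cd ~/stanford_pipeline
#   sudo sed -i.bak 's|~/|/home/vagrant/|g' default_config.ini
```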
The `bootstrap.sh` script is specifically configured for use with the Vagrant
box, but with slight modifications it can be used on any Linux box (it's what
we use to bootstrap our machines). This means that the script can serve as the
basis for setting up EL:DIABLO on a high-performance computer, an individual's
laptop, etc.

Currently the virtual machine is allocated 4 GB of RAM. Less than this doesn't
really work, since the shift-reduce parser needs a fair amount of memory to
operate.

For the two GitHub repositories, `scraper` and `phoenix_pipeline`, the most
recent version of the code is pulled from GitHub each time `vagrant up` is run.
If you have a long-running virtual machine and wish to obtain the latest code,
you can `cd` into the appropriate directory and run `sudo git pull`.
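
For a long-running machine, the manual update could be wrapped in a short
helper like the one below. It is a sketch only: the function name
`update_repos` is hypothetical, and the repository locations are left as
arguments because their paths inside the virtual machine are not given here.

```shell
# Hypothetical helper: run "sudo git pull" in each directory given as an argument.
update_repos() {
  for repo_dir in "$@"; do
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$repo_dir" && sudo git pull ) || echo "update failed: $repo_dir" >&2
  done
}

# Example call, with placeholder paths:
#   update_repos /path/to/scraper /path/to/phoenix_pipeline
```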