A Republic of Emails is a project in which students of the Master in History at the University of Luxembourg experiment with digital methods and tools on the Hillary Clinton emails. These emails were released as part of a FOIA request (see https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy#Freedom_of_Information_lawsuits) and are hosted by Wikileaks at https://wikileaks.org/clinton-emails/. To analyse the emails with digital tools, the Wikileaks email archive first needs to be scraped so that the emails are available in an appropriate format. The scripts in this GitHub repository can be used to scrape and process the archive.
- Installation
- Scraping the Wikileaks email archive
- Text normalisation of email contents
- Named Entity Recognition of email contents using Stanford NER
- Notes
Before installation, the scraping script requires Node.js to be installed; see https://nodejs.org/en/download/.
- Download the A-Republic-of-Emails source in a terminal or command line: `git clone https://github.com/C2DH/A-Republic-of-Emails.git`
- Change into the `A-Republic-of-Emails` directory: `cd A-Republic-of-Emails`
- Install the Node.js dependencies: `npm install`
In the `settings.js` file, specify the number of emails to be scraped per run [1]. The default is `10`, but we advise a number between 500 and 1500. The script keeps count of the scraped emails so that each run continues with the next email.
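What exactly `settings.js` looks like is specific to this repository; purely as an illustration, a settings file for such a scraper might resemble the sketch below (the property name `limit` is an assumption, not necessarily the name used here, so check the file itself before editing):

```javascript
// Hypothetical sketch of a scraper settings file; the actual settings.js
// in this repository may use different property names.
module.exports = {
  // number of emails to scrape per run (default 10, advised 500-1500)
  limit: 1000
};
```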
Launch the main script with `node index.js` and wait for the process to complete.
Sometimes emails are not scraped correctly due to time-out errors. To check your database, change the number of emails to be scraped in the `settings.js` file to `10` and run the script again with `node index.js`. The script will run through all the scraped emails and fetch any missing ones. If you get the error `TypeError: Cannot read property 'split' of undefined`, this means there are no more emails to be scraped.
The script saves the contents of the emails as separate `.txt` files in the `contents` folder, split into subfolders of 1,000 emails each, from `f-0` to `f-30`. These folders are created automatically.
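As an illustration, the resulting layout of the `contents` folder looks roughly like this (the names of the individual `.txt` files depend on how the scraper names them and are not shown):

```
contents/
├── f-0/     # first 1,000 emails, one .txt file per email
├── f-1/     # next 1,000 emails
├── ...
└── f-30/
```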
The metadata of the emails is saved in `export.csv` and `export.json`, which contain exactly the same data (an illustrative record is shown after the list):
- `url` (The Wikileaks URL)
- `src` (The location of the `.txt` file)
- `data` (empty)
- `From` (The sender of the email)
- `To` (The receiver of the email)
- `Subject` (The email title)
- `Date` (The date and time of sending the email)
- `contents` (empty)
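To make this concrete, a single record in `export.json` would look roughly like the following; all values are invented placeholders, and how the empty `data` and `contents` fields are represented may differ:

```json
{
  "url": "https://wikileaks.org/clinton-emails/emailid/1",
  "src": "contents/f-0/1.txt",
  "data": "",
  "From": "sender@example.com",
  "To": "recipient@example.com",
  "Subject": "Example subject line",
  "Date": "2012-01-01 09:00",
  "contents": ""
}
```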
After scraping, we can perform text normalisation of the email texts (the separate `.txt` files). The `stemmer.js` script performs the following steps (a sketch of these steps is shown after the list):
- Tokenising of all the words
- Lowercasing of all words
- Removal of numeric values
- Stemming of the words using UEA-lite [2]
- Removal of stopwords [3]
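The following is a minimal sketch of these five steps, assuming an ES-module setup and the `talisman` package's UEA-lite stemmer (see note [2]); the tokenisation, stopword list, and file handling are illustrative only and the actual `stemmer.js` may differ:

```javascript
// Hypothetical sketch of the five normalisation steps; not the repository's stemmer.js.
import fs from 'fs';
import ueaLite from 'talisman/stemmers/uea-lite'; // UEA-lite stemmer from the Talisman library

// Illustrative stopword list only; the repository uses the RANKS NL default English list.
const stopwords = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'for', 'on']);

function normalise(text) {
  return text
    .split(/\W+/)                           // 1. tokenise the words
    .filter(Boolean)
    .map(word => word.toLowerCase())        // 2. lowercase
    .filter(word => !/\d/.test(word))       // 3. remove numeric values
    .map(word => ueaLite(word))             // 4. stem with UEA-lite
    .filter(word => !stopwords.has(word));  // 5. remove stopwords
}

// Example usage on one scraped email (the path is hypothetical):
console.log(normalise(fs.readFileSync('contents/f-0/1.txt', 'utf8')).join(' '));
```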
To launch the stemming script, change into the `A-Republic-of-Emails` directory (if you are not there already) with `cd A-Republic-of-Emails` and run `stemmer.js`: `node stemmer.js`
The normalised emails are saved as separate `.txt` files in a subfolder of the folder containing the emails. Thus all emails of the `/contents/f-0/` folder are normalised and saved into `/contents/f-0/stems/`, and so on.
- If you do not want to perform step 2 (lowercasing of all words), comment out line 29 in `stemmer.js` and uncomment line 30.
- If you do not want to perform step 5 (removal of stopwords), comment out line 36 in `stemmer.js`.
Another step that can be taken after scraping is Named Entity Recognition (NER), which lets us create a list of the people, places, and organizations mentioned in the emails.
To perform Named Entity Recognition, the Stanford NER toolkit needs to be installed on your computer; see http://nlp.stanford.edu/software/CRF-NER.shtml#Download. Before continuing, please test Stanford NER using the README file in the extracted folder.
To load the NER classifier, drag the `ner-server.sh` file into the Stanford NER folder [4]. After doing so, change into the Stanford NER folder. For example, if you have called this folder `stanfordNER`, then run `cd stanfordNER` followed by `./ner-server.sh`.
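For reference, a script of this kind typically starts the Stanford NER socket server with the 7-class classifier. The sketch below assumes the default jar and classifier file names shipped with the Stanford NER download; the actual `ner-server.sh` in this repository may use different options or another port:

```bash
#!/bin/bash
# Hypothetical sketch of an NER server start-up script; the repository's
# ner-server.sh may differ.
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
  -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz \
  -port 9191 -outputFormat inlineXML
```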
Open a new terminal window and change into the `A-Republic-of-Emails` directory there with `cd A-Republic-of-Emails`, then run `stanfordNER.js`: `node stanfordNER.js`
The `stanfordNER.js` script runs the 7-class model: Location, Person, Organization, Money, Percent, Date, Time. However, we export only the first three types (Location, Person, Organization) to a CSV file. To do so, run `tags.js`: `node tags.js`
First, the `stanfordNER.js` script stores all named entities (NEs) per email as separate `.json` files in a subfolder of the folder containing the emails. Thus the named entities of all emails in the `/contents/f-0` folder are saved into `/contents/f-0/NER`, and so on.
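The exact layout of these `.json` files is determined by `stanfordNER.js`; purely as an illustration of the kind of information they hold, a file might group the recognised entities by type, with invented placeholder values such as:

```json
{
  "LOCATION": ["Example City"],
  "PERSON": ["Jane Doe"],
  "ORGANIZATION": ["Example Organization"],
  "DATE": ["1 January 2012"]
}
```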
Second, the `tags.js` script loads these `.json` files and stores the Location, Person, and Organization NE types in `export.ner.csv`, which contains the following columns:
- `url` (The Wikileaks URL)
- `src` (The location of the `.txt` file)
- `data` (empty)
- `From` (The sender of the email)
- `To` (The receiver of the email)
- `Subject` (The email title)
- `Date` (The date and time of sending the email)
- `locations` (Locations mentioned in the email)
- `people` (People mentioned in the email)
- `organizations` (Organizations mentioned in the email)
All named entities within a column are pipe-separated (`|`).
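As an illustration, a single row of `export.ner.csv` combines the metadata columns with pipe-separated entity lists, along the lines of the sketch below; all values are invented placeholders:

```csv
url,src,data,From,To,Subject,Date,locations,people,organizations
https://wikileaks.org/clinton-emails/emailid/1,contents/f-0/1.txt,,sender@example.com,recipient@example.com,Example subject line,2012-01-01 09:00,Example City|Example Country,Jane Doe|John Doe,Example Organization
```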
1. The scraper is gratefully based on the sandcrawler.js library created by Guillaume Plique from Sciences Po's médialab.
2. The stemming script uses the UEA-lite stemmer and is based on the Talisman library created by Guillaume Plique from Sciences Po's médialab.
3. The list of stopwords is taken from the default English stopwords list provided by RANKS NL.
4. `ner-server.sh` is an adaptation of the original `ner-server.sh` by Nikhil Srivastava.