Local Extra Data
Several parts of the script require additional local data. This is actual data (i.e. not configuration) that is either used for decision making or is needed because it cannot be obtained from external APIs.
These data files may need small additions for the script to work properly for all papers. If this page does not resolve your issue, click here to see what to do when data is missing from online abstract libraries (e.g. publication date).
Affiliation names need to be "cleaned" so that Wikidata queries can be performed. The cleaning is done via a manually built lookup table that maps variations of institution names to the canonical name. The canonical name used here is the name used in Wikidata. Email addresses are also mapped to the canonical name. This means that, for example,
University of Bristol, Bristol University, Some department at Bristol University, [email protected], [email protected] all get mapped to University of Bristol.
The cleaning code is in /source/clean/clean.py. Clean institution names need to be added manually. The cleaning file is /config/institute_cleaning.csv. It has one section for email addresses and one for institution names. The clean names are in the right-hand column.
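The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in /source/clean/clean.py: the two-column row layout, the sample rows, matching emails on their domain, the substring fallback, and the names load_lookup and clean_affiliation are all assumptions made for the example.

```python
import csv
import io

# Hypothetical excerpt. The real layout of institute_cleaning.csv may differ;
# the only detail taken from the page is that the clean name is in the
# right-hand column.
SAMPLE_CSV = """\
Bristol University,University of Bristol
University of Bristol,University of Bristol
bristol.ac.uk,University of Bristol
"""


def load_lookup(csv_text):
    """Build a dict mapping lower-cased variants to canonical names."""
    lookup = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        variant, canonical = row[0].strip(), row[-1].strip()
        lookup[variant.lower()] = canonical
    return lookup


def clean_affiliation(raw, lookup):
    """Return the canonical name for a raw affiliation or email, else None."""
    text = raw.strip().lower()
    if "@" in text:
        # Assumption: emails are matched on the part after the "@".
        return lookup.get(text.rsplit("@", 1)[1])
    if text in lookup:
        return lookup[text]
    # Assumption: fall back to the longest known variant found in the text,
    # so that "Some department at Bristol University" still resolves.
    matches = [variant for variant in lookup if variant in text]
    if matches:
        return lookup[max(matches, key=len)]
    return None


lookup = load_lookup(SAMPLE_CSV)
```

With this sketch, clean_affiliation("Some department at Bristol University", lookup) resolves to the canonical name, while an affiliation with no matching row returns None, which is the signal that a new row needs to be added to the cleaning file.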
From time to time universities change their names on Wikidata. This most often happens when someone edits an entry to remove the legal name, e.g. legally it is University of Newcastle, but everyone (including the university itself) calls it Newcastle University. The same is true for Durham and others. This means that the lookup may unexpectedly get worse coverage, but that is the nature of external data sources.
A problem with Wikidata is that newly added items are moderated and can be deleted. This happened a lot during development, and no reason was given for the deletions. It is especially common for old institutes that no longer exist, because they are difficult to reference. Wikidata is the preferred source because it keeps the data up to date and easy to correct, but if adding an institute to Wikidata just isn't possible, its coordinates can be added to the institute_coordinates.csv file in the config folder. (Wikidata items override the CSV file.)
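The precedence rule above (Wikidata first, local CSV as fallback) could look something like the sketch below. The function name resolve_coordinates, the dict-based inputs, and the placeholder institute names and coordinates are hypothetical and only illustrate the ordering; they are not taken from the script.

```python
def resolve_coordinates(institute, wikidata_coords, csv_coords):
    """Prefer coordinates from Wikidata; fall back to the local CSV entry.

    Both arguments are assumed to be dicts mapping a canonical institute
    name to a (latitude, longitude) tuple. Returns None if neither has it.
    """
    if institute in wikidata_coords:
        return wikidata_coords[institute]
    return csv_coords.get(institute)


# Placeholder data for illustration only, not real coordinates.
wikidata_coords = {"Institute A": (1.0, 2.0)}
csv_coords = {"Institute A": (9.0, 9.0), "Institute B": (3.0, 4.0)}

print(resolve_coordinates("Institute A", wikidata_coords, csv_coords))
print(resolve_coordinates("Institute B", wikidata_coords, csv_coords))
```

Here "Institute A" takes its coordinates from the Wikidata dict even though the CSV dict also has an entry, while "Institute B" is only found in the CSV data.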
Stop words are used in the abstract word analysis to filter out irrelevant words such as "and" and "the". The stop words file path is /config/stopwords. To add a word, type it on a new line. Comments can be created using the | symbol.
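As a rough illustration of the file format described above, the following sketch parses stop words from text with one word per line and | starting a comment. It assumes that everything from | to the end of the line is ignored and that matching is case-insensitive; the page does not confirm either, and the names parse_stopwords and filter_words are hypothetical.

```python
# Hypothetical file contents, in the format described above.
SAMPLE_STOPWORDS = """\
| common English words
and
the
of | an inline comment (assumed to be allowed)
"""


def parse_stopwords(text):
    """Return the set of stop words, ignoring comments and blank lines."""
    words = set()
    for line in text.splitlines():
        # Keep only the part before the first "|" (the comment marker).
        word = line.split("|", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def filter_words(words, stopwords):
    """Drop any word whose lower-cased form is a stop word."""
    return [w for w in words if w.lower() not in stopwords]


stopwords = parse_stopwords(SAMPLE_STOPWORDS)
kept = filter_words(["The", "mass", "of", "galaxies", "and", "stars"], stopwords)
```

In this sketch kept contains only the words that are not listed in the sample stop words text.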