
Institution Location

OllyButters edited this page Jun 5, 2020 · 4 revisions

There are 3 steps to getting an institution location.

  1. Institution Cleaning - Get a clean institution name using the first author's email address or affiliation.
  2. Wikidata Lookup - Run a Wikidata query to look up the coordinate location for the clean institution name. Fall back to the backup coordinates if required.
  3. Google Address Lookup - Use the Google Maps API to look up the approximate address (city and country) of the coordinates.

Institution Cleaning

The cleaning code is in /source/clean/clean.py. Clean institution names need to be added manually to the cleaning file, /config/institute_cleaning.csv. The file has one section for email addresses and one for institution names. The clean names are in the right-hand column.
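As a rough illustration of the lookup described above, here is a minimal Python sketch. The sample rows, the two-column layout, the function names `load_cleaning_map` and `clean_institution`, and the exact-match rule on the email domain are all assumptions made for the example. The real matching rules live in clean.py and may differ.

```python
import csv
import io

# Hypothetical rows: raw value on the left, clean name in the right-hand column.
SAMPLE_CSV = """\
example.ac.uk,University of Example
Univ. of Example,University of Example
"""


def load_cleaning_map(handle):
    """Map each raw value (left column) to its clean name (right-hand column)."""
    mapping = {}
    for row in csv.reader(handle):
        # Skip blank lines and rows without both a raw value and a clean name.
        if len(row) < 2 or not row[0].strip():
            continue
        mapping[row[0].strip().lower()] = row[-1].strip()
    return mapping


def clean_institution(email, affiliation, mapping):
    """Try the email domain first, then the affiliation; return None if unknown."""
    domain = email.rsplit("@", 1)[-1].strip().lower() if email else ""
    if domain in mapping:
        return mapping[domain]
    if affiliation:
        return mapping.get(affiliation.strip().lower())
    return None


mapping = load_cleaning_map(io.StringIO(SAMPLE_CSV))
print(clean_institution("a.b@example.ac.uk", None, mapping))
```

Returning None for an unknown institution makes it easy to list which names still need a row in the cleaning file.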

Wikidata Lookup

The Wikidata lookup uses a SPARQL query to find the institution. The name of the institution on Wikidata must exactly match the clean institution name for it to be found. The SPARQL query returns the Wikidata ID for the institution, which is then used in an API request to get all the data for the item. The property of interest is P625, the coordinate location of the Wikidata item. Sometimes the institution has no coordinate location statement of its own, and the P625 property is instead part of the headquarters (P159) statement.
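The P625 and P159 handling described above can be sketched as follows. This is an illustration only: the function name `extract_coordinates` is hypothetical, and it assumes the item has already been fetched and parsed into the standard Wikidata entity JSON layout (a `claims` dictionary keyed by property ID). The project's own code may be structured differently.

```python
def extract_coordinates(entity):
    """Return (latitude, longitude) from a Wikidata entity dict, or None.

    Looks for a direct coordinate location (P625) statement first, then for
    a P625 qualifier on a headquarters (P159) statement.
    """
    claims = entity.get("claims", {})

    # Case 1: the item has its own coordinate location statement.
    for statement in claims.get("P625", []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return value["latitude"], value["longitude"]

    # Case 2: the coordinates are a qualifier on the headquarters statement.
    for statement in claims.get("P159", []):
        for qualifier in statement.get("qualifiers", {}).get("P625", []):
            value = qualifier.get("datavalue", {}).get("value")
            if value:
                return value["latitude"], value["longitude"]

    return None
```

Checking the direct statement before the headquarters qualifier is a design choice made for this sketch, not something the page specifies.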

New Wikidata items can be added through a link in the left sidebar (New Wikidata Item). Many of the Wikidata items that have been added have since been removed, possibly because admins did not think the descriptions and aliases were good enough, so make sure the description is complete. Wikidata items are made up of statements, e.g. coordinate location. To add a statement, scroll to the bottom of the statements section on the item page, click "add", and type "coordinate location" or "P625" into the property box. If the institution has an existing headquarters statement, it might be better to add the coordinates to that statement. This is also a good idea if the institute is spread out over a city rather than being on one campus. In the big papers data object the coordinates can be accessed using the index ['Extras']['LatLong'].

There can be problems when the name of an institution differs between languages. For example, the Technische Universität München was originally listed on Wikidata under its German name, but in English-language Google results it is referred to as the Technical University of Munich. This is a problem because the query attempts an exact match with the English label. The script should be able to handle special characters, such as ü, but you need to be careful when deciding what name to put into the institute cleaning file.
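To make the exact-match behaviour concrete, here is a hedged sketch of how such a query could be built. The helper `build_label_query` and the query shape are assumptions, since the project's actual SPARQL is not shown on this page. The `rdfs:label` predicate and language-tagged literals are standard SPARQL as supported by the Wikidata Query Service.

```python
def build_label_query(clean_name, lang="en"):
    """Build a SPARQL query for items whose label in `lang` equals clean_name."""
    # Escape backslashes and double quotes so the name is a valid string literal.
    escaped = clean_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item rdfs:label "{escaped}"@{lang} . '
        "} LIMIT 1"
    )


# The English label and the German label are different literals, so a name
# stored only under one language will not match a query for the other.
print(build_label_query("Technical University of Munich"))
print(build_label_query("Technische Universität München", lang="de"))
```

Non-ASCII characters such as ü pass through unchanged, so the name in the cleaning file should be written exactly as the label appears on Wikidata.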

Coordinates Backup

A problem with Wikidata is that newly added items are moderated and can be deleted. This happened many times during development, and no reason was given. It is especially true for old institutes that no longer exist and are therefore difficult to reference. Wikidata is the preferred source because it keeps the data up to date and easy to correct, but if adding an institute to Wikidata just isn't possible, the coordinates can be added to the institute_coordinates.csv file in the config folder. (Wikidata will override the csv file.)
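The precedence rule (Wikidata first, csv as fallback) could look roughly like this. The column order name, latitude, longitude, the sample row, and the helper names `load_backup_coordinates` and `resolve_coordinates` are assumptions for illustration. The real layout of institute_coordinates.csv is not described here.

```python
import csv
import io

# Hypothetical layout: clean name, latitude, longitude.
SAMPLE_BACKUP = """\
Old Example Institute,51.5,-0.12
"""


def load_backup_coordinates(handle):
    """Read backup coordinates into a dict keyed by clean institution name."""
    backup = {}
    for row in csv.reader(handle):
        if len(row) < 3:
            continue
        try:
            backup[row[0].strip()] = (float(row[1]), float(row[2]))
        except ValueError:
            continue  # header row or malformed numbers
    return backup


def resolve_coordinates(clean_name, wikidata_coords, backup):
    """Prefer the Wikidata result; use the csv entry only when it is missing."""
    if wikidata_coords is not None:
        return wikidata_coords
    return backup.get(clean_name)


backup = load_backup_coordinates(io.StringIO(SAMPLE_BACKUP))
print(resolve_coordinates("Old Example Institute", None, backup))
```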

Google Address Lookup

The coordinates need to be converted to cities and countries so they can be plotted on the different maps on the HTML pages. The Google Maps API takes in the coordinates and returns a breakdown of the address. The country and city are set in the big data object at the indexes ['Extras']['country_code'] and ['Extras']['postal_town'].
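As a sketch of how the address breakdown might be read, the function below walks the `address_components` of an already-parsed reverse geocoding response. The function name `extract_country_and_town` is hypothetical, and using `short_name` for the country code and `long_name` for the town is an assumption about what the project stores.

```python
def extract_country_and_town(geocode_response):
    """Pick the country code and postal town out of a geocoding response dict."""
    found = {"country_code": None, "postal_town": None}
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            types = component.get("types", [])
            # Keep the first match of each kind and ignore later results.
            if "country" in types and found["country_code"] is None:
                found["country_code"] = component.get("short_name")
            if "postal_town" in types and found["postal_town"] is None:
                found["postal_town"] = component.get("long_name")
    return found


# Hand-written example response, trimmed to the fields used above.
sample = {
    "results": [
        {
            "address_components": [
                {"long_name": "Bristol", "short_name": "Bristol", "types": ["postal_town"]},
                {"long_name": "United Kingdom", "short_name": "GB", "types": ["country", "political"]},
            ]
        }
    ]
}
paper = {"Extras": {}}
paper["Extras"].update(extract_country_and_town(sample))
print(paper["Extras"])
```

Not every address includes a `postal_town` component, so in this sketch the value stays None when it is absent.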