
Data Cleansing and Model


Documentation

Our main objective is to make the bibliographic data of PIK accessible. This group is therefore responsible for the quality and structure of the data, so that others can access it more easily. First of all, it is necessary to understand what bibliographic data is and how it is structured: it is a data format used to describe books and other media. It determines which information about a medium should appear in a title record, such as the name of the author, the title, the year of publication, and more; the order and the data types can be specified. There are different vocabularies for describing bibliographic resources. In order to visualize the data well later, some questions concerning structure and quality must be clarified beforehand.

Quality of the dataset

First, we looked at the completeness and consistency of the data. To check completeness, we wrote a small script that counts the empty cells within each column. To judge whether these are many or few, we additionally calculated the percentage of empty cells per column.
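A minimal sketch of such a script, assuming the data set is available as a CSV file (the file name pik_bibliography.csv is an assumption on our part):

import pandas as pd

# Load the PIK bibliography export; the file name is an assumption.
df = pd.read_csv("pik_bibliography.csv")

# Empty cells per column, and the percentage relative to the column length.
empty = df.isna().sum()
percent = (empty / len(df) * 100).round(2)
print(pd.DataFrame({"empty cells": empty, "percent": percent}))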

column name   number of empty cells*   percent of empty cells*
ID            0                        0%
type          7                        0.08%
title         7                        0.08%
authors       7                        0.08%
editors       6615                     79.99%
keywords      4953                     59.89%
publisher     5474                     66.19%
journal       3777                     45.67%
booktitle     6754                     81.67%
startpage     2662                     32.19%
endpage       2927                     35.39%
issue         4808                     58.14%
vol           4391                     53.10%
year          35                       0.42%
place         5348                     64.67%
conference    7885                     95.34%
relation      6930                     83.80%
link          7296                     88.22%
comment       7349                     88.86%
x1            7694                     93.04%
x4            3802                     45.97%

*The number and percentage refer to the respective column.

We calculated this output in order to remove columns containing too many empty cells. We rejected this idea afterwards, because we want to keep all the data for safety's sake: at this stage of the project it is not yet clear whether some columns may become relevant later in the process, and it may also be possible to fill empty cells later.

To check the consistency of the data, one has to look directly into the data set. On closer inspection we noticed that almost all columns contain some inconsistencies, so we first had to decide which columns to start the cleaning process with. We began with the names of the authors: some entries contain the full first and last name, while in others the first name is abbreviated. A uniform structure is important for further work. There are two approaches to this problem: either you abbreviate all first names, or you try to complete the incomplete ones, e.g. via the DOI (Digital Object Identifier) of the corresponding publication. We decided to use the short form to reduce the iteration time in our project. If there is still time at the end, completing the names could be reconsidered, as this would actually be the nicer option.

Next, we looked at the column "year". Entries containing "submitted" or no publication date were removed from the data pool. In addition, the values in this column were reduced to the year, as some entries contained a complete date (dd.mm.yyyy) and others only the year.

In the columns "publisher" and "journal", entries that should be identical but were written differently were brought into a uniform format. To decide which spelling to use, the official spelling was looked up on the internet. For example, one publisher was represented as Oekom, oekom, ökom or Ökom. Additionally, the two columns were checked for inconsistent upper and lower case and unified.
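A sketch of these cleaning steps, again assuming a CSV export, a ';'-separated authors column, and a canonical publisher spelling chosen only for illustration:

import pandas as pd

df = pd.read_csv("pik_bibliography.csv")  # file name is an assumption

def abbreviate(name):
    # Reduce "Lastname, Firstname" to "Lastname, F."; leave other formats alone.
    last, _, first = name.partition(",")
    initials = " ".join(p[0] + "." for p in first.split())
    return f"{last.strip()}, {initials}" if initials else name.strip()

# Abbreviate every author name; ';' as the list separator is an assumption.
df["authors"] = df["authors"].map(
    lambda v: v if pd.isna(v) else "; ".join(abbreviate(a) for a in v.split(";"))
)

# Remove "submitted" entries and missing dates, then keep only the year.
df = df[df["year"].notna() & (df["year"].astype(str) != "submitted")]
df["year"] = df["year"].astype(str).str.extract(r"(\d{4})", expand=False)

# Unify spelling variants of a publisher on one canonical form.
df["publisher"] = df["publisher"].replace(
    {"Ökom": "oekom", "ökom": "oekom", "Oekom": "oekom"}
)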

Our Data Model

In order for the other groups to be able to work with the data, both its quality and its structure must be respected; otherwise it becomes difficult later to transfer the file to the database or to visualize it. Since the data is made available through the GUI and query service of a Wikibase environment, one has to deal with the structure of Wikidata. Wikidata is a database for collecting structured data. It is free and can be read and edited by both humans and machines. It serves as central storage for the data of all Wikimedia projects, e.g. Wikipedia, Wikisource, etc. It consists of items, properties, qualifiers, references and ranks. Items and properties are uniquely identified by Uniform Resource Identifiers.

While we tried to stick to the Wikidata data model when importing our data, we decided to use our own simplified model, mainly because our data is incomplete and partly incompatible with the Wikidata model, and to ease access to and visualisation of the data.

Properties

Property Data Types

  • string: accepts a raw string as a value
  • item: accepts a URI for an item in the WikiBase instance
  • multiple_items: accepts multiple URIs for items in the WikiBase instance.

Here is a list with our chosen properties and the corresponding data types that each property accepts:

name                   data type
instance of            multiple_items
author                 multiple_items
editor                 multiple_items
keyword                multiple_items
publisher              item
place of publication   item
publication type       item
journal                item
conference             item
series                 item
publication date       string
DOI                    string
issue                  string
volume                 string
pages                  string
title                  string
reference URL          string
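To illustrate how these data types are used, a single work could be represented as follows; all values and URIs here are invented for illustration:

# A hypothetical work record: item-valued properties hold URIs of items
# in the local WikiBase instance, string-valued properties hold raw text.
work = {
    "instance of": ["http://wikibase.local/entity/Q12"],   # multiple_items
    "author": ["http://wikibase.local/entity/Q101",
               "http://wikibase.local/entity/Q102"],       # multiple_items
    "publisher": "http://wikibase.local/entity/Q55",       # item
    "journal": "http://wikibase.local/entity/Q60",         # item
    "publication date": "1998",                            # string
    "DOI": "10.1000/xyz123",                               # string
    "volume": "12",                                        # string
    "issue": "3",                                          # string
    "pages": "101-115",                                    # string
    "title": "An invented example title",                  # string
}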
For reference, the corresponding Wikidata labels and IDs are:

Label                  Wikidata ID
Language               no number
Label                  no number
Description            no number
Also known as          no number
instance of            P31
author                 P50
publication date       P577
published in           P1433
volume                 P478
issue                  P433
page(s)                P304
DOI                    P356
publisher              P123
editor                 P98
place of publication   P291
title                  P357 or P1476
collection             P195
edition number         P393
official website       P856
book                   Q571
doctoral thesis        P1026
software               Q7397
scientific journal     Q5633421
diploma                Q217577
data publication       Q17051824
lecture                Q603773
newspaper              Q11032
report                 Q10870555
publication            Q732577
scholarly article      Q13442814
creator                Q2500638
subclass of            P279

Base types

There are 3 base types in our model; all of our items inherit from these directly or indirectly:

  • Work
    describes an item in our data
  • Publication Type
    describes the type of the work
  • Creator
    a base class for Authors and Editors.

The following diagram describes the properties on each Work instance:

[Diagram: Work]

The following diagram describes our type hierarchy:

[Diagram: Types]

Our Creator diagram is fairly simple:

[Diagram: Creators]

We also have the following additional base types:

  • Publisher
  • Place of Publication
  • Journal
  • Conference
  • Series

SPARQL query examples

Note: since we are describing the data and the model that we have, and not the data in the WikiBase instance, we are going to use the following idioms:

  • P<instance of>: stands for the ID of the property "instance of" in the local WikiBase instance.
  • Q<creator: Linneweber, V.>: stands for the ID of the item corresponding to the creator named Linneweber, V.
#get all authors
SELECT ?author ?authorLabel WHERE {
  ?author P<instance of> Q<base: Author>.
}

#get all works by the author: Linneweber, V.
SELECT ?work ?workLabel WHERE {
  ?work P<author> Q<creator: Linneweber, V.>.
}

#get all authors who are also editors
SELECT ?creatorLabel WHERE {
  ?creator P<instance of> Q<base: Author>.
  ?creator P<instance of> Q<base: Editor>.
}

#get all works published in Berlin
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Berlin>.
}

#get all works that contain the keyword: air pollution
SELECT ?workLabel WHERE {
  ?work P<keyword> Q<keyword: air pollution>.
}

#get all lectures in Potsdam by M. Stock
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Potsdam>.
  ?work P<instance of> Q<base: lecture>.
  ?work P<author> Q<creator: Stock, M.>.
}
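Once the concrete P- and Q-ids exist in the WikiBase instance, such a query can also be run programmatically. A minimal sketch using the SPARQLWrapper library; the endpoint URL, the prefix URIs and the ids below are placeholders for whatever the local instance assigns:

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint and ids; substitute the values of the local instance.
endpoint = SPARQLWrapper("http://localhost:8989/sparql")
endpoint.setQuery("""
    PREFIX wdt: <http://wikibase.local/prop/direct/>
    PREFIX wd:  <http://wikibase.local/entity/>
    # all works by the author: Linneweber, V. (ids are placeholders)
    SELECT ?work WHERE { ?work wdt:P50 wd:Q101 . }
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["work"]["value"])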

Revised data model

After working with our preliminary data model for some time, we noticed several issues and discarded it, because our long-term goal is to get the data onto Wikidata, and our first data model did not match Wikidata's data model at all. We have also made great progress in cleaning up the data set, which has changed it in some places. This allows us to handle the data better and use it differently than previously thought.

Our revised data model looks like this:

The graphic shows how the data types in Wikidata are connected and which subclasses exist for each class. We can now use these classes as orientation and edit our data set along the lines of the graphic so that our data fits the Wikidata format.

To get a better overview of our data, we first generated counts of how many values are currently present in each column. We then researched the Wikidata data model to find properties corresponding to our column names and publication types.
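These counts can be produced with one line of pandas, mirroring the frequency column of the table below (the file name is again an assumption):

import pandas as pd

df = pd.read_csv("pik_bibliography.csv")  # file name is an assumption
# Non-empty values per column, sorted in descending order.
print(df.notna().sum().sort_values(ascending=False))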

Our initial overview is shown here in a table:

Frequency   Column in PIK data set    Wikidata property
8261        title                     title P1476
8258        keywords                  -
8235        year                      publication date P577
8080        authors                   author P50
7796        publisher                 publisher P123
6299        startpage                 number of pages P1104
6034        endpage                   number of pages P1104
4493        journal                   academic journal Q737498
4468        x4 (= DOI / identifier)   DOI P356
3879        vol                       volume P478
3462        issue                     issue P433
2922        place                     place of publication P291
1656        editors                   editor P98
1516        booktitle                 -
1340        relation (= series)       part of the series P179
974         link                      -
921         comment                   -
385         conference                -

Now that we knew exactly which columns in the PIK data set still lacked a Wikidata counterpart, we searched specifically for the missing ones and completed the mapping further. We have thus created a list with the Wikidata property for each column in the PIK data set and for each type, which we can use from now on.
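Such a mapping can be kept in one place for the import scripts; a sketch built from the overview table above (the dictionary name is ours; columns without a Wikidata counterpart map to None):

# PIK columns -> Wikidata properties/items, taken from the table above.
COLUMN_TO_WIKIDATA = {
    "title": "P1476",
    "year": "P577",         # publication date
    "authors": "P50",
    "publisher": "P123",
    "startpage": "P1104",   # number of pages
    "endpage": "P1104",
    "journal": "Q737498",   # academic journal (an item, not a property)
    "x4": "P356",           # DOI
    "vol": "P478",
    "issue": "P433",
    "place": "P291",
    "editors": "P98",
    "relation": "P179",     # part of the series
    "keywords": None, "booktitle": None, "link": None,
    "comment": None, "conference": None,
}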

Publications

A publication is an instance of (P31) one of the following items:

Authors & Editors

Others