
Data Cleansing and Model


Documentation

Our main objective is to make the bibliographic data of PIK accessible. This group is therefore responsible for the quality and structure of the data, so that others can access it more easily. First of all, it is necessary to understand what bibliographic data is and how it is structured: it is a data format used to describe books and other media. It determines which information about a medium should appear in a title record, such as the name of the author, the title, the year of publication, and more; the order and the data types can be specified. There are different vocabularies for describing bibliographic resources. In order to visualize the data well later, some questions concerning structure and quality must be clarified beforehand.

Quality of the dataset

First, we looked at the completeness and consistency of the data. To check completeness, we wrote a small script that counts the empty cells within each column. To judge whether these are many or few, we additionally calculated the percentage of empty cells per column.
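A minimal sketch of such a script, assuming the data set is available as a CSV file (the file name pik_bibliography.csv is an assumption on our part):

import pandas as pd

# Load the PIK bibliography export; the file name is an assumption.
df = pd.read_csv("pik_bibliography.csv")

# Empty cells per column, and the percentage relative to the column length.
empty = df.isna().sum()
percent = (empty / len(df) * 100).round(2)
print(pd.DataFrame({"empty cells": empty, "percent": percent}))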

column name   number of empty cells*   percent of empty cells*
ID            0                        0%
type          7                        0.08%
title         7                        0.08%
authors       7                        0.08%
editors       6615                     79.99%
keywords      4953                     59.89%
publisher     5474                     66.19%
journal       3777                     45.67%
booktitle     6754                     81.67%
startpage     2662                     32.19%
endpage       2927                     35.39%
issue         4808                     58.14%
vol           4391                     53.10%
year          35                       0.42%
place         5348                     64.67%
conference    7885                     95.34%
relation      6930                     83.80%
link          7296                     88.22%
comment       7349                     88.86%
x1            7694                     93.04%
x4            3802                     45.97%

*The number and percentage refer to the respective column.

We calculated this output in order to remove columns containing too many empty cells. We rejected this idea afterwards, because we want to keep all the data for safety's sake: at this stage of the project it is not yet clear whether some columns may become relevant later in the process, and it may also be possible to fill empty cells later.

To check the consistency of the data, one has to look directly into the data set. On closer inspection we noticed that almost all columns contain some inconsistencies, so we first had to decide which columns to start the cleaning process with. We began with the names of the authors: some entries contain the full first and last name, while in others the first name is abbreviated. A uniform structure is important for further work. There are two approaches to this problem: either you abbreviate all first names, or you try to complete the incomplete ones, e.g. via the DOI (Digital Object Identifier) of the corresponding publication. We decided to use the short form to reduce the iteration time in our project. If there is still time at the end, completing the names could be reconsidered, as this would actually be the nicer option.

Next, we looked at the column "year". Entries containing "submitted" or no publication date were removed from the data pool. In addition, the values in this column were reduced to the year, as some entries contained a complete date (dd.mm.yyyy) and others only the year.

In the columns "publisher" and "journal", entries that should be identical but were written differently were brought into a uniform format. To decide which spelling to use, the official spelling was looked up on the internet. For example, one publisher was represented as Oekom, oekom, ökom or Ökom. Additionally, the two columns were checked for inconsistent upper and lower case and unified.
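A sketch of these cleaning steps, again assuming a CSV export, a ';'-separated authors column, and a canonical publisher spelling chosen only for illustration:

import pandas as pd

df = pd.read_csv("pik_bibliography.csv")  # file name is an assumption

def abbreviate(name):
    # Reduce "Lastname, Firstname" to "Lastname, F."; leave other formats alone.
    last, _, first = name.partition(",")
    initials = " ".join(p[0] + "." for p in first.split())
    return f"{last.strip()}, {initials}" if initials else name.strip()

# Abbreviate every author name; ';' as the list separator is an assumption.
df["authors"] = df["authors"].map(
    lambda v: v if pd.isna(v) else "; ".join(abbreviate(a) for a in v.split(";"))
)

# Remove "submitted" entries and missing dates, then keep only the year.
df = df[df["year"].notna() & (df["year"].astype(str) != "submitted")]
df["year"] = df["year"].astype(str).str.extract(r"(\d{4})", expand=False)

# Unify spelling variants of a publisher on one canonical form.
df["publisher"] = df["publisher"].replace(
    {"Ökom": "oekom", "ökom": "oekom", "Oekom": "oekom"}
)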

Our Data Model

In order for the other groups to be able to work with the data, both its quality and its structure must be respected; otherwise it becomes difficult later to transfer the file to the database or to visualize it. Since the data is made available through the GUI and query service of a Wikibase environment, one has to deal with the structure of Wikidata. Wikidata is a database for collecting structured data. It is free and can be read and edited by both humans and machines. It serves as central storage for the data of all Wikimedia projects, e.g. Wikipedia, Wikisource, etc. It consists of items, properties, qualifiers, references and ranks. Items and properties are uniquely identified by Uniform Resource Identifiers.

While we tried to stick to the Wikidata data model when importing our data, we decided to use our own simplified model, mainly because our data is incomplete and partly incompatible with the Wikidata model, and to ease access to and visualisation of the data.

Properties

Property Data Types

  • string: accepts a raw string as a value
  • item: accepts a URI for an item in the WikiBase instance
  • multiple_items: accepts multiple URIs for items in the WikiBase instance.

Here is a list with our chosen properties and the corresponding data types that each property accepts:

name                   data type
instance of            multiple_items
author                 multiple_items
editor                 multiple_items
keyword                multiple_items
publisher              item
place of publication   item
publication type       item
journal                item
conference             item
series                 item
publication date       string
DOI                    string
issue                  string
volume                 string
pages                  string
title                  string
reference URL          string
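To illustrate how these data types are used, a single work could be represented as follows; all values and URIs here are invented for illustration:

# A hypothetical work record: item-valued properties hold URIs of items
# in the local WikiBase instance, string-valued properties hold raw text.
work = {
    "instance of": ["http://wikibase.local/entity/Q12"],   # multiple_items
    "author": ["http://wikibase.local/entity/Q101",
               "http://wikibase.local/entity/Q102"],       # multiple_items
    "publisher": "http://wikibase.local/entity/Q55",       # item
    "journal": "http://wikibase.local/entity/Q60",         # item
    "publication date": "1998",                            # string
    "DOI": "10.1000/xyz123",                               # string
    "volume": "12",                                        # string
    "issue": "3",                                          # string
    "pages": "101-115",                                    # string
    "title": "An invented example title",                  # string
}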
For reference, the corresponding Wikidata labels and IDs are:

Label                  Wikidata ID
Language               no number
Label                  no number
Description            no number
Also known as          no number
instance of            P31
author                 P50
publication date       P577
published in           P1433
volume                 P478
issue                  P433
page(s)                P304
DOI                    P356
publisher              P123
editor                 P98
place of publication   P291
title                  P357 or P1476
collection             P195
edition number         P393
official website       P856
book                   Q571
doctoral thesis        P1026
software               Q7397
scientific journal     Q5633421
diploma                Q217577
data publication       Q17051824
lecture                Q603773
newspaper              Q11032
report                 Q10870555
publication            Q732577
scholarly article      Q13442814
creator                Q2500638
subclass of            P279

Base types

There are 3 base types in our model; all of our items inherit from these directly or indirectly:

  • Work
    describes an item in our data
  • Publication Type
    describes the type of the work
  • Creator
    a base class for Authors and Editors.

The following diagram describes the properties on each Work instance:

[Diagram: Work]

The following diagram describes our type hierarchy:

[Diagram: Types]

Our Creator diagram is fairly simple:

[Diagram: Creators]

We also have the following additional base types:

  • Publisher
  • Place of Publication
  • Journal
  • Conference
  • Series

SPARQL query examples

Note: since we are describing the data and the model that we have, and not the data in the WikiBase instance, we are going to use the following idioms:

  • P<instance of>: stands for the ID of the property "instance of" in the local WikiBase instance.
  • Q<creator: Linneweber, V.>: stands for the ID of the item corresponding to the creator named Linneweber, V.
#get all authors
SELECT ?author ?authorLabel WHERE {
  ?author P<instance of> Q<base: Author>.
}

#get all works by the author: Linneweber, V.
SELECT ?work ?workLabel WHERE {
  ?work P<author> Q<creator: Linneweber, V.>.
}

#get all authors who are also editors
SELECT ?creatorLabel WHERE {
  ?creator P<instance of> Q<base: Author>.
  ?creator P<instance of> Q<base: Editor>.
}

#get all works published in Berlin
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Berlin>.
}

#get all works that contain the keyword: air pollution
SELECT ?workLabel WHERE {
  ?work P<keyword> Q<keyword: air pollution>.
}

#get all lectures in Potsdam by M. Stock
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Potsdam>.
  ?work P<instance of> Q<base: lecture>.
  ?work P<author> Q<creator: Stock, M.>.
}
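Once the concrete P- and Q-ids exist in the WikiBase instance, such a query can also be run programmatically. A minimal sketch using the SPARQLWrapper library; the endpoint URL, the prefix URIs and the ids below are placeholders for whatever the local instance assigns:

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint and ids; substitute the values of the local instance.
endpoint = SPARQLWrapper("http://localhost:8989/sparql")
endpoint.setQuery("""
    PREFIX wdt: <http://wikibase.local/prop/direct/>
    PREFIX wd:  <http://wikibase.local/entity/>
    # all works by the author: Linneweber, V. (ids are placeholders)
    SELECT ?work WHERE { ?work wdt:P50 wd:Q101 . }
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["work"]["value"])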

Revised data model

After working with our preliminary data model for some time, we noticed several issues and discarded it, because our long-term goal is to get the data onto Wikidata, and our first data model did not match Wikidata's data model at all. We have also made great progress in cleaning up the data set, which has changed it in some places. This allows us to handle the data better and use it differently than previously thought.

Our revised data model looks like this:

The graphic shows how the data types in Wikidata are connected and which subclasses exist for each class. We can now use these classes as orientation and edit our data set along the lines of the graphic so that our data fits the Wikidata format.

To get a better overview of our data, we first generated counts of how many values are currently present in each column. We then researched the Wikidata data model to find properties corresponding to our column names and publication types.
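These counts can be produced with one line of pandas, mirroring the frequency column of the table below (the file name is again an assumption):

import pandas as pd

df = pd.read_csv("pik_bibliography.csv")  # file name is an assumption
# Non-empty values per column, sorted in descending order.
print(df.notna().sum().sort_values(ascending=False))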

Our initial overview is shown here in a table:

Frequency   Column in PIK data set    Wikidata property
8261        title                     title P1476
8258        keywords                  -
8235        year                      publication date P577
8080        authors                   author P50
7796        publisher                 publisher P123
6299        startpage                 number of pages P1104
6034        endpage                   number of pages P1104
4493        journal                   academic journal Q737498
4468        x4 (= DOI / identifier)   DOI P356
3879        vol                       volume P478
3462        issue                     issue P433
2922        place                     place of publication P291
1656        editors                   editor P98
1516        booktitle                 -
1340        relation (= series)       part of the series P179
974         link                      -
921         comment                   -
385         conference                -

Now that we knew exactly which columns in the PIK data set still lacked a Wikidata counterpart, we searched specifically for the missing ones and completed the mapping further. We have thus created a list with the Wikidata property for each column in the PIK data set and for each type, which we can use from now on.
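Such a mapping can be kept in one place for the import scripts; a sketch built from the overview table above (the dictionary name is ours; columns without a Wikidata counterpart map to None):

# PIK columns -> Wikidata properties/items, taken from the table above.
COLUMN_TO_WIKIDATA = {
    "title": "P1476",
    "year": "P577",         # publication date
    "authors": "P50",
    "publisher": "P123",
    "startpage": "P1104",   # number of pages
    "endpage": "P1104",
    "journal": "Q737498",   # academic journal (an item, not a property)
    "x4": "P356",           # DOI
    "vol": "P478",
    "issue": "P433",
    "place": "P291",
    "editors": "P98",
    "relation": "P179",     # part of the series
    "keywords": None, "booktitle": None, "link": None,
    "comment": None, "conference": None,
}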

Publications

A publication is an instance of (P31) one of the following items:

Authors & Editors

Others