Data Cleansing and Model
Our main objective is to make the bibliographic data of PIK accessible. This group is therefore responsible for the quality and structure of the data, so that others can access it more easily. First of all, it is necessary to understand what bibliographic data is and how it is structured. Bibliographic data follows a format used to describe books and other media. The format determines which information about a medium appears in a bibliographic record, such as the name of the author, the title, the year of publication, and more. The order of the fields and their data types can be specified, and there are different vocabularies for describing bibliographic resources. To visualize the data well later on, some questions concerning structure and quality must be clarified beforehand.
First, we looked at the completeness and consistency of the data. To check completeness, we wrote a small script that counts the empty cells within each column. To judge whether a column has many or few empty cells, we additionally calculated the percentage of empty cells per column.
column name | number of empty cells* | percent of empty cells* |
---|---|---|
ID | 0 | 0% |
type | 7 | 0.08% |
title | 7 | 0.08% |
authors | 7 | 0.08% |
editors | 6615 | 79.99% |
keywords | 4953 | 59.89% |
publisher | 5474 | 66.19% |
journal | 3777 | 45.67% |
booktitle | 6754 | 81.67% |
startpage | 2662 | 32.19% |
endpage | 2927 | 35.39% |
issue | 4808 | 58.14% |
vol | 4391 | 53.10% |
year | 35 | 0.42% |
place | 5348 | 64.67% |
conference | 7885 | 95.34% |
relation | 6930 | 83.80% |
link | 7296 | 88.22% |
comment | 7349 | 88.86% |
x1 | 7694 | 93.04% |
x4 | 3802 | 45.97% |

*The number and percentage refer to the respective column.
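
The completeness check can be reproduced with a few lines of Python. The following is only a minimal sketch and not the script we actually used: the function name `empty_cell_stats` and the tiny sample data are made up for illustration, and a cell is assumed to be empty if it is missing or contains only whitespace.

```python
import csv
import io


def empty_cell_stats(rows, columns):
    """Return {column: (empty_count, empty_percent)} for a list of dict rows."""
    total = len(rows)
    stats = {}
    for column in columns:
        # A cell counts as empty if it is missing, None, or whitespace only.
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        percent = round(100 * empty / total, 2) if total else 0.0
        stats[column] = (empty, percent)
    return stats


# Made-up sample standing in for the real export of the data set.
SAMPLE_CSV = """ID,title,editors
1,Climate impacts,
2,,Smith J.
3,Water report,
4,Energy study,
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
stats = empty_cell_stats(rows, reader.fieldnames)
for column, (empty, percent) in stats.items():
    print(f"{column} | {empty} | {percent}%")
```

For the real data, `SAMPLE_CSV` would be replaced by the exported file opened with `open(...)`.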
We originally calculated these figures in order to remove columns that contain too many empty cells. We later rejected this idea: to be on the safe side we want to keep all the data, because at this stage of the project it is not yet clear whether some columns will become interesting later on. In addition, it may be possible to fill empty cells later.

To check the consistency of the data, we had to look directly into the data set. On closer inspection, we noticed that almost all columns contain some inconsistencies, so we first had to decide which columns to clean first.

We started with the names of the authors, because some entries contain the full first and last name while others abbreviate the first name. A uniform structure is important for further work. There are two approaches to this problem: either abbreviate all first names, or try to complete the abbreviated ones, for example via the DOI (Digital Object Identifier) of the corresponding publication. We decided to use the short form to reduce the iteration time in our project. If there is time left at the end, completing the names could be reconsidered, as this would be the nicer option.

Next, we looked at the column "year". Entries containing "submitted" or no publication date were removed from the data pool. In addition, the values in the column were reduced to the year, as some entries contained a full date (dd.mm.yyyy) and others only the year.

In the columns "publisher" and "journal", entries that should be identical but were written differently were brought into a uniform format. To decide which spelling to use, we looked up the official spelling on the internet. For example, Ökom appeared as Oekom, oekom, ökom and Ökom. The two columns were also checked for inconsistent upper and lower case and unified.
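
The three cleaning steps described above (abbreviating first names, reducing dates to the year, unifying publisher spellings) could look roughly like the following sketch. It is an illustration under assumptions, not our actual cleaning code: all function names are hypothetical, author names are assumed to be stored as "Last, First", and the canonical spelling in `PUBLISHER_SPELLINGS` is only a placeholder, since the spelling we finally chose is not recorded here.

```python
import re


def _initials(token):
    # 'Hans-Joachim' -> 'H.-J.', 'J.P.' -> 'J. P.', 'V.' -> 'V.'
    hyphen_parts = []
    for chunk in token.split("-"):
        letters = [piece[0].upper() + "." for piece in chunk.split(".") if piece]
        if letters:
            hyphen_parts.append(" ".join(letters))
    return "-".join(hyphen_parts)


def abbreviate_author(name):
    """Turn 'Last, First Middle' into 'Last, F. M.'; short forms stay short."""
    last, _, first = name.partition(",")
    initials = [i for i in (_initials(t) for t in first.split()) if i]
    if not initials:
        return last.strip()
    return f"{last.strip()}, {' '.join(initials)}"


def normalize_year(value):
    """Return the four-digit year as a string, or None if there is none."""
    match = re.search(r"\b(\d{4})\b", value or "")
    return match.group(1) if match else None


# Hypothetical lookup: case-folded variant -> spelling to use everywhere.
# The target spelling is a placeholder, not a statement about the official one.
PUBLISHER_SPELLINGS = {"oekom": "oekom", "ökom": "oekom"}


def normalize_publisher(name, spellings=PUBLISHER_SPELLINGS):
    """Collapse whitespace and map known spelling variants to one form."""
    cleaned = " ".join(name.split())
    return spellings.get(cleaned.casefold(), cleaned)


print(abbreviate_author("Linneweber, Volker"))
print(normalize_year("12.03.2015"))
print(normalize_publisher("Ökom"))
```

Rows for which `normalize_year` returns `None` (for example "submitted") are the ones that would be dropped from the data pool.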
For the other groups to be able to work with the data, both its quality and its structure must be maintained; otherwise it will become difficult to transfer the file to the database or to visualize it. Since the data is made available through the GUI and query service of a Wikibase environment, we have to deal with the structure of Wikidata. Wikidata is a database for collecting structured data. It is free and can be read and edited by both humans and machines. It serves as central storage for the data of all Wikimedia projects, e.g. Wikipedia, Wikisource etc. Its data model consists of items, properties, qualifiers, references and ranks. Items and properties are uniquely identified by Uniform Resource Identifiers.
While we tried to stick to the Wikidata data model when importing our data, we decided to use our own simplified model, mainly because our data is incomplete and incompatible with the Wikidata model, and because a simpler model eases access to and visualisation of the data.
Property Data Types
- `string`: accepts a raw string as a value
- `item`: accepts a URI for an item in the Wikibase instance
- `multiple_items`: accepts multiple URIs for items in the Wikibase instance
Here is a list with our chosen properties and the corresponding data types that each property accepts:
name | data type |
---|---|
instance of | multiple_items |
author | multiple_items |
editor | multiple_items |
keyword | multiple_items |
publisher | item |
place of publication | item |
publication type | item |
journal | item |
conference | item |
series | item |
publication date | string |
DOI | string |
issue | string |
volume | string |
pages | string |
title | string |
reference URL | string |
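
To make the three data types more concrete, here is a small sketch of how a value could be checked against the data type of its property before an import. The property names and types are copied from the table; the helper names `is_uri` and `accepts`, and the idea of validating before import, are assumptions made for this example only.

```python
from urllib.parse import urlparse

# Property names and data types as listed in the table.
PROPERTY_TYPES = {
    "instance of": "multiple_items",
    "author": "multiple_items",
    "editor": "multiple_items",
    "keyword": "multiple_items",
    "publisher": "item",
    "place of publication": "item",
    "publication type": "item",
    "journal": "item",
    "conference": "item",
    "series": "item",
    "publication date": "string",
    "DOI": "string",
    "issue": "string",
    "volume": "string",
    "pages": "string",
    "title": "string",
    "reference URL": "string",
}


def is_uri(value):
    """Rough check: a string that has both a scheme and a host part."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def accepts(prop, value):
    """Return True if `value` fits the data type declared for `prop`."""
    kind = PROPERTY_TYPES[prop]
    if kind == "string":
        return isinstance(value, str)
    if kind == "item":
        return is_uri(value)
    # multiple_items: a list or tuple of item URIs
    return isinstance(value, (list, tuple)) and all(is_uri(v) for v in value)


# The URI below is a made-up example, not an id from our instance.
print(accepts("publisher", "http://wikibase.example/entity/Q5"))
print(accepts("publisher", "oekom"))
```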

The following table lists the Wikidata properties (P) and items (Q) that correspond to our labels:

Label | Wikidata ID |
---|---|
Language | no number |
Label | no number |
Description | no number |
Also known as | no number |
instance of | P31 |
author | P50 |
publication date | P577 |
published in | P1433 |
volume | P478 |
issue | P433 |
page(s) | P304 |
DOI | P356 |
publisher | P123 |
editor | P98 |
place of publication | P291 |
title | P357 or P1476 |
collection | P195 |
edition number | P393 |
official website | P856 |
book | Q571 |
doctoral thesis | P1026 |
software | Q7397 |
scientific journal | Q5633421 |
diploma | Q217577 |
data publication | Q17051824 |
lecture | Q603773 |
newspaper | Q11032 |
report | Q10870555 |
publication | Q732577 |
scholarly article | Q13442814 |
creator | Q2500638 |
subclass of | P279 |
There are three base types in our model; all of our items inherit from these directly or indirectly:
- Work: describes an item in our data
- Publication Type: describes the type of the work
- Creator: a base class for Authors and Editors
The following diagram describes the properties on each Work instance:
The following diagram describes our type hierarchy:
Our Creator diagram is fairly simple:
We also have the following base types:
- Publisher
- Place of Publication
- Journal
- Conference
- Series
Note: since we are describing the data and the model that we have, and not the data in the Wikibase instance, we use the following notation:
- `P<instance of>`: the id of the "instance of" property in the local Wikibase instance
- `Q<creator: Linneweber, V.>`: the id of the item corresponding to the creator with the name Linneweber, V.
#get all authors
SELECT ?author ?authorLabel WHERE {
  ?author P<instance of> Q<base: Author>.
}
#get all works by the author: Linneweber, V.
SELECT ?work ?workLabel WHERE {
  ?work P<author> Q<creator: Linneweber, V.>.
}
#get all authors who are also editors
SELECT ?creatorLabel WHERE {
  ?creator P<instance of> Q<base: Author>.
  ?creator P<instance of> Q<base: Editor>.
}
#get all works published in Berlin
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Berlin>.
}
#get all works that contain the keyword: air pollution
SELECT ?workLabel WHERE {
  ?work P<keyword> Q<keyword: air pollution>.
}
#get all lectures in Potsdam by M. Stock
SELECT ?workLabel WHERE {
  ?work P<place of publication> Q<place: Potsdam>.
  ?work P<instance of> Q<base: lecture>.
  ?work P<author> Q<creator: Stock, M.>.
}
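
Because the queries above use the `P<...>` and `Q<...>` notation instead of real ids, they have to be translated before they can be sent to the query service. The following sketch shows one possible way to do that. The ids in `LOCAL_IDS` are invented, the function name `resolve_placeholders` is hypothetical, and the sketch assumes that the query service of the local Wikibase instance defines the usual `wdt:` and `wd:` prefixes.

```python
import re

# Invented ids for illustration; the real ones depend on the local instance.
LOCAL_IDS = {
    "P<author>": "P2",
    "Q<creator: Linneweber, V.>": "Q42",
}

# Matches the P<...> and Q<...> notation used in the example queries.
PLACEHOLDER = re.compile(r"[PQ]<[^>]+>")


def resolve_placeholders(query, ids):
    """Replace every P<...>/Q<...> placeholder by a prefixed local id."""

    def replace(match):
        token = match.group(0)
        if token not in ids:
            raise KeyError(f"no local id known for {token}")
        # Assumption: properties use the wdt: prefix, items the wd: prefix.
        prefix = "wdt:" if token.startswith("P") else "wd:"
        return prefix + ids[token]

    return PLACEHOLDER.sub(replace, query)


TEMPLATE = """SELECT ?work ?workLabel WHERE {
  ?work P<author> Q<creator: Linneweber, V.>.
}"""

resolved = resolve_placeholders(TEMPLATE, LOCAL_IDS)
print(resolved)
```

Unknown placeholders raise a `KeyError` on purpose, so that a query with a missing id is never sent by accident.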
After working with our preliminary data model for some time, we noticed several things. We discarded the old data model because our long-term goal is to get the data onto Wikidata, and our first data model did not match Wikidata's data model at all. We have also made great progress in cleaning up the data set, which has changed it in some places. This allows us to handle the data better and to use it differently than previously planned.
Our revised data model looks like this:
The graphic shows how the data types in Wikidata are connected and which subclasses exist for each class. We can use these classes for orientation and edit our data set so that it fits the Wikidata format.
To get a better overview of our data, we first determined how many entries each column currently contains. We then researched the Wikidata data model to find properties that correspond to our column names and publication types.
Our initial overview is shown here in a table:
Frequency | Column in PIK data set | Wikidata property |
---|---|---|
8261 | title | title P1476 |
8258 | keywords | |
8235 | year | publication date P577 |
8080 | authors | author P50 |
7796 | publisher | publisher P123 |
6299 | startpage | number of pages P1104 |
6034 | endpage | number of pages P1104 |
4493 | journal | academic journal Q737498 |
4468 | x4 ( = DOI / Identifier) | DOI P356 |
3879 | vol | volume P478 |
3462 | issue | issue P433 |
2922 | place | place of publication P291 |
1656 | editors | editor P98 |
1516 | booktitle | |
1340 | relation (= series) | part of the series P179 |
974 | link | |
921 | comment | |
385 | conference | |
Now that we knew exactly which columns of the PIK data set still lacked a Wikidata counterpart, we searched specifically for the missing properties and filled in the gaps. As a result, we now have a list of the Wikidata property or item for each column of the PIK data set and for each publication type, which we use from now on.
A publication is an instance of (P31) one of the following items:
- `paperr`: article Q191067
- `papern`: article Q191067
- `inbook`: chapter Q1980247
- `confpaper`: conference paper Q23927052
- `lecture`: lecture Q603773
- `report`: report Q10870555
- `epup`: electronic publication Q21572908
- `inreport`: research report Q59387148
- `intseries`: technical report Q3099732
- `book`: book Q571
- `newspaper`: newspaper article Q2495037
- `edbook`: edited volume Q1711593
- `data`: data publication Q17051824
- `software`: software project Q63437139
- `dipl`: diploma thesis Q30749496
- `habil`: habilitation thesis Q1414362
- `thesis`: doctoral thesis Q187685
- `proceedings`: proceedings Q1143604
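
In code, this list is simply a lookup table. The sketch below copies the type codes and item ids from the list above; the names `PUBLICATION_TYPE_ITEMS` and `instance_of_statement` are made up for this example, and returning `None` for unknown type codes is an assumption about how such rows could be handled.

```python
# Type codes from the PIK data set mapped to Wikidata items (see list above).
PUBLICATION_TYPE_ITEMS = {
    "paperr": "Q191067",
    "papern": "Q191067",
    "inbook": "Q1980247",
    "confpaper": "Q23927052",
    "lecture": "Q603773",
    "report": "Q10870555",
    "epup": "Q21572908",
    "inreport": "Q59387148",
    "intseries": "Q3099732",
    "book": "Q571",
    "newspaper": "Q2495037",
    "edbook": "Q1711593",
    "data": "Q17051824",
    "software": "Q63437139",
    "dipl": "Q30749496",
    "habil": "Q1414362",
    "thesis": "Q187685",
    "proceedings": "Q1143604",
}

INSTANCE_OF = "P31"


def instance_of_statement(pik_type):
    """Return (property, item) for a PIK type code, or None if it is unknown."""
    item = PUBLICATION_TYPE_ITEMS.get((pik_type or "").strip().lower())
    if item is None:
        return None
    return (INSTANCE_OF, item)


print(instance_of_statement("book"))
print(instance_of_statement("unknown type"))
```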

The columns map to Wikidata items and properties as follows:
- `author`: Author Q482980
  - `author`: a work has author (P50) property
- `editor`: Editor Q1607826
  - `editor`: a work has editor (P98) property
- `publisher`: Publisher Q2085381
  - `publisher`: a work is published (P123) by a Publisher
- `journal`: academic journal Q737498
  - `published in`: a publication is published in (P1433) a journal
- `issue`: (literal) a publication has an issue (P433)
- `vol`: (literal) a publication has a volume (P478)
- `startpage` & `endpage`: literal
  - `number of pages`: a publication has number of pages (P1104)
- `DOI`: literal DOI (P356)
- `year`: literal publication date (P577)
- `[---]`: a place where a publication was published (country, city, etc.)
  - `place`: a publication has property place of publication P291 to the place where it was published
- `[---]`: a series of publications
  - `series`: a publication is part of the series (P179)
- `[---]`: anything
  - `keywords`: a publication has a list of main subjects (P921)
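
Put together, the mapping can be used to turn one row of the data set into a list of (property, value) pairs. This is a rough sketch under several assumptions that are not fixed anywhere in our model: the function name `row_to_statements` is hypothetical, multiple authors or keywords are assumed to be separated by ";", item values are kept as plain labels instead of being resolved to local item ids, and the number of pages is derived as endpage minus startpage plus one.

```python
# Columns whose values are stored as literals.
LITERAL_COLUMNS = {
    "title": "P1476",
    "year": "P577",
    "x4": "P356",  # x4 holds the DOI / identifier
    "vol": "P478",
    "issue": "P433",
}

# Columns whose values refer to other items (kept as labels in this sketch).
ITEM_COLUMNS = {
    "authors": "P50",
    "editors": "P98",
    "publisher": "P123",
    "journal": "P1433",
    "place": "P291",
    "relation": "P179",
    "keywords": "P921",
}


def row_to_statements(row, separator=";"):
    """Convert one row (a dict of column -> text) into (property, value) pairs."""
    statements = []
    for column, prop in LITERAL_COLUMNS.items():
        value = (row.get(column) or "").strip()
        if value:
            statements.append((prop, value))
    for column, prop in ITEM_COLUMNS.items():
        # Assumption: several values in one cell are separated by `separator`.
        for label in (row.get(column) or "").split(separator):
            if label.strip():
                statements.append((prop, label.strip()))
    start = (row.get("startpage") or "").strip()
    end = (row.get("endpage") or "").strip()
    if start.isdigit() and end.isdigit() and int(end) >= int(start):
        # Assumption: number of pages = endpage - startpage + 1.
        statements.append(("P1104", str(int(end) - int(start) + 1)))
    return statements


example_row = {
    "title": "Example title",
    "year": "2015",
    "authors": "Linneweber, V.; Stock, M.",
    "startpage": "10",
    "endpage": "19",
}
for statement in row_to_statements(example_row):
    print(statement)
```

Empty cells produce no statement at all, which matches the decision to keep incomplete rows instead of dropping them.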