How should we clean up the data #2

Open
4 of 10 tasks
AbdBarho opened this issue May 15, 2019 · 1 comment
@AbdBarho
Member

AbdBarho commented May 15, 2019

how should we deal with the following data samples?

  • Authors & Editors
    (Sanitise the names of Authors and Editors in the data #13)
    • Toth, F.L. (guest editor)
    • et al. (including Schellnhuber, H.-J.)
    • (in co-operation with Becker, D.
      • Ballerstedt, K.)
    • Kl�cking, B.
    • (and 254 others, including Schellnhuber, H. J.)
    • Höhne, N: should we replace : with . ?
    • Kry<sanova, V

  • Publisher
    • [Corresponding paper: http://dx.doi.org/10.5194/esd-7-783-2016]

  • Start page & End page
    • 414; 304; 100;
    • Art.-No.159804
    • XXIII, 566
    • 062211-1

  • Vol
    In my opinion we should drop this column completely; no useful information can be deduced from it.



  • Relation (Serie)
    • PIK Reports ; 21
    • Warnsignal Klima - Wissenschaftliche Fakten

  • Comment
    I also think we should drop this column

  • X1, keywords and peer reviewed
    • items are sometimes separated with a comma, and other times with a semicolon
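
One way to normalise the mixed separators is to split on either character. This is a minimal sketch, assuming that no item contains a comma or semicolon itself; `split_items` is a hypothetical helper name, not something from the repository, and the sample string is made up.

```python
import re

def split_items(value):
    """Split a cell on commas or semicolons and strip surrounding whitespace."""
    if not isinstance(value, str):
        return []  # missing values (None / NaN) become an empty list
    parts = re.split(r"[;,]", value)
    # Drop empty fragments left behind by trailing or doubled separators.
    return [p.strip() for p in parts if p.strip()]

print(split_items("Climate System; Data, Library;"))
```

If some values do contain a legitimate comma, this split would be too aggressive and those cases would need to be listed by hand.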

@AbdBarho AbdBarho self-assigned this May 15, 2019
@kozae
Contributor

kozae commented Jun 9, 2019

Remarks and recommendations for cleaning the data:

Type "inbook"

  • 181 entries of type "inbook" have no "booktitle" but do have a journal. I recommend converting them to "paperr".
pik_df.loc[(pik_df['type'] == 'inbook') & (pik_df['booktitle'].isnull())]
  • 2 entries have a value in the column "conference". This seems unrelated, so I recommend nullifying them.
pik_df.loc[(pik_df['type'] == 'inbook') & ~(pik_df['conference'].isnull())]
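
The two recommendations above could be applied roughly as follows. This is a sketch on a made-up three-row frame, assuming the column names used in the queries above (`type`, `booktitle`, `journal`, `conference`). `fix_inbook` is a hypothetical helper, and it only converts rows that actually have a journal value.

```python
import pandas as pd

def fix_inbook(df):
    """Return a copy with the two 'inbook' recommendations applied."""
    df = df.copy()
    inbook = df["type"] == "inbook"
    # Nullify the unrelated "conference" value on inbook entries.
    df.loc[inbook & df["conference"].notnull(), "conference"] = None
    # Reclassify inbook entries that have a journal but no booktitle.
    to_paper = inbook & df["booktitle"].isnull() & df["journal"].notnull()
    df.loc[to_paper, "type"] = "paperr"
    return df

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "type": ["inbook", "inbook", "inbook"],
    "booktitle": [None, "Some Book", None],
    "journal": ["Some Journal", None, None],
    "conference": [None, "Some Conf", None],
})
fixed = fix_inbook(sample)
print(fixed["type"].tolist())
```

Working on a copy keeps the original frame available for comparing before and after.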

Type "confpaper"

  • 113 entries of type "confpaper" have no value for "conference". I believe the safest bet is to treat all of them as "paperr", i.e. scholarly articles. The value for journal may need to be fetched from another database.
pik_df.loc[(pik_df['type'] == 'confpaper') & (pik_df['conference'].isnull())]

Type "lecture"

  • 469 entries in total, 53 of which duplicate entries of other types. We need to check whether "lecture" is valid for visualizations by Scholia; if not, we might just drop them. Finding duplicates:
lecture_df = pik_df.loc[(pik_df['type'] == 'lecture')]
count = 0
for index, row in lecture_df.iterrows():
    if len(pik_df.loc[~(pik_df['type'] == 'lecture') & (pik_df['title'] == row['title'])]) !=0:
        count +=1
print(count)
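
The row-by-row loop above can be slow on larger frames. Here is a vectorised sketch that should give the same count, assuming `pik_df` has `type` and `title` columns; the small frame is made up for illustration. Missing titles are dropped first, because `isin` would otherwise match NaN with NaN, which the `==` comparison in the loop does not.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
pik_df = pd.DataFrame({
    "type": ["lecture", "lecture", "paperr", "book", "lecture"],
    "title": ["A", "B", "A", "C", None],
})

is_lecture = pik_df["type"] == "lecture"
# Titles that occur under any type other than "lecture" (NaN excluded).
other_titles = pik_df.loc[~is_lecture, "title"].dropna()
# Lecture rows whose title also appears under another type.
count = int(pik_df.loc[is_lecture, "title"].isin(other_titles).sum())
print(count)
```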

Type "paperr"

  • Almost half of the dataset, 3544 entries, but only 11 have a value in "place". We either add "Potsdam" as the place of writing, or look up the city of the publisher.
pik_df.loc[(pik_df['type'] == 'paperr') & ~(pik_df['place'].isnull())]

Types "software", and "data"

  • Are "software" and "data" valid for visualization in Scholia? Otherwise, we could just drop them.

The issue of duplicates

  • Only drop a duplicate if it has the same type. There is sometimes an article or a lecture about a book, and in those cases the duplication is justified.
for value in pik_df.type.unique():
    print(value, '---> ', pik_df.loc[(pik_df['type'] == value)]['title'].duplicated().astype(int).sum())

output:

inbook --->  77
confpaper --->  16
lecture --->  20
paperr --->  169
papern --->  25
instseries --->  2
epup --->  15
book --->  8
inreport --->  11
report --->  4
edbook --->  1
thesis --->  3
nan --->  0
proceedings --->  0
newspaper --->  6
dipl --->  0
habil --->  0
data --->  0
software --->  1
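
Dropping duplicates only within the same type can be done by keying on both columns. This sketch assumes that an identical title within one type really is the same record, which is worth spot-checking before applying it. The frame below is made up.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
pik_df = pd.DataFrame({
    "type": ["book", "lecture", "lecture", "paperr"],
    "title": ["T1", "T1", "T1", "T2"],
})

# Keep the first row for each (type, title) pair; the same title under a
# different type (e.g. a lecture about a book) is left alone.
deduped = pik_df.drop_duplicates(subset=["type", "title"], keep="first")
print(len(pik_df), "->", len(deduped))
```

Note that `drop_duplicates`, like `duplicated` above, treats rows with a missing title as equal to each other, so those may need to be excluded first.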

Column: oldDepartmentNames, previously "keywords"

  • it has 10 possible values in total:

    • 'Global Change',
    • 'Data',
    • 'Climate System',
    • 'Climate Research',
    • 'Social Systems',
    • 'Computation',
    • 'BAHC',
    • 'Library',
    • 'Natural Systems',
    • 'Integrated Systems Analysis'

However, in some cases these might be shortened names, and BAHC is an acronym for "Biological Aspects of the Hydrological Cycle". Should we use these values at all, i.e. will they be useful for Scholia? Or should we nullify them? Do we need to research the original names?

Columns "publisher" and "journal"

  • Often, a single value is written in different ways, e.g. sometimes full name, sometimes as acronym, and with different letter case patterns. Solutions:
    • using edit distance to determine possibly related entries.
    • writing a simple procedure to determine if acronyms relate to certain values in the column, i.e. checking the first letter of each word and matching them.
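
Both ideas can be prototyped with the standard library. This is a rough sketch: `edit_distance` is a plain Levenshtein implementation and `is_acronym_of` is a hypothetical helper that compares the initials of each word. The example strings are made up, and a real run would need a distance threshold tuned against the actual column values.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (case-insensitive)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def is_acronym_of(acronym, name):
    """True if `acronym` equals the initials of the words in `name`."""
    initials = "".join(word[0] for word in name.split())
    return acronym.lower() == initials.lower()

print(edit_distance("Springer Verlag", "springer-verlag"))
print(is_acronym_of("GCB", "Global Change Biology"))
```

The acronym check does not skip small words such as "of" or "the", so names containing them would need a stop-word list before matching.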

Columns "comment" and "keywordsAndPeerReview"

  • Values are very inconsistent, and I do not believe they are relevant to any data visualization. Either keep them, if Wikidata has a property for such arbitrary data, or leave them out entirely.


2 participants