PIK Data Cleanup Pipeline

This repository collects the procedures that have been implemented to clean up the PIK publication data.

Setup and usage

Python 3.6 or greater is required. Install the dependencies:

pip install -r requirements.txt

The original data are in the file pik_input.csv. To apply the cleanup, run:

python app.py

The cleaned data are written to pik_output.csv. You can use the IPython notebook experiment_and_view.ipynb to view the data; the basic imports you need are already at the top of the notebook. It may also help to consult a guide on pandas.

Adding more data-cleaning blocks

Write a function that takes a pandas.DataFrame object as input and returns a modified pandas.DataFrame. Put your code in a single file in the folder handlers. The input pandas.DataFrame instance must be the output of the first handler, preprocessing/format_data.py. Import the new function in app.py and add it to the FUNCTIONS list:

FUNCTIONS = [
    format_data,  # this function must always run first
    doi_cleanup,
    author_editor_cleanup,
    year_cleanup,
    # YOUR ADDITIONAL FUNCTION HERE
]
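
As a minimal sketch of what such a block could look like, the hypothetical handler below trims surrounding whitespace in a column named title. The names title_cleanup, run_pipeline and EXAMPLE_FUNCTIONS, as well as the column name, are assumptions made for illustration and are not part of this repository; the way app.py actually iterates over FUNCTIONS may differ.

```python
import pandas as pd


def title_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical handler: trim whitespace around titles."""
    df = df.copy()  # work on a copy so the caller's frame is not mutated
    if "title" in df.columns:  # "title" is an assumed column name
        df["title"] = df["title"].str.strip()
    return df


def run_pipeline(df: pd.DataFrame, functions) -> pd.DataFrame:
    """Assumed driver: apply each handler in list order."""
    for func in functions:
        df = func(df)
    return df


# Stand-in for the FUNCTIONS list in app.py
EXAMPLE_FUNCTIONS = [title_cleanup]

sample = pd.DataFrame({"title": ["  Climate impacts ", "Sea level rise"]})
cleaned = run_pipeline(sample, EXAMPLE_FUNCTIONS)
```

Because each handler returns a new DataFrame, handlers can be reordered or tested one at a time without side effects on the input.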

When testing and developing a new cleanup function, you can use the IPython notebook experiment_and_view.ipynb, at the top of which the preprocessed pandas.DataFrame is already loaded.

Implemented Data-Cleanup procedures:

So far, these are the inconsistencies that have been corrected:

0) Preprocessing

Ensure that cells without a value hold NumPy's np.nan instead of spaces or line breaks.
Rename the columns with less ambiguous names.
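
The two preprocessing steps could be sketched as follows. This is not the code in preprocessing/format_data.py; the helper names and the column names in the mapping are invented for illustration.

```python
import numpy as np
import pandas as pd


def blank_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Replace cells that are empty or contain only whitespace
    (spaces, tabs, line breaks) with NaN."""
    return df.replace(r"^\s*$", np.nan, regex=True)


def rename_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename columns according to an old-name -> new-name mapping."""
    return df.rename(columns=mapping)


# Hypothetical raw data with terse column names
raw = pd.DataFrame({
    "yr": ["2015", " ", "\n"],
    "au": ["Smith, J.", "", "Doe, A."],
})
formatted = rename_columns(blank_to_nan(raw), {"yr": "year", "au": "authors"})
```

After this step, missing values can be detected uniformly with isna() by the later handlers.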

1) DOI Cleanup

All DOIs were turned into hyperlinks.
Invalid entries in the DOI column were nullified.
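
One possible way to implement both rules is sketched below. The regular expression, the https://doi.org/ link form, the column name doi and the function names are assumptions; the actual doi_cleanup handler may validate DOIs differently.

```python
import re

import numpy as np
import pandas as pd

# Assumed pattern: "10.", a 4-9 digit registrant code, a slash, then a suffix
DOI_PATTERN = re.compile(r"(10\.\d{4,9}/\S+)")


def doi_to_link(value):
    """Return a DOI hyperlink, or NaN when no DOI-like string is found."""
    if not isinstance(value, str):
        return np.nan
    match = DOI_PATTERN.search(value)
    if match is None:
        return np.nan  # invalid entries are nullified
    return "https://doi.org/" + match.group(1)


def doi_cleanup_sketch(df: pd.DataFrame, column: str = "doi") -> pd.DataFrame:
    df = df.copy()
    df[column] = df[column].map(doi_to_link)
    return df


sample = pd.DataFrame(
    {"doi": ["10.1000/xyz123", "doi:10.1234/abc.def", "n/a", np.nan]}
)
result = doi_cleanup_sketch(sample)
```

Searching for the pattern rather than matching the whole cell means that prefixes such as "doi:" are dropped while the DOI itself is kept.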

2) Author and Editor Names Cleanup

The redundant colon ( : ) character was removed from all cells.
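
A minimal sketch of the colon removal, assuming the columns are named authors and editors (the real column names and the author_editor_cleanup implementation are not shown in this README). The sketch additionally trims whitespace left behind by the removed colon, which is an extra step not described above.

```python
import pandas as pd


def strip_colons(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Remove every ':' from the given columns; missing cells stay missing."""
    df = df.copy()
    for col in columns:
        # regex=False treats ":" as a literal character
        df[col] = df[col].str.replace(":", "", regex=False).str.strip()
    return df


sample = pd.DataFrame({
    "authors": ["Smith, J.:", ": Doe, A."],
    "editors": ["Lee, K. :", None],
})
result = strip_colons(sample, ["authors", "editors"])
```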

3) Year Cleanup

Entries in which publications are marked as 'submitted' were deleted.
Short date forms mm/dd/yy and dd/mm/yy were converted into yyyy.
Some typos were corrected.
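
The first two rules could be sketched as below. Since the year is the last component in both mm/dd/yy and dd/mm/yy, the sketch does not need to tell the two forms apart. The pivot used to expand two-digit years, the column name year and the function names are assumptions; the actual year_cleanup handler may resolve the century differently, and the typo corrections are not covered here.

```python
import re

import pandas as pd

SHORT_DATE = re.compile(r"^\d{1,2}/\d{1,2}/(\d{2})$")
PIVOT = 30  # assumed: two-digit years up to 30 mean 20xx, others 19xx


def normalize_year(value):
    """Turn a short date such as 03/15/07 into a four-digit year string.
    Values that are not short dates are returned unchanged (trimmed)."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    match = SHORT_DATE.match(text)
    if match is None:
        return text
    yy = int(match.group(1))
    century = 2000 if yy <= PIVOT else 1900
    return str(century + yy)


def year_cleanup_sketch(df: pd.DataFrame, column: str = "year") -> pd.DataFrame:
    # Drop rows marked as submitted (case-insensitive), then normalize the rest
    is_submitted = df[column].astype(str).str.strip().str.lower() == "submitted"
    df = df.loc[~is_submitted].copy()
    df[column] = df[column].map(normalize_year)
    return df.reset_index(drop=True)


sample = pd.DataFrame({"year": ["2012", "submitted", "03/15/07", "25/12/98"]})
result = year_cleanup_sketch(sample)
```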
