This repository collects the procedures being implemented to clean up the PIK publication data.
Python 3.6 or greater is required. Install the dependencies:
pip install -r requirements.txt
The original data are in the file pik_input.csv. To apply the cleanup, run:
python app.py
The cleaned data are in pik_output.csv. You can use the iPython notebook experiment_and_view.ipnyb to view the data; the basic imports you need for that are already at the top of the file. It may also help to check this guide on Pandas.
To add a cleanup step, write a function that takes a pandas.DataFrame object as input and returns a modified pandas.DataFrame. Put your code in a single file in the folder handlers. The input pandas.DataFrame instance must be the output of the first handler, preprocessing/format_data.py. Import the new function in app.py and add it to the FUNCTIONS list:
FUNCTIONS = [
format_data, # this function must always run first
doi_cleanup,
author_editor_cleanup,
year_cleanup,
# YOUR ADDITIONAL FUNCTION HERE
]
While developing and testing the new cleanup function, you can use the iPython notebook 'experiment_and_view.ipnyb', at the top of which the preprocessed pandas.DataFrame is already loaded.
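As a rough illustration of the handler contract described above (a pandas.DataFrame in, a modified pandas.DataFrame out), here is a minimal sketch. The function name strip_whitespace, the column names, and the sample rows are hypothetical and not part of this repository; the small frame only stands in for the output of format_data.

```python
import pandas as pd


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical handler: trim surrounding whitespace in every text cell."""
    cleaned = df.copy()  # work on a copy so the caller's DataFrame is untouched
    for column in cleaned.columns:
        # Only string cells are stripped; NaN and numeric cells pass through.
        cleaned[column] = cleaned[column].map(
            lambda value: value.strip() if isinstance(value, str) else value
        )
    return cleaned


# Made-up rows standing in for the preprocessed data (assumption).
sample = pd.DataFrame(
    {"title": ["  Climate impacts ", "Sea level"], "year": [2015, 2016]}
)
result = strip_whitespace(sample)
```

A handler written this way could live in its own file under handlers and be appended to FUNCTIONS after the existing entries. Returning a copy rather than editing in place is a design choice of this sketch, not a requirement stated by the project.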
So far, these are the inconsistencies that have been corrected:
- Cells without a value now hold a NumPy np.NaN object instead of spaces or line breaks.
- Columns were renamed to less ambiguous names.
- All DOIs were turned into hyperlinks.
- Invalid entries in the DOI column were nullified.
- The redundant colon ( : ) character was removed from all cells.
- Entries in which publications are marked as 'submitted' were deleted.
- Short date forms mm/dd/yy and dd/mm/yy were converted into yyyy.
- Some typos were corrected.
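Three of the corrections listed above can be sketched with pandas as follows. This is not the code used in this repository. The DOI pattern, the https://doi.org/ link prefix, the century pivot for two-digit years, and the column names doi and year are all assumptions made for illustration.

```python
import re

import numpy as np
import pandas as pd

# Assumption: a loose pattern for what counts as a valid DOI.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def blank_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Replace cells holding only whitespace or line breaks with NaN."""
    return df.replace(r"^\s*$", np.nan, regex=True)


def doi_to_link(value):
    """Turn a valid DOI into a hyperlink; nullify anything else."""
    if isinstance(value, str) and DOI_PATTERN.match(value.strip()):
        # Assumption: links are built with the doi.org resolver prefix.
        return "https://doi.org/" + value.strip()
    return np.nan


def short_date_to_year(value, pivot=50):
    """Convert 'mm/dd/yy' or 'dd/mm/yy' to 'yyyy'; leave other values as-is.

    Both short forms end in the two-digit year, so day/month order is
    irrelevant here. Assumption: years below `pivot` belong to the 2000s,
    all others to the 1900s.
    """
    if not isinstance(value, str):
        return value
    match = re.fullmatch(r"\d{1,2}/\d{1,2}/(\d{2})", value.strip())
    if match is None:
        return value
    two_digit = int(match.group(1))
    century = 2000 if two_digit < pivot else 1900
    return str(century + two_digit)


# Made-up rows for demonstration only.
sample = pd.DataFrame(
    {
        "doi": ["10.1000/xyz123", "n/a", " "],
        "year": ["03/15/12", "2014", "21/06/98"],
    }
)
cleaned = blank_to_nan(sample)
cleaned["doi"] = cleaned["doi"].map(doi_to_link)
cleaned["year"] = cleaned["year"].map(short_date_to_year)
```

Each helper works on a single cell or a whole frame, so the same pattern could be wrapped into a DataFrame-in, DataFrame-out function if it were to be added to FUNCTIONS.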