This repository contains a set of Python scripts used to transfer data from PMA to JHRDR, along with documentation of how the work was done.
PMA (Performance Monitoring for Accountability) is a research group at the School of Public Health. They have a DataCite client account and can mint their own DOIs. Their funding from the Gates Foundation ends in 2024, so they asked us to transfer their datasets to be hosted on JHRDR. The datasets already have DOIs, and the researchers want to keep the same DOIs. They also have landing pages with metadata, plus minimal metadata in DataCite.
We used the Import CSV to Dataverse functionality of the Python package pydataverse.
We adapted the pydataverse CSV templates to be user-friendly for PMA, then wrote Python scripts to transform the metadata we received from them into the CSV template format and to automate the upload of the metadata and files into JHRDR.
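As a rough illustration of the transformation step, the sketch below renames user-friendly sheet columns to template-style columns and writes the result as CSV. It is not the code in fill_csv.py: the column names in `COLUMN_MAP`, the helper names, and the sample row (including its DOI) are all hypothetical placeholders, and the rows are assumed to have been read from the spreadsheet already.

```python
import csv
import io

# Hypothetical mapping from user-friendly sheet headers to template headers.
# These names are placeholders, not the actual pydataverse template columns.
COLUMN_MAP = {
    "Dataset Title": "dv.title",
    "DOI": "org.doi",
    "Folder Name": "org.folder_name",
}


def to_template_rows(sheet_rows, column_map=COLUMN_MAP):
    """Rename the keys of each row and drop any column that is not mapped."""
    return [
        {new: (row.get(old) or "").strip() for old, new in column_map.items()}
        for row in sheet_rows
    ]


def write_template_csv(rows, column_map=COLUMN_MAP):
    """Serialize the renamed rows to CSV text with the template header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(column_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example row standing in for one line of the metadata sheet.
sheet_rows = [
    {
        "Dataset Title": " Example survey ",
        "DOI": "10.1234/example",
        "Folder Name": "example_folder",
        "Notes": "not part of the template",
    }
]
template_rows = to_template_rows(sheet_rows)
csv_text = write_template_csv(template_rows)
print(csv_text)
```

Keeping the mapping in a single dictionary means the sheet given to the data provider can use readable headers while the output stays aligned with whatever header names the upload tool expects.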
Files in this repository:
- PMA_metadata_template.xlsx: the template we provided to PMA to fill out with metadata about the collections, datasets, and files to be uploaded.
- fill_csv.py: Python script to transform the metadata received from PMA into the pydataverse CSV template for batch upload of the PMA datasets.
- ingest_csv.py: Python script to upload files to the PMA datasets.
- find_file_mismatches.py: identifies files that are listed in the metadata sheet but not found in the file directory.
For licensing reasons, the actual data files uploaded to JHRDR are not included in this repository.
The workflow is as follows:
- Insert the files to be uploaded into the same directory as PMA_metadata_template.xlsx. The files for each dataset should be in their own folder, with the name of that folder given in the "Folder Name" column of the files sheet of PMA_metadata_template.xlsx (as described in the instructions sheet).
- Run fill_csv.py to transform the metadata collected in PMA_metadata_template.xlsx into the schema required for upload by pydataverse. This outputs 2 files: pma_datasets_toupload.csv and pma_files_toupload.csv.
- (Optional) Run find_file_mismatches.py to identify files that are listed in the files metadata CSV but not found in the file directory.
- Run ingest_csv.py, which reads in pma_datasets_toupload.csv and pma_files_toupload.csv, creates the datasets, and uploads their corresponding files to Dataverse. Be sure to change the value of dv_alias to the Dataverse you want to upload to.
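The optional mismatch check can be sketched as follows. This is a minimal illustration, not the contents of find_file_mismatches.py: the function name and the "File Name" column are assumptions (only "Folder Name" is named in this README), and the example builds a throwaway directory instead of reading the real upload folder.

```python
import csv
import io
import tempfile
from pathlib import Path


def find_missing_files(files_csv_text, base_dir,
                       folder_col="Folder Name", name_col="File Name"):
    """Return 'folder/file' entries that are listed in the CSV but absent on disk."""
    base = Path(base_dir)
    missing = []
    for row in csv.DictReader(io.StringIO(files_csv_text)):
        path = base / row[folder_col] / row[name_col]
        if not path.is_file():
            missing.append(f"{row[folder_col]}/{row[name_col]}")
    return missing


# Build a temporary directory with one dataset folder holding one file,
# then check a listing that names two files.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "survey_a"
    dataset_dir.mkdir()
    (dataset_dir / "data.csv").write_text("x\n1\n")

    listing = (
        "Folder Name,File Name\n"
        "survey_a,data.csv\n"
        "survey_a,codebook.pdf\n"
    )
    missing = find_missing_files(listing, tmp)

print(missing)
```

Running a check like this before the upload step surfaces typos in file names or folders while they are still cheap to fix, before any datasets have been created.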
Dependencies (json and os are part of the Python standard library):
- Python 3.11.4
- pandas 1.5.3
- numpy 1.24.3
- json
- os
- pydataverse 0.3.1