Mapping Research Software Landscapes through Exploratory Studies of GitHub Data

This repository holds the code, LaTeX files, and instructions for my master's thesis, Mapping Research Software Landscapes through Exploratory Studies of GitHub Data.

Prerequisites

Please refer to the SWORDS-UU framework for necessary prerequisites. For this study, you will need some basic familiarity with Python and Jupyter Notebooks.

Reproducing data retrieval

The data retrieval is based on the SWORDS-UU framework. Please keep in mind that the steps are not fully reproducible, because they depend on external data from Utrecht University and GitHub that can change over time.

First, follow the instructions for phase 1: Find user profiles associated to organisation. This yields a .csv or .xlsx file; the version used in this study can be found in this repository under data/users_enriched.xlsx. This file has already been manually labeled to exclude irrelevant users, namely non-employee students and persons unaffiliated with UU. Because of formatting issues with .csv files, .xlsx is used as the default format.
Next, we use the collected information from the UU employee pages to relate employee information back to the collected GitHub profiles. This file can be found in this repository under data/profile_page_uu_without_orgs.csv.
Now, we want to annotate each GitHub user with the faculty of the corresponding employee profile. To do this, follow the instructions in the Jupyter Notebook 1_label_data.ipynb. This is partly automated using the information from the profile_page_uu.csv file mentioned in the previous step, as well as the names users provide themselves on GitHub. The remaining users and organizations need to be annotated manually. The Jupyter Notebook holds some code to facilitate this. Once this step is done, the first phase of user retrieval, labeling, and annotating is complete. The resulting file can be found in this repository under data/users_labeled.xlsx.
Start phase 2: Collect relevant repositories. As input, use the file data/users_labeled.xlsx. The resulting file can be found in this repository under data/repositories_filtered.xlsx. Two additional columns were manually added: repo_type and note.
In the Jupyter Notebook 1_label_data.ipynb, execute the last part, under the title Label repositories with faculty information. This annotates each repository with the faculty of the corresponding user. The resulting file can be found in this repository under data/repositories_labeled_faculty.xlsx. This is also the fully labeled file.
Start phase 3: Collect further variables. As input, use the labeled repositories under data/repositories_labeled_faculty.xlsx. Each retrieval results in a separate file containing one or more variables.
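
The labeling logic in the steps above can be illustrated with a minimal sketch. It is not the code from 1_label_data.ipynb: the column names (user_id, name, faculty, owner), the exact-match rule on normalized names, and the sample rows are all assumptions made for illustration. Users without a match keep an empty faculty and are left for manual annotation.

```python
import csv
import io


def normalize(name):
    """Lower-case a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def label_users(users, employees):
    """Attach a faculty to each user whose GitHub name exactly matches an employee name.

    Unmatched users get an empty faculty and are left for manual annotation.
    """
    faculty_by_name = {normalize(e["name"]): e["faculty"] for e in employees}
    labeled = []
    for user in users:
        faculty = faculty_by_name.get(normalize(user["name"] or ""), "")
        labeled.append({**user, "faculty": faculty})
    return labeled


def label_repositories(repositories, labeled_users):
    """Copy each owner's faculty onto the repositories they own."""
    faculty_by_user = {u["user_id"]: u["faculty"] for u in labeled_users}
    return [{**r, "faculty": faculty_by_user.get(r["owner"], "")} for r in repositories]


# Made-up rows standing in for the real files (hypothetical columns and values).
users = list(csv.DictReader(io.StringIO("user_id,name\nalice-gh,Alice  Jansen\nbob-gh,\n")))
employees = [{"name": "alice jansen", "faculty": "Science"}]
repositories = [{"owner": "alice-gh", "repo": "tool"}, {"owner": "bob-gh", "repo": "notes"}]

labeled_users = label_users(users, employees)
labeled_repos = label_repositories(repositories, labeled_users)

# Users that still need a manual faculty annotation.
needs_manual = [u["user_id"] for u in labeled_users if not u["faculty"]]
print(needs_manual)
```

Keeping the unmatched users in a separate list mirrors the split between the automated and the manual part of the annotation.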

This completes the data retrieval. The next step is to analyze the gathered data. All relevant code for analysis can be found in this repository.
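
Because phase 3 produces one file per retrieval, the variables have to be brought together per repository before analysis. The following is a rough sketch of such a join, not the approach used in 2_analysis.ipynb; the key column repo_id, the variables language and license, and all sample values are hypothetical.

```python
def merge_variable_tables(base_rows, variable_tables, key="repo_id"):
    """Left-join each variable table onto the base rows by a shared key.

    Rows in a variable table whose key is not in the base rows are ignored.
    """
    merged = {row[key]: dict(row) for row in base_rows}
    for table in variable_tables:
        for row in table:
            target = merged.get(row[key])
            if target is None:
                continue  # variable row for a repository outside the labeled set
            target.update({k: v for k, v in row.items() if k != key})
    return list(merged.values())


# Made-up rows standing in for the labeled repositories and two variable files.
base = [
    {"repo_id": "a/x", "faculty": "Science"},
    {"repo_id": "b/y", "faculty": ""},
]
languages = [{"repo_id": "a/x", "language": "Python"}]
licenses = [
    {"repo_id": "a/x", "license": "MIT"},
    {"repo_id": "c/z", "license": "GPL-3.0"},
]

combined = merge_variable_tables(base, [languages, licenses])
print(combined)
```

A left join keeps every labeled repository, so repositories for which a retrieval returned nothing simply lack that variable instead of disappearing from the table.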

Reproducing Analysis

Once the data is available from the previously described steps, you can run the Jupyter Notebook 2_analysis.ipynb from top to bottom. The notebook itself provides further details.
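
Before running the notebook, it can help to confirm that the data files from the retrieval steps are in place. The check below is a small sketch; the list of expected files is an assumption taken from the paths mentioned in this README, and the notebook may read additional files.

```python
from pathlib import Path

# Paths taken from the retrieval steps in this README (assumed, possibly incomplete).
EXPECTED_INPUTS = [
    "data/users_labeled.xlsx",
    "data/repositories_filtered.xlsx",
    "data/repositories_labeled_faculty.xlsx",
]


def missing_inputs(repo_root, expected=EXPECTED_INPUTS):
    """Return the expected input files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Run from the repository root.
missing = missing_inputs(".")
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All expected input files are present.")
```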

Contact

In case of questions, don't hesitate to reach out! You can find more information on how to contact me on my GitHub profile.

License

Distributed under the MIT License. See LICENSE for more information.
