Use the productivity
branch for the latest updates.
Add to the root a folder named Resources/
with the following files:
repositories.txt
containing the list of projects (one per line) to be analyzed, in the following formatorg/repo_name
(e.g., `atom/atom);tokens.txt
(optional) containing the list of GH tokens to be used;
Refer to this README.md file.
Refer to this README.md file.
Uses the tokens defined in Resources/tokens.txt
and the list of repository urls in Resources/repositories.txt
, as defined in the Settings.py
file.
- None.
- Set files and folders names in the
Settings.py
file
python CommitExtractor.py
logs/Commit_Extraction_organization.log
: log fileOrganizations/<organization>/[<repo1>...<repoN>]/
: Results folders- For each repo folder:
commit_list.csv
: List of the commits in the format: <SHA; author_id; date>commit_history_table.csv
: Matrix of autors and dates. The cells contain the number of the commits of a developer in one daypauses_duration_list.csv
: List of pauses durations in days for each developer in the format: <dev; listOfDurations>pauses_dates_list.csv
: List of pauses dates for each developer in the format: <dev; listOfPauseDates>
- The same files are given after merging the commits of every organization's repo in the
Organizations/<organization>/
folder.
if you came here from point 2 of core selection you can now perform step 3 following (CoreSelection | Step 3)
- None
- Set files and folders names in the
Settings.py
file
python ActivitiesExtractor.py
logs/Commit_Extraction_organization.log
: log fileOrganizations/<organization>/[<repo1>...<repoN>]/Other_Activities/
: Results folders- For each repo folder:
issues_comments_repo.csv
: List of the issue comments in the format: <id; date; creator_login>issues_events_repo.csv
: List of the issue events in the format: <id; date; creator_login>issues_prs_repo.csv
: List of the issue and pull request creations in the format: <id; date; creator_login>pulls_comments_repo.csv
: List of the pull request comments in the format: <id; date; creator_login>
mode
: enter one of following modes ['tf', 'a80', 'a80mod', 'a80api']
- Set files and folders names in the
Settings.py
file - Insert the list of the TF/core developers (<TF_developers_file>) in the right folder. Formatted as a list of <name;login>. The path to save the file is set in the
Settings.py
file. - Set the
window
size and theshift
size in theSettings.py
file
python BreaksIdentification.py tf | a80 | a80mod | a80api
logs/Breaks_Identification.log
: log fileOrganizations/<organization>/Dev_Breaks/
: Results folders- For each developer in the TF file:
<devLogin>_breaks.csv
: List of the breaks in the format: <len; dates; Tfov_used>
Let D be a developer to analyze and let life(D) be the number of days between its first and last commits.
For each sliding window W in life(D) which slides of shift days. The values of variables window (default 90 days) and shift (default 7 days) are set in the Settings.py
file).
The goal is to select all the breaks (pauses that are larger than usual) associated with the Tfov (Far-out-value threshold) of the first window where they have been found:
- PAUSES SELECTION STEP
- In the list
win_pauses
, put all the pauses within W (only these pauses define the rythm of D in W). - In the list
partially_included
, put all the pauses partially within W (i.e., pauses that start in W and end in the next window).
- Tfov DEFINITION STEP
-
If
win_pauses
contains >=4 pauses then the W is valid, then usewin_pauses
to calculate Tfov. If Tfov is valid (i.e., IQR>1), then proceed to the breaks identification step (go to STEP 3). -
Else, when
win_pauses
< 4 (i.e., Tfov cannot be calculated) or if Tfov is invalid (i.e., IQR<=1) for W, then:- If a previous Tfov exists, then consider it as the current Tfov and proceed to the next step for breaks identification (go to STEP 3).
- Otherwise, save into the list
clear_breaks
all the pauses frompartially_included
that are larger than the window size and have not been considered yet, ignore the other pauses inwin_pauses
; move forward W by shift days and RESTART (go back to STEP 1).
(Note: The pauses that are larger than shift days will be considered in the next W and so on, whereas the smaller ones are not breaks and can be safely ignored).
- BREAKS IDENTIFICATION STEP
- Select as break each couple <p, t> from the lists
win_pauses
andpartially_included
where t is Tfov and p is a pause > Tfov.- Move forward W by shift days and RESTART (go back to STEP 1).
- FINAL STEP (When there are no more W)
- Compute Avg_Tfov as the average of all the valid Tfovs found.
- Save the pauses in the list
clear_breaks
as breaks (<p, t> where t is Avg_Tfov, and p is a pause > Avg_Tfov as for list definition).
mode
: choose one of following modes ['tf', 'a80', 'a80mod', 'a80api']
- Make sure to have already executed the
BreaksIdentification.py
script to get the<devLogin>_breaks.csv
files (one for each developer).
python BreaksLabeling.py tf | a80 | a80mod | a80api
logs/Breaks_Labeling.log
: events log fileOrganizations/<organization>/Dev_Breaks/
: Results folders- For each developer in the TF file:
<devLogin>_labeled_breaks.csv
: List of the breaks in the format: <len; dates; Tfov_used; label; previously>
-
Get a break from the
Breaks
list. -
If there is not any other activity performed by the developer during the break, then label it
INACTIVE
if < 365 days;GONE
otherwise. -
If there are other activities in the period:
- Define
sub_breaks_list
as the list of the intervals between such activities (sub_break). - Identify each sub_break > Tfov from the
sub_breaks_list
and label it based on the defined state diagram (∆t_inactive = ∆t_non-coding = Tfov).