110_dataSources.Rmd

# (PART) Data Sources and Metrics and Standards in Software Engineering Defect Prediction {-}

# Data Sources in Software Engineering

We classify this trail in the following categories:

* _Source code_ can be studied to measure its properties, such as size or complexity.

* _Source Code Management Systems_ (SCM) make it possible to store all the changes that the different source code files undergo during the project. Also, SCM systems allow for work to be done in parallel by different developers over the same source code tree. Every change recorded in the system is accompanied with meta-information (author, date, reason for the change, etc) that can be used for research purposes.

* _Issue or Bug tracking systems_ (ITS). Bugs, defects and user requests are managed in ISTs, where users and developers can fill tickets with a description of a defect found, or a desired new functionality. All the changes to the ticket are recorded in the system, and most of the systems also record the comments and communications among all the users and developers implied in the task. 

* _Messages_ between developers and users. In the case of free/open source software, the projects are open to the world, and the messages are archived in the form of mailing lists and social networks which can also be mined for research purposes. There are also some other open message systems, such as IRC or forums. 

* _Meta-data about the projects_. As well as the low level information of the software processes, we can also find meta-data about the software projects which can be useful for research. This meta-data may include intended-audience, programming language, domain of application, license (in the case of open source), etc.


![Metadata (source: Israel Herraiz)](figures/artifactsMeta.png)

* _Usage data_. There are statistics about software downloads, logs from servers, software reviews, etc. 


Types of information stored in the repositories:

  * Meta-information about the project itself and the
people that participated.

    + Low-level information

        * Mailing Lists (ML)

        * Bug Tracking Systems (BTS) or Project Tracker System (PTS)

        * Software Configuration Management Systems (SCM)

    + Processed information. For example project management information about the effort estimation and cost of the project.

* Whether the repository is public or not

* Single project vs. multiprojects. Whether the repository contains information of a single project with multiples versions or multiples projects and/or versions.

* Type of content, open source or industrial projects

* Format in which the information is stored and formats or technologies for accessing the information:
  
    + Text. It can be just plain text, CSV (Comma Separated Values) files, Attribute-Relation File Format
(ARFF) or its variants
    
    + Through databases. Downloading dumps of the database.
    
    + Remote access such as APIs of Web services or REST


# Repositories


There is a number of open research repositories in Software Engineering. Among them:

  + Zenodo. It is becoming a popular site for publishing datasets associated with papers. It provides DOIs for referencing data and code:
  [https://zenodo.org/](https://zenodo.org/)
  
  + Spinellis maintais a curated repository on Github:
  [https://github.com/dspinellis/awesome-msr](https://github.com/dspinellis/awesome-msr)

  + PROMISE (PRedictOr Models In Software Engineering). There is a conference with this name ([Promise Conference](https://promiseconf.github.io/))
  
 Some popular datasets used as benchmarking in may paper can still be found on: 
[http://promise.site.uottawa.ca/SERepository/datasets-page.html](http://promise.site.uottawa.ca/SERepository/datasets-page.html)
The is some well-known issues wit the NASA datasets and the source code is not available.

<!--  + Finding Faults using Ensemble Learners (ELFF) [@Shippey2016Esem]
  [http://www.elff.org.uk/](http://www.elff.org.uk/)-->

  + FLOSSMole [@HCC06] 
[http://flossmole.org/](http://flossmole.org/)

  + FLOSSMetrics [@herraiz2009flossmetrics]: 
[http://flossmetrics.org/](http://flossmetrics.org/)

  + Qualitas Corpus (QC) [@QualitasCorpus2010]: 
[http://qualitascorpus.com/](http://qualitascorpus.com/)

  + Sourcerer Project [@LBNRB09]: 
[http://sourcerer.ics.uci.edu/](http://sourcerer.ics.uci.edu/)

  + Ultimate Debian Database (UDD) [@NZ10] 
[http://udd.debian.org/](http://udd.debian.org/)

  + SourceForge Research Data Archive (SRDA) [@VanAntwerpM2008] 
[http://zerlot.cse.nd.edu/](http://zerlot.cse.nd.edu/)

<!--  + SECOLD (Source code ECOsystem Linked Data): 
    [http://www.secold.org/](http://www.secold.org/)
-->

  + Software-artifact Infrastructure Repository (SIR) 
[http://sir.unl.edu]

  + OpenHub:
[https://www.openhub.net/](https://www.openhub.net/)


Not openly available (and mainly for effort estimation):

  + The International Software Benchmarking Standards Group (ISBSG)
    [http://www.isbsg.org/](http://www.isbsg.org/)
  
<!--  + TukuTuku
    [http://www.metriq.biz/tukutuku/](http://www.metriq.biz/tukutuku/)
-->

Some papers and publications/theses that have been used in the literature:

  + Helix Data Set [@Vasa2010]: 
    [http://www.ict.swin.edu.au/research/projects/helix/](http://www.ict.swin.edu.au/research/projects/helix/)

  + Bug Prediction Dataset (BPD) [@DAmb2010a,@ALR11]: 
    [http://bug.inf.usi.ch/](http://bug.inf.usi.ch/)

  + Eclipse Bug Data (EBD) [@ZPZ07,@NZZH12]: 
    [http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/](http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/)


# Open Tools/Dashboards to extract data

Process to extract data:

![Process](./figures/process.png)

Within the open source community, several toolkits allow us to extract data that can be used to explore projects:

Metrics Grimoire
[http://metricsgrimoire.github.io/](http://metricsgrimoire.github.io/)


![Grimoire](figures/Grimoire.png)

SonarQube
[http://www.sonarqube.org/](http://www.sonarqube.org/)


![SonarQube](figures/sonarQube.png)


CKJM (OO Metrics tool)
[http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/](http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/)

Collects a large number of object-oriented metrics from code.

## Issues

There are problems such as different tools report different values for the same metric [@Lincke2008]

It is well-know that the NASA datasets have some problems:

 + [@Gray2011] The misuse of the NASA metrics data program data sets for automated software defect prediction

 + [@Shepperd2013] Data Quality: Some Comments on the NASA Software Defect Datasets