GitHub - thiagomarquesrocha/siameseQAT: SiameseQAT, a duplicate bug report detection method that considers not only information on individual bugs, but also collective information from bug clusters. SiameseQAT combines attention mechanisms, which were not previously used in this task, with a novel loss function called Quintet Loss, which considers the centroid of duplicate bug report representation clusters and their contextual information.

SiameseQAT : A Semantic Context-Based Duplicate Bug Report Detection using Replicated Cluster Information

Paper: https://ieeexplore.ieee.org/document/9380447

Abstract:

In large-scale software development environments, defect reports are maintained through bug tracking systems (BTS) and analyzed by domain experts. Different users may create bug reports in a nonstandard manner, and may report a particular problem using a particular set of words due to stylistic choices and writing patterns.

Therefore, the same defect can be reported with very different descriptions, generating non-trivial duplicates. To avoid redundant work for the development team, an expert needs to look at all new reports while trying to label possible duplicates. However, this approach is neither trivial nor scalable and directly impacts on bug fix correction time. Recent efforts to find duplicate bug reports tend to focus on deep neural approaches that consider hybrid representations of bug reports, using both structured and unstructured information. Unfortunately, these approaches ignore that a single bug can have multiple previously identified duplicates and, therefore, multiple textual descriptions, titles, and categorical information.

In this work, we propose SiameseQAT, a duplicate bug report detection method that considers information on individual bugs as well as information extracted from bug clusters. The SiameseQAT combines context and semantic learning on structured and unstructured features and corpus topic extraction-based features, with a novel loss function called Quintet Loss, which considers the centroid of duplicate clusters and their contextual information. We validated our approach on the well-known open-source software repositories Eclipse, NetBeans, and Open Office, comprised of more than 500 thousand bug reports. We evaluated both the retrieval and classification of duplicates, reporting a Recall@25 mean of 85% for retrieval and 84% AUROC for classification tasks, results that were significantly superior to previous works.

1. PREREQUISITES

Several Python libraries are required for the source code to run properly.

Download Dataset

dataset.zip

# Create the /data directory in the root directory
-> mkdir /data
# Unzip the dataset into /data
-> unzip dataset.zip
# The datasets are then available under data/normalized/:
# - eclipse
# - openoffice
# - netbeans

First, install pipenv

$ pip install pipenv

Install mlflow

$ pip install mlflow==1.18.0

Download and install BERT-uncased model

To run the next steps you will need to download the pretrained BERT model uncased_L-12_H-768_A-12. After downloading, unpack it in the root directory.

The expected root directory is:

- uncased_L-12_H-768_A-12
- src
- experiment
- tests

If the uncased_L-12_H-768_A-12 directory is not available, the next steps may fail.
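
A quick check of the layout can catch a missing model directory before any run starts. The sketch below is a hypothetical helper, not part of the repository: it reports which of the expected entries are absent from a given root. The name `missing_entries` is an assumption made for illustration.

```python
from pathlib import Path

# Entries this README lists for the repository root.
EXPECTED_ENTRIES = ("uncased_L-12_H-768_A-12", "src", "experiment", "tests")


def missing_entries(root="."):
    """Return the expected directories that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing from root directory:", ", ".join(missing))
    else:
        print("Root directory layout looks complete.")
```

Run it from the repository root; an empty result means all four directories are in place.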

Install dependencies (optional)

$ pipenv install

2. WORKFLOW

2.1 PREPROCESSING

The datasets must be preprocessed before running the experiments.

Example of how to run preprocessing

Parameter: {dataset}

dataset=eclipse
dataset=netbeans
dataset=openoffice

Parameter: {preprocessor}

preprocessor=bert (default)
preprocessor=baseline (not working yet)

mlflow run . --experiment-name preprocessing -e preprocessing -P dataset=eclipse -P preprocessor=bert
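
To preprocess all three datasets in one go, the command above can be generated once per dataset. This is a minimal sketch, assuming it is run from the repository root with mlflow installed. `preprocessing_command` and `run_all` are hypothetical helpers, and only the flags shown in the command above are used.

```python
import shlex
import subprocess

# Dataset names accepted by the {dataset} parameter.
DATASETS = ("eclipse", "netbeans", "openoffice")


def preprocessing_command(dataset, preprocessor="bert"):
    """Build the preprocessing command as an argument list."""
    return [
        "mlflow", "run", ".",
        "--experiment-name", "preprocessing",
        "-e", "preprocessing",
        "-P", f"dataset={dataset}",
        "-P", f"preprocessor={preprocessor}",
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for dataset in DATASETS:
        command = preprocessing_command(dataset)
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the three commands, which makes it easy to review them before launching the actual runs.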

The dataset from Lazar et al. (2014) has the following open-source software repositories: eclipse, openoffice and netbeans.

After running all the previous steps, the following directories will be created in the root directory:

data/processed/eclipse
data/processed/openoffice
data/processed/netbeans

In each directory, files will be created for the training and test splits, the vocabulary corpus, and the categorical features of the bug reports.

bugs/ : a directory of pickle files, each storing one bug report document in JSON format. Files are named by bug ID. Ex: 1.pkl, 2.pkl, etc.
train.txt : IDs of the bugs that will be used for training.
test.txt : IDs of the bugs that will be used for testing.
word_vocab_bert.pkl : dictionary of the words present in the dataset, saved in pickle format.
bug_ids.txt : IDs of all bugs in the dataset.
normalized_bugs.json : all bug reports, normalized and saved in JSON format.
bug_pairs.txt : list of duplicate pairs made available by Lazar et al. (2014).
bug_severity.dic : dictionary for the severity categorical feature.
bug_status.dic : dictionary for all bug report statuses.
component.dic : dictionary for all bug report components.
priority.dic : dictionary for all bug report priorities.
product.dic : dictionary for all bug report products.
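
The per-bug pickle files can be inspected directly. The sketch below assumes the layout described above (data/processed/{dataset}/bugs/{id}.pkl) and that each file holds a single bug report object. The fields inside a report are not documented here, so the helper simply returns whatever was stored. `load_bug` is a hypothetical name, not a function from the repository.

```python
import pickle
from pathlib import Path


def load_bug(processed_dir, bug_id):
    """Load one bug report from <processed_dir>/bugs/<bug_id>.pkl."""
    bug_path = Path(processed_dir) / "bugs" / f"{bug_id}.pkl"
    with bug_path.open("rb") as handle:
        # Only unpickle files you produced yourself: pickle can execute code.
        return pickle.load(handle)


if __name__ == "__main__":
    # Directory created by the preprocessing step for the eclipse dataset.
    processed = Path("data/processed/eclipse")
    if (processed / "bugs" / "1.pkl").exists():
        print(load_bug(processed, 1))
```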

2.2 EXPERIMENTS

RETRIEVAL EXPERIMENTS

To train a model and evaluate it on the retrieval task, run the command shown in the example below. All models are trained on the train.txt file and evaluated using test.txt.

Models available:
- model_name=SiameseTA
- model_name=SiameseTAT (Not implemented yet)
- model_name=SiameseQAT-A (Not implemented yet)
- model_name=SiameseQAT-W (Not implemented yet)
- model_name=SiameseQA-A
- model_name=SiameseQA-W

Parameters available:
- model_name: Model name to be used. Ex: SiameseQA-A, SiameseQAT-W, SiameseTA
- domain: Dataset to be used. Ex: eclipse, netbeans, openoffice.
- title_seq: Title sequence length to be used in the model.
- desc_seq: Description sequence length to be used in the model.
- batch_size: Batch size for the training and validation phases.
- epochs: Number of epochs for training.
- bert_layers: Number of unfrozen BERT layers for training.
- preprocessing: Type of preprocessing for models. Ex: bert, keras

Example of how to run retrieval experiment

mlflow run . --experiment-name retrieval -e train_retrieval -P model_name=SiameseTA -P domain=eclipse_test -P title_seq=1 -P desc_seq=1 -P batch_size=1 -P bert_layers=1
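
Because several model names are listed as not implemented yet, it can help to validate the parameters before launching a long training run. The following is a sketch under stated assumptions: `retrieval_command` is a hypothetical helper, the set of implemented models is copied from the list above, and `model_name` is assumed to be passed explicitly. The project itself may apply defaults or checks that differ.

```python
# Models listed above without a "Not implemented yet" note.
IMPLEMENTED_MODELS = {"SiameseTA", "SiameseQA-A", "SiameseQA-W"}

# Parameter names taken from the "Parameters available" list.
KNOWN_PARAMETERS = {
    "model_name", "domain", "title_seq", "desc_seq",
    "batch_size", "epochs", "bert_layers", "preprocessing",
}


def retrieval_command(**parameters):
    """Build the train_retrieval command, rejecting unknown or unsupported values."""
    unknown = set(parameters) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    model_name = parameters.get("model_name")
    if model_name not in IMPLEMENTED_MODELS:
        raise ValueError(f"Model not implemented: {model_name!r}")
    command = ["mlflow", "run", ".", "--experiment-name", "retrieval",
               "-e", "train_retrieval"]
    # Keyword arguments keep their call order, so the flags follow it too.
    for name, value in parameters.items():
        command += ["-P", f"{name}={value}"]
    return command


if __name__ == "__main__":
    print(" ".join(retrieval_command(
        model_name="SiameseTA", domain="eclipse_test",
        title_seq=1, desc_seq=1, batch_size=1, bert_layers=1,
    )))
```

Called with the values from the example above, the helper reproduces that command; called with `model_name="SiameseTAT"`, it raises a `ValueError` instead of starting a run.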

CLASSIFICATION EXPERIMENTS

To train a model and evaluate it on the classification task, run the command shown in the example below. Note that all models are trained on the train.txt file and evaluated using test.txt.

Example of how to run classification experiment

Note that run_id_retrieval must be the ID of an already completed retrieval run.

mlflow run . --experiment-name classification -e train_classification -P run_id_retrieval=66f2b01699474634bd9e6559244c4d26 -P domain=eclipse_test -P batch_size=3 -P epochs=1

2.3 RESULTS

All experiments are recorded and can be browsed in the mlflow UI at localhost:5000 after running the command mlflow ui in a terminal. You will see the following experiment tabs:

retrieval
classification

Then, for any previous execution, the run_id created by mlflow can be used to collect its results and artifacts.
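
Results can also be read programmatically from a run_id. The sketch below is a hypothetical helper that uses the MLflow tracking client (`MlflowClient.get_run` and `MlflowClient.list_artifacts`). It assumes the default local tracking location, that is, the script is started from the same directory in which the experiments were run. Which metrics and artifacts a run actually contains depends on the experiment and is not specified here.

```python
def summarize_run(client, run_id):
    """Collect params, metrics, and top-level artifact paths for one run."""
    run = client.get_run(run_id)
    return {
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
        "artifacts": [info.path for info in client.list_artifacts(run_id)],
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        # Imported here so the helper above can be reused without mlflow.
        from mlflow.tracking import MlflowClient

        print(summarize_run(MlflowClient(), sys.argv[1]))
    else:
        print("usage: python summarize_run.py <run_id>")
```

The file name `summarize_run.py` is only an example. Pass a run_id copied from the mlflow UI as the single argument.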

Tests

Export the root directory . to PYTHONPATH

$ export PYTHONPATH=. # Linux
$ set PYTHONPATH=. # Windows

Run tests

To run all tests you will need the pretrained BERT model uncased_L-12_H-768_A-12. Download it and unpack it in the root directory.

$ pipenv run pytest tests

Run tests showing DEBUG-level log messages

$ pipenv run pytest --log-cli-level=DEBUG tests

