
Commit a8b2ba1

upload project
0 parents, commit a8b2ba1

34 files changed: +2434 additions, 0 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
data/**
models/**
venv/
activate_venv.sh
.idea
__pycache__/
test/

README.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
CuREV: Curating Review Comments for Improved Code Review Automation
===============================
This is the replication package accompanying our paper, *Curating Review Comments for Improved Code Review Automation*.

The datasets of this paper are available on [Zenodo](https://zenodo.org/records/14058666).

Overview
---
We propose a methodology to curate a code review dataset to enhance its quality and improve the performance of language models on code review downstream tasks, namely comment generation and code refinement.

The main contributions of this work are threefold:
(1) *A data-centric evaluation framework*,
(2) *A curation pipeline to improve the quality of review comments*, and
(3) *Evaluation of the curated dataset, compared to the original, on downstream tasks (i.e., comment generation and code refinement)*.

Project structure
---
The project is structured as follows.

    .
    ├── code_refinement/      # Code refinement package
    ├── comment_generation/   # Comment generation package
    ├── quality_assessment/   # Empirical study package
    ├── data_curation/        # Dataset curation package
    ├── util/                 # Package for helpers and config
    ├── data/                 # Folder for dataset and results
    ├── models/               # Folder for large language models
    ├── requirements.txt      # Required Python libraries

Environment setup
---
To facilitate usage and results replication, we include a ```requirements.txt``` file listing the required Python libraries.
Here are the instructions to create a virtual environment, activate it, and install dependencies using the provided `requirements.txt` file:

1. **Create a Virtual Environment**
   Run the following command to create a virtual environment named `venv`:
   ```bash
   python3 -m venv venv
   ```

2. **Activate the Virtual Environment**
   - On **macOS/Linux**:
     ```bash
     source venv/bin/activate
     ```
   - On **Windows**:
     ```bash
     .\venv\Scripts\activate
     ```

3. **Install Dependencies**
   With the virtual environment activated, install the required Python libraries from `requirements.txt`:
   ```bash
   pip install -r requirements.txt
   ```

4. **Verify the Installation**
   To confirm that all dependencies are installed correctly, run:
   ```bash
   pip list
   ```

5. **Deactivating the Environment**
   When you’re finished, you can deactivate the virtual environment with:
   ```bash
   deactivate
   ```

Data
---
The original code review dataset is available on [Zenodo](https://zenodo.org/records/14058666).
To run the experiments, you need to download ```Code_Refinement.zip``` and place the dataset under the ```data/``` folder.
You can use the utility method *create_HFdataset* in ```util.dataset``` to merge the downloaded JSONL files into a HuggingFace dataset.

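For reference, the merge amounts to loading the JSONL files and saving them as a HuggingFace dataset at the path the later commands expect. The sketch below uses the ```datasets``` library directly; the file layout is hypothetical, and the project's own helper is *create_HFdataset* in ```util.dataset``` (check its actual signature before use).

```python
# Minimal sketch (not the project's helper): merge the downloaded JSONL files
# into a HuggingFace dataset saved where the commands below expect it.
# The *.jsonl layout under data/Code_Refinement/ is an assumption.
from glob import glob
from datasets import load_dataset

jsonl_files = sorted(glob("data/Code_Refinement/*.jsonl"))
ds = load_dataset("json", data_files=jsonl_files, split="train")
ds.save_to_disk("data/Code_Refinement/CRdataset")
```
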
Models
---
We run *Llama-3.1-70B* on our local machines using [ExLlamaV2](https://github.com/turboderp/exllamav2) to generate accurate judgments with our defined evaluation framework.
You can use the [same model](https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4) or download a quantized version of any other model that is compatible with *ExLlamaV2*.
The downloaded model should be placed under the folder ```models/```.

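One way to fetch the quantized model into ```models/``` is with ```huggingface_hub``` (a sketch, not part of this package; the local directory name only has to match the ```--model_dir``` you pass to the scripts):

```python
# Sketch: download the GPTQ-quantized Llama-3.1-70B-Instruct into models/ so
# that --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" resolves.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",
    local_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",
)
```
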
1- A data-centric evaluation framework
---

We propose an evaluation framework to categorize and assess the quality of code reviews. It consists of (1) a **categorization scheme** to classify the *type*, *nature*, and *civility* of code review comments, and (2) **scoring criteria** to assess the overall quality of code reviews based on their *relevance*, *clarity*, and *conciseness*. We apply our evaluation framework to the largest existing dataset of code reviews. Given the scale of the dataset, we utilize a large language model (LLM) as a judge to automatically annotate samples with thoroughly designed prompts to ensure reliable and consistent annotations.

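Concretely, each review comment ends up annotated with three categorical labels and three quality scores. The record below is purely illustrative; the field names, admissible categories, and score scale are assumptions here, and the authoritative definitions live in the prompts under ```quality_assessment/```.

```python
# Purely illustrative shape of one LLM judgment (assumed field names and scale).
example_judgment = {
    "type":        "...",  # categorization scheme: what the comment addresses
    "nature":      "...",  # categorization scheme: how the feedback is phrased
    "civility":    "...",  # categorization scheme: tone of the comment
    "relevance":   0,      # scoring criteria (actual scale defined in the prompts)
    "clarity":     0,
    "conciseness": 0,
}
```
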
The experiments conducted for this contribution are available under the folder ```quality_assessment/```.

To run the LLM judgments:
```bash
python quality_assessment/inference.py \
    --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=5000
```
The full list of arguments is available in ```util/config.py```.

2- CuREV: a curated dataset for code review
---

The experiments conducted for this contribution are available under the folder ```data_curation/```.

To run the experiments for reformulating review comments:
```bash
python reformulate_reviews/inference.py \
    --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --output_path="data/eval_results/reform_results.jsonl" \
    --save_steps=5000
```
The full list of arguments is available in ```util/config.py```.

3-a. Comment generation
---

The experiments conducted for this contribution are available under the folder ```comment_generation/```.

- To train a language model on comment generation on the original dataset:
```bash
python comment_generation/sft_init.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=200 \
    --checkpoint_path="models/comment_generation/init_ckpts" \
    --output_path="models/comment_generation/final_model"
```

- To train a language model on comment generation on the curated dataset:
```bash
python comment_generation/sft_cur.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset_reform" \
    --save_steps=200 \
    --checkpoint_path="models/comment_generation/init_ckpts" \
    --output_path="models/comment_generation/final_model"
```

- To run the inference on the initial or curated dataset:
```bash
python comment_generation/hf_inference-init.py
python comment_generation/hf_inference-cur.py
```

- To run the evaluation of both models:
```bash
python comment_generation/evaluation.py
```

- The full list of arguments is available in ```util/config.py```.

3-b. Code refinement
---

The experiments conducted for this contribution are available under the folder ```code_refinement/```.

- To run the inference of a model for code refinement on the initial dataset:
```bash
python code_refinement/hf_inference-init.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=1000 \
    --output_path="models/init_coderef_results.jsonl"
```

- To run the inference of a model for code refinement on the curated dataset:
```bash
python code_refinement/hf_inference-cur.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset_reform" \
    --save_steps=1000 \
    --output_path="models/cur_coderef_results.jsonl"
```

- To run the evaluation of both models:
```bash
python code_refinement/evaluate.py
```

- The full list of arguments is available in ```util/config.py```.

__init__.py

Whitespace-only changes.

code_refinement/code_bleu.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
from typing import List
from codebleu import calc_codebleu
from transformers import AutoTokenizer

languages_map = {
    '.cs': 'c_sharp',
    'cpp': 'cpp',
    'py': 'python',
    'js': 'javascript',
    'php': 'php',
    'go': 'go',
    'rb': 'ruby',
    'c': 'c',
    'java': 'java'
}

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def compute_codebleu_avgscore(references: List[List[str]], candidates: List[str], lang: str) -> float:
    try:
        bleu = calc_codebleu(references, candidates, lang=languages_map[lang],
                             weights=(0.25, 0.25, 0.25, 0.25), tokenizer=tokenizer)
        # Fall back to weights without the data-flow component when its score is 0.
        if bleu['dataflow_match_score'] == 0:
            bleu = calc_codebleu(references, candidates, lang=languages_map[lang],
                                 weights=(1/3, 1/3, 1/3, 0), tokenizer=tokenizer)
        return bleu['codebleu']
    except Exception as e:
        print(e)
        return 0
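
For context, ```compute_codebleu_avgscore``` expects each reference wrapped in its own list, a flat list of candidates, and a language key from ```languages_map```. A hypothetical usage sketch (see ```code_refinement/evaluate.py``` for the real call site):

```python
# Hypothetical example call to the helper defined above.
from code_bleu import compute_codebleu_avgscore

refs = [["def add(a, b):\n    return a + b"]]   # one reference list per candidate
cands = ["def add(a, b):\n    return a+b"]      # model outputs
print(compute_codebleu_avgscore(refs, cands, lang='py'))
```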

code_refinement/crystal_bleu.py

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
from typing import List
from collections import Counter
from nltk.util import ngrams
from crystalbleu import corpus_bleu
import pickle
import os
from dataset import load_dataset

trivial_ngrams_path = 'trivial_ngrams.pkl'
dataset_path = '../data/Code_Refinement/CRdataset_reform'

def compute_trivial_ngrams(dataset_path, trivial_ngrams_path, column, k, n):
    # Tokenize the chosen column of the whole dataset and cache the k most
    # frequent n-grams (the "trivially shared" n-grams CrystalBLEU ignores).
    dataset = load_dataset(dataset_path)
    tokenized_corpus = []
    for d in dataset:
        tokenized_corpus.extend(d[column].split())

    all_ngrams = []
    for order in range(1, n + 1):
        all_ngrams.extend(list(ngrams(tokenized_corpus, order)))

    frequencies = Counter(all_ngrams)
    trivially_shared_ngrams = dict(frequencies.most_common(k))

    with open(trivial_ngrams_path, 'wb') as f:
        pickle.dump(trivially_shared_ngrams, f)

    return trivially_shared_ngrams

def get_trivial_ngrams(column='oldf', k=500, n=4):
    # Reuse the cached trivially shared n-grams if present, otherwise compute them once.
    if os.path.exists(trivial_ngrams_path):
        with open(trivial_ngrams_path, 'rb') as f:
            trivial_ngrams = pickle.load(f)
    else:
        trivial_ngrams = compute_trivial_ngrams(dataset_path, trivial_ngrams_path, column, k, n)
    return trivial_ngrams

def compute_crystalBLEU_avgscore(references: List[List[str]], candidates: List[str], lang):
    trivial_ngrams = get_trivial_ngrams()
    crystalBLEU_score = corpus_bleu(
        references, candidates, ignoring=trivial_ngrams)
    return crystalBLEU_score

code_refinement/evaluate.py

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
import logging
logging.getLogger().setLevel(logging.ERROR)
from code_bleu import compute_codebleu_avgscore
from crystal_bleu import compute_crystalBLEU_avgscore
from dataset import load_dataset

languages = ['.cs', 'cpp', 'py', 'js', 'php', 'go', 'rb', 'c', 'java']

def is_exactMatch(true_code, gen_code):
    # Keep only the changed lines (+/-) and normalize whitespace before comparing.
    true_code = true_code.strip().split('\n')
    true_code = [c.strip() for c in true_code if c.strip().startswith('+') or c.strip().startswith('-')]
    true_code = '\n'.join(true_code)
    true_code = ' '.join(true_code.split())

    gen_code = gen_code.strip().split('\n')
    gen_code = [c.strip() for c in gen_code if c.strip().startswith('+') or c.strip().startswith('-')]
    gen_code = '\n'.join(gen_code)
    gen_code = ' '.join(gen_code.split())

    return true_code == gen_code

def preprocess_code(code):
    # Keep only the changed lines (+/-) and normalize whitespace.
    code = code.strip().split('\n')
    code = [c.strip() for c in code if c.strip().startswith('+') or c.strip().startswith('-')]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code

def preprocess_gen_code(code, hunk):
    code = code.strip().split('\n')
    code = [c.strip() for c in code]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code

def preprocess_hunk(code):
    code = code.strip().split('\n')
    code = [c.strip() for c in code]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code


def evaluate(data):
    # Group the samples by programming language.
    samples = []
    for lang in languages:
        temp = []
        for d in data:
            if d['lang'] == lang:
                temp.append(d)
        samples.append(temp)

    codebleus = []
    exact_matches = []

    for i, sample in enumerate(samples):
        references = [[preprocess_code(example['hunk'])] for example in sample]
        candidates = [preprocess_code(example['generated_code']) for example in sample]
        temp = [compute_codebleu_avgscore([reference], [candidate], languages[i]) for reference, candidate in zip(references, candidates)]
        codebleu = sum(temp) / len(temp) if temp else 0
        # codebleu = compute_crystalBLEU_avgscore(references, candidates, lang=languages[i])
        codebleus.append(codebleu)

        # Calculate Exact Match (count of generated hunks identical to the reference)
        exact_match = sum(1 for ref, cand in zip(references, candidates) if ref[0] == cand)
        exact_matches.append(exact_match)

    # Print CodeBLEU and Exact Match results
    print("CodeBLEU scores per language:", codebleus)
    print("Average CodeBLEU:", sum(codebleus) / len(codebleus))
    print("Exact Match counts per language:", exact_matches)
    print("Total Exact Match:", sum(exact_matches))


if __name__ == '__main__':
    print('### Initial dataset ###')
    data = load_dataset('../data/refinement_results/final/init_refinement_20k')
    evaluate(data)

    print('### Curated dataset ###')
    data = load_dataset('../data/refinement_results/final/cur_refinement_20k')
    evaluate(data)
