
Commit a8b2ba1

upload project
0 parents, commit a8b2ba1

34 files changed: +2434 additions, 0 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
data/**
models/**
venv/
activate_venv.sh
.idea
__pycache__/
test/

README.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
CuREV: Curating Review Comments for Improved Code Review Automation
===============================
This is the replication package accompanying our paper, *Curating Review Comments for Improved Code Review Automation*.

The datasets of this paper are available on [Zenodo](https://zenodo.org/records/14058666).

Overview
---
We propose a methodology to curate a code review dataset to enhance its quality and improve the performance of language models on code review downstream tasks, namely comment generation and code refinement.

The main contributions of this work are threefold:
(1) *A data-centric evaluation framework*,
(2) *A curation pipeline to improve the quality of review comments*, and
(3) *Evaluation of the curated dataset, compared to the original, on downstream tasks (i.e., comment generation and code refinement)*.

Project structure
---
The project is structured as follows.

    .
    ├── code_refinement/      # Code refinement package
    ├── comment_generation/   # Comment generation package
    ├── quality_assessment/   # Empirical study package
    ├── data_curation/        # Dataset curation package
    ├── util/                 # Package for helpers and config
    ├── data/                 # Folder for dataset and results
    ├── models/               # Folder for large language models
    ├── requirements.txt      # Required Python libraries

Environment setup
---
To facilitate usage and results replication, we include a ```requirements.txt``` file listing the required Python libraries.
Here are the instructions to create a virtual environment, activate it, and install dependencies using the provided `requirements.txt` file:

1. **Create a Virtual Environment**
   Run the following command to create a virtual environment named `venv`:
   ```bash
   python3 -m venv venv
   ```

2. **Activate the Virtual Environment**
   - On **macOS/Linux**:
     ```bash
     source venv/bin/activate
     ```
   - On **Windows**:
     ```bash
     .\venv\Scripts\activate
     ```

3. **Install Dependencies**
   With the virtual environment activated, install the required Python libraries from `requirements.txt`:
   ```bash
   pip install -r requirements.txt
   ```

4. **Verify the Installation**
   To confirm that all dependencies are installed correctly, run:
   ```bash
   pip list
   ```

5. **Deactivating the Environment**
   When you’re finished, you can deactivate the virtual environment with:
   ```bash
   deactivate
   ```

Data
---
The original code review dataset is available on [Zenodo](https://zenodo.org/records/14058666).
To run the experiments, you need to download ```Code_Refinement.zip``` and place the dataset under the ```data/``` folder.
You can use the utility method *create_HFdataset* in ```util.dataset``` to merge the downloaded JSONL files into a HuggingFace dataset.

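For reference, the merge amounts to loading the JSONL files and saving them as a HuggingFace dataset at the path the later commands expect. The sketch below uses the ```datasets``` library directly; the file layout is hypothetical, and the project's own helper is *create_HFdataset* in ```util.dataset``` (check its actual signature before use).

```python
# Minimal sketch (not the project's helper): merge the downloaded JSONL files
# into a HuggingFace dataset saved where the commands below expect it.
# The *.jsonl layout under data/Code_Refinement/ is an assumption.
from glob import glob
from datasets import load_dataset

jsonl_files = sorted(glob("data/Code_Refinement/*.jsonl"))
ds = load_dataset("json", data_files=jsonl_files, split="train")
ds.save_to_disk("data/Code_Refinement/CRdataset")
```
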
Models
---
We run *Llama-3.1-70B* on our local machines using [ExLlamaV2](https://github.com/turboderp/exllamav2) to generate accurate judgments with our defined evaluation framework.
You can use the [same model](https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4) or download a quantized version of any other model that is compatible with *ExLlamaV2*.
The downloaded model should be placed under the folder ```models/```.

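One way to fetch the quantized model into ```models/``` is with ```huggingface_hub``` (a sketch, not part of this package; the local directory name only has to match the ```--model_dir``` you pass to the scripts):

```python
# Sketch: download the GPTQ-quantized Llama-3.1-70B-Instruct into models/ so
# that --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" resolves.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",
    local_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",
)
```
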
1- A data-centric evaluation framework
---

We propose an evaluation framework to categorize and assess the quality of code reviews. It consists of (1) a **categorization scheme** to classify the *type*, *nature*, and *civility* of code review comments, and (2) **scoring criteria** to assess the overall quality of code reviews based on their *relevance*, *clarity*, and *conciseness*. We apply our evaluation framework to the largest existing dataset of code reviews. Given the scale of the dataset, we utilize a large language model (LLM) as a judge to automatically annotate samples with thoroughly designed prompts to ensure reliable and consistent annotations.

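Concretely, each review comment ends up annotated with three categorical labels and three quality scores. The record below is purely illustrative; the field names, admissible categories, and score scale are assumptions here, and the authoritative definitions live in the prompts under ```quality_assessment/```.

```python
# Purely illustrative shape of one LLM judgment (assumed field names and scale).
example_judgment = {
    "type":        "...",  # categorization scheme: what the comment addresses
    "nature":      "...",  # categorization scheme: how the feedback is phrased
    "civility":    "...",  # categorization scheme: tone of the comment
    "relevance":   0,      # scoring criteria (actual scale defined in the prompts)
    "clarity":     0,
    "conciseness": 0,
}
```
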
The experiments conducted for this contribution are available under the folder ```quality_assessment/```.

To run the LLM judgments:
```bash
python quality_assessment/inference.py \
    --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=5000
```
The full list of arguments is available in ```util/config.py```.

2- CuREV: a curated dataset for code review
---

The experiments conducted for this contribution are available under the folder ```data_curation/```.

To run the experiments for reformulating review comments:
```bash
python reformulate_reviews/inference.py \
    --model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --output_path="data/eval_results/reform_results.jsonl" \
    --save_steps=5000
```
The full list of arguments is available in ```util/config.py```.

3-a. Comment generation
---

The experiments conducted for this contribution are available under the folder ```comment_generation/```.

- To train a language model on comment generation on the original dataset:
```bash
python comment_generation/sft_init.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=200 \
    --checkpoint_path="models/comment_generation/init_ckpts" \
    --output_path="models/comment_generation/final_model"
```

- To train a language model on comment generation on the curated dataset:
```bash
python comment_generation/sft_cur.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset_reform" \
    --save_steps=200 \
    --checkpoint_path="models/comment_generation/init_ckpts" \
    --output_path="models/comment_generation/final_model"
```

- To run the inference on the initial or curated dataset:
```bash
python comment_generation/hf_inference-init.py
python comment_generation/hf_inference-cur.py
```

- To run the evaluation of both models:
```bash
python comment_generation/evaluation.py
```

- The full list of arguments is available in ```util/config.py```.

3-b. Code refinement
---

The experiments conducted for this contribution are available under the folder ```code_refinement/```.

- To run the inference of a model for code refinement on the initial dataset:
```bash
python code_refinement/hf_inference-init.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset" \
    --save_steps=1000 \
    --output_path="models/init_coderef_results.jsonl"
```

- To run the inference of a model for code refinement on the curated dataset:
```bash
python code_refinement/hf_inference-cur.py \
    --model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
    --dataset_path="data/Code_Refinement/CRdataset_reform" \
    --save_steps=1000 \
    --output_path="models/cur_coderef_results.jsonl"
```

- To run the evaluation of both models:
```bash
python code_refinement/evaluate.py
```

- The full list of arguments is available in ```util/config.py```.

__init__.py

Whitespace-only changes.

code_refinement/code_bleu.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
from typing import List
from codebleu import calc_codebleu
from transformers import AutoTokenizer

languages_map = {
    '.cs': 'c_sharp',
    'cpp': 'cpp',
    'py': 'python',
    'js': 'javascript',
    'php': 'php',
    'go': 'go',
    'rb': 'ruby',
    'c': 'c',
    'java': 'java'
}

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def compute_codebleu_avgscore(references: List[List[str]], candidates: List[str], lang: str) -> float:
    try:
        bleu = calc_codebleu(references, candidates, lang=languages_map[lang],
                             weights=(0.25, 0.25, 0.25, 0.25), tokenizer=tokenizer)
        # Fall back to weights without the data-flow component when its score is 0.
        if bleu['dataflow_match_score'] == 0:
            bleu = calc_codebleu(references, candidates, lang=languages_map[lang],
                                 weights=(1/3, 1/3, 1/3, 0), tokenizer=tokenizer)
        return bleu['codebleu']
    except Exception as e:
        print(e)
        return 0
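
For context, ```compute_codebleu_avgscore``` expects each reference wrapped in its own list, a flat list of candidates, and a language key from ```languages_map```. A hypothetical usage sketch (see ```code_refinement/evaluate.py``` for the real call site):

```python
# Hypothetical example call to the helper defined above.
from code_bleu import compute_codebleu_avgscore

refs = [["def add(a, b):\n    return a + b"]]   # one reference list per candidate
cands = ["def add(a, b):\n    return a+b"]      # model outputs
print(compute_codebleu_avgscore(refs, cands, lang='py'))
```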

code_refinement/crystal_bleu.py

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
from typing import List
from collections import Counter
from nltk.util import ngrams
from crystalbleu import corpus_bleu
import pickle
import os
from dataset import load_dataset

trivial_ngrams_path = 'trivial_ngrams.pkl'
dataset_path = '../data/Code_Refinement/CRdataset_reform'

def compute_trivial_ngrams(dataset_path, trivial_ngrams_path, column, k, n):
    # Tokenize the chosen column of the whole dataset and cache the k most
    # frequent n-grams (the "trivially shared" n-grams CrystalBLEU ignores).
    dataset = load_dataset(dataset_path)
    tokenized_corpus = []
    for d in dataset:
        tokenized_corpus.extend(d[column].split())

    all_ngrams = []
    for order in range(1, n + 1):
        all_ngrams.extend(list(ngrams(tokenized_corpus, order)))

    frequencies = Counter(all_ngrams)
    trivially_shared_ngrams = dict(frequencies.most_common(k))

    with open(trivial_ngrams_path, 'wb') as f:
        pickle.dump(trivially_shared_ngrams, f)

    return trivially_shared_ngrams

def get_trivial_ngrams(column='oldf', k=500, n=4):
    # Reuse the cached trivially shared n-grams if present, otherwise compute them once.
    if os.path.exists(trivial_ngrams_path):
        with open(trivial_ngrams_path, 'rb') as f:
            trivial_ngrams = pickle.load(f)
    else:
        trivial_ngrams = compute_trivial_ngrams(dataset_path, trivial_ngrams_path, column, k, n)
    return trivial_ngrams

def compute_crystalBLEU_avgscore(references: List[List[str]], candidates: List[str], lang):
    trivial_ngrams = get_trivial_ngrams()
    crystalBLEU_score = corpus_bleu(
        references, candidates, ignoring=trivial_ngrams)
    return crystalBLEU_score

code_refinement/evaluate.py

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
import logging
logging.getLogger().setLevel(logging.ERROR)
from code_bleu import compute_codebleu_avgscore
from crystal_bleu import compute_crystalBLEU_avgscore
from dataset import load_dataset

languages = ['.cs', 'cpp', 'py', 'js', 'php', 'go', 'rb', 'c', 'java']

def is_exactMatch(true_code, gen_code):
    # Keep only the changed lines (+/-) and normalize whitespace before comparing.
    true_code = true_code.strip().split('\n')
    true_code = [c.strip() for c in true_code if c.strip().startswith('+') or c.strip().startswith('-')]
    true_code = '\n'.join(true_code)
    true_code = ' '.join(true_code.split())

    gen_code = gen_code.strip().split('\n')
    gen_code = [c.strip() for c in gen_code if c.strip().startswith('+') or c.strip().startswith('-')]
    gen_code = '\n'.join(gen_code)
    gen_code = ' '.join(gen_code.split())

    return true_code == gen_code

def preprocess_code(code):
    # Keep only the changed lines (+/-) and normalize whitespace.
    code = code.strip().split('\n')
    code = [c.strip() for c in code if c.strip().startswith('+') or c.strip().startswith('-')]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code

def preprocess_gen_code(code, hunk):
    code = code.strip().split('\n')
    code = [c.strip() for c in code]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code

def preprocess_hunk(code):
    code = code.strip().split('\n')
    code = [c.strip() for c in code]
    code = '\n'.join(code)
    code = ' '.join(code.split())
    return code


def evaluate(data):
    # Group the samples by programming language.
    samples = []
    for lang in languages:
        temp = []
        for d in data:
            if d['lang'] == lang:
                temp.append(d)
        samples.append(temp)

    codebleus = []
    exact_matches = []

    for i, sample in enumerate(samples):
        references = [[preprocess_code(example['hunk'])] for example in sample]
        candidates = [preprocess_code(example['generated_code']) for example in sample]
        temp = [compute_codebleu_avgscore([reference], [candidate], languages[i]) for reference, candidate in zip(references, candidates)]
        codebleu = sum(temp) / len(temp) if temp else 0
        # codebleu = compute_crystalBLEU_avgscore(references, candidates, lang=languages[i])
        codebleus.append(codebleu)

        # Calculate Exact Match (count of generated hunks identical to the reference)
        exact_match = sum(1 for ref, cand in zip(references, candidates) if ref[0] == cand)
        exact_matches.append(exact_match)

    # Print CodeBLEU and Exact Match results
    print("CodeBLEU scores per language:", codebleus)
    print("Average CodeBLEU:", sum(codebleus) / len(codebleus))
    print("Exact Match counts per language:", exact_matches)
    print("Total Exact Match:", sum(exact_matches))


if __name__ == '__main__':
    print('### Initial dataset ###')
    data = load_dataset('../data/refinement_results/final/init_refinement_20k')
    evaluate(data)

    print('### Curated dataset ###')
    data = load_dataset('../data/refinement_results/final/cur_refinement_20k')
    evaluate(data)
