# Evaluation of CodeBertScore

This folder contains the full pipeline for evaluating CodeBertScore's correlation with functional correctness.

```sh
cd evaluation
LANG=java       # or: cpp, python, js
MODEL_LANG=java # if LANG is js, use javascript
LAYER=7
```

## 1. Data preparation

We construct a multilingual HumanEval dataset from MultiPL-E and HumanEval-X.

```sh
python process_data.py \
    --lang $LANG \
    --config davinci-0.8-keep
```

This script takes:

  1. the generation results provided by MultiPL-E (example), and
  2. the reference code in the corresponding language from HumanEval-X (example),

and constructs parallel text files of source, reference, and target (example).
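The pairing step above can be sketched as follows. This is an illustrative toy version, not the actual logic of `process_data.py`; the function name, the dict-based inputs, and the `.src`/`.ref`/`.hyp` file extensions are assumptions for demonstration.

```python
import os
import tempfile

def write_parallel_files(generations, references, out_dir, prefix="humaneval"):
    """Pair sampled generations with their gold reference, one line per sample.

    generations: problem id -> list of generated code strings (e.g. from MultiPL-E)
    references:  problem id -> gold reference code (e.g. from HumanEval-X)
    """
    src, ref, hyp = [], [], []
    for pid, samples in generations.items():
        for code in samples:
            src.append(pid)
            # Escape newlines so each example stays on a single line.
            ref.append(references[pid].replace("\n", "\\n"))
            hyp.append(code.replace("\n", "\\n"))
    for ext, lines in [("src", src), ("ref", ref), ("hyp", hyp)]:
        with open(os.path.join(out_dir, f"{prefix}.{ext}"), "w") as f:
            f.write("\n".join(lines) + "\n")

# Tiny usage example with dummy data: two samples for one problem.
gens = {"HumanEval_0": ["return a + b", "return b + a"]}
refs = {"HumanEval_0": "return a + b"}
out = tempfile.mkdtemp()
write_parallel_files(gens, refs, out)
hyp_lines = open(os.path.join(out, "humaneval.hyp")).read().splitlines()
ref_lines = open(os.path.join(out, "humaneval.ref")).read().splitlines()
```

Keeping the three files line-aligned is what lets the scoring step read candidate and reference pairs back by line index.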

## 2. Calculate CodeBertScore

```sh
python run_score.py \
    --lang $LANG \
    --model neulab/codebert-$MODEL_LANG \
    --device cuda:0 \
    --d_folder data/humaneval_${LANG}_davinci-0.8-keep \
    --d_prefix humaneval \
    --idf_path data/idf/${LANG}_idf.pkl \
    --layer $LAYER
```

Note the `${LANG}` braces: without them, the shell would expand `$LANG_davinci` as a (nonexistent) variable named `LANG_davinci`.

The detailed configurations for each language are provided here.
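At its core, the score is a BERTScore-style greedy matching over contextual token embeddings. The sketch below is a minimal toy version of that idea, not the code in `run_score.py`: cosine similarities between candidate and reference token embeddings, precision/recall from row/column maxima, an optional IDF weighting over reference tokens (as the `--idf_path` flag suggests), and an F1 combination. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def bertscore_f1(cand, ref, idf_ref=None):
    """Greedy-matching F1 over token embeddings.

    cand: (m, d) candidate token embeddings
    ref:  (n, d) reference token embeddings
    idf_ref: optional (n,) IDF weights for the reference tokens
    """
    # Normalize rows so dot products become cosine similarities.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                       # (m, n) similarity matrix

    precision = sim.max(axis=1).mean()       # best reference match per candidate token
    if idf_ref is None:
        recall = sim.max(axis=0).mean()      # best candidate match per reference token
    else:
        w = np.asarray(idf_ref, dtype=float)
        recall = float(sim.max(axis=0) @ (w / w.sum()))
    return 2 * precision * recall / (precision + recall)
```

With identical candidate and reference embeddings, every token matches itself perfectly and the score is 1; the `--layer` flag above selects which encoder layer's embeddings play the role of `cand` and `ref`.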

## 3. Calculate correlation with functional correctness

```sh
python calculate_correlation.py \
    --lang $LANG \
    --d_folder data/humaneval_${LANG}_davinci-0.8-keep \
    --d_prefix humaneval \
    --result_file humaneval_codebert-${MODEL_LANG}_L${LAYER}_idf.score.json
```

It will output the Kendall's tau, Spearman, and Pearson correlations with functional correctness.
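For reference, the three measures can be computed in pure Python as below. This is a dependency-free sketch of what the correlations mean, not the implementation in `calculate_correlation.py` (which may handle ties and edge cases differently); the toy score/pass-rate lists are made up.

```python
from itertools import combinations

def pearson(x, y):
    """Linear correlation of paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Rank correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs over all pairs."""
    c = d = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1
        elif s < 0:
            d += 1
    return (c - d) / (len(x) * (len(x) - 1) / 2)

scores = [0.61, 0.72, 0.55, 0.90]   # hypothetical CodeBertScore values
correct = [0.25, 0.50, 0.00, 1.00]  # hypothetical functional-correctness rates
# Perfectly monotone toy data, so tau and Spearman are both 1.0 here.
tau = kendall_tau(scores, correct)
rho = spearman(scores, correct)
r = pearson(scores, correct)
```

A high rank correlation means the metric orders candidate programs the same way their pass rates do, even if the raw score values are on a different scale.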