PairEval

Official Code Repository for the paper "PAIREVAL: Open-domain Dialogue Evaluation with Pairwise Comparison" (COLM 2024).

Abstract

Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PAIREVAL, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. PAIREVAL is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems, including repetition and speaker insensitivity.

QuickStart

Install the following packages.

torch
transformers
accelerate
bitsandbytes
scipy
tqdm
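
Before running anything, it can help to confirm that the packages above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
import importlib.util

# Package names as listed in this README.
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes", "scipy", "tqdm"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks the packages up without importing them, so it stays fast even for large libraries such as torch.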

Download our LoRA checkpoints and datasets from here and place them in the main directory.
Obtain access to meta-llama/Llama-2-7b-chat-hf.
Run the following command to evaluate PairEval on the preprocessed turn-level FED meta-evaluation dataset released by this paper.

python inference.py

Check the evaluation results in the output/ directory.
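
Meta-evaluation of a dialogue metric usually means correlating the metric's scores with human judgments. The sketch below shows one common choice, Spearman rank correlation, in pure Python. It is an illustration only: this README does not state which correlation or output format the repository uses, and the function names are hypothetical. `scipy.stats.spearmanr` computes the same coefficient and also returns a p-value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Because only the ranks matter, the metric scores and the human ratings do not need to be on the same scale.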

Evaluation of Custom Dataset

Please reformat your dataset to match the format of data/evaluaton/fed_turn.jsonl.
Then change the --eval_data_name argument in args.py.
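
One way to check the reformatting is to compare the field names of each record in your file with those of the reference file. The sketch below assumes only that both files are JSON Lines with one JSON object per line; it does not assume any particular field names, and the helper names are hypothetical.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_key_mismatches(custom_path, reference_path):
    """Compare each custom record's keys with the first reference record.

    Returns a list of (record_index, missing_keys, extra_keys) tuples,
    one per custom record whose keys differ from the reference.
    """
    reference_keys = set(read_jsonl(reference_path)[0])
    mismatches = []
    for index, record in enumerate(read_jsonl(custom_path)):
        keys = set(record)
        if keys != reference_keys:
            missing = sorted(reference_keys - keys)
            extra = sorted(keys - reference_keys)
            mismatches.append((index, missing, extra))
    return mismatches
```

For example, `find_key_mismatches("my_data.jsonl", "data/evaluaton/fed_turn.jsonl")` returns an empty list when every record has the same fields as the reference; `my_data.jsonl` is a placeholder file name. The check covers field names only, not value types or content.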

FAQ

Please open an issue on this repository or contact [email protected] directly.
