Language Model Preference Evaluation with Multiple Weak Evaluators

This paper introduces GED (Preference Graph Ensemble and Denoise), a method for improving the evaluation of large language model (LLM) outputs. GED ensembles multiple weak evaluators and applies denoising to resolve cyclic inconsistencies in the resulting preference graphs, yielding more reliable, non-contradictory preference evaluations.
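To make the idea concrete, the following is a minimal sketch of ensembling and denoising preference graphs. It is not the code in this repository: the function names `ensemble_graphs` and `denoise`, the edge-dictionary representation, and the brute-force search over orderings are all assumptions chosen for illustration, and the search is only practical for a handful of candidates.

```python
from collections import defaultdict
from itertools import permutations


def ensemble_graphs(graphs):
    """Sum edge weights across evaluators (hypothetical helper).

    Each graph maps a (winner, loser) pair to a vote weight.
    """
    combined = defaultdict(float)
    for graph in graphs:
        for edge, weight in graph.items():
            combined[edge] += weight
    return dict(combined)


def denoise(nodes, edges):
    """Drop the lowest-weight set of edges that makes the graph acyclic.

    Assumption: brute force over all orderings, which is exact but only
    feasible for very small candidate sets.
    """
    best_order, best_cost = None, float("inf")
    for order in permutations(nodes):
        position = {node: i for i, node in enumerate(order)}
        # Edges pointing backward in this ordering would have to be removed.
        cost = sum(w for (u, v), w in edges.items() if position[u] > position[v])
        if cost < best_cost:
            best_order, best_cost = order, cost
    position = {node: i for i, node in enumerate(best_order)}
    dag = {(u, v): w for (u, v), w in edges.items() if position[u] < position[v]}
    return list(best_order), dag


# Three weak evaluators; the second one produces the cycle A > B > C > A.
evaluator_a = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1}
evaluator_b = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): 1}
evaluator_c = {("A", "B"): 1, ("C", "B"): 1, ("A", "C"): 1}

combined = ensemble_graphs([evaluator_a, evaluator_b, evaluator_c])
ranking, dag = denoise(["A", "B", "C"], combined)
print(ranking)
print(dag)
```

In this toy input the minority edges C > A and C > B are removed, leaving an acyclic graph consistent with the ordering A, B, C.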

Setup

Install all required dependencies to ensure all scripts function correctly.

pip install -r requirements.txt

Rank result generation

python rank_gen.py \
    --eval_model $eval_model \
    --answer_model $answer_model \
    --task_name $task_name \
    --w_type $w_type \
    --rank_type $rank_type

--eval_model: The model used for evaluation (e.g., 'llama3-8b').
--answer_model: The model that generates the answers (e.g., 'qwen1.5-32b').
--task_name: The evaluation task (e.g., '10k-ultra').
--rank_type: The ranking method (e.g., 'pairwise_majority').
--ensemble_type: The ensemble method (e.g., 'graph_ensemble').

The script denoises conflicting evaluations from the weak evaluators and generates the updated rankings.
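The README does not define the 'pairwise_majority' ranking method, so the sketch below shows only one common reading of the name: each pair of answers is decided by a majority vote among evaluators, and answers are ranked by the number of pairwise contests they win. The function `pairwise_majority_rank` and the vote format are hypothetical and are not taken from rank_gen.py.

```python
from collections import Counter
from itertools import combinations


def pairwise_majority_rank(candidates, votes):
    """Rank candidates by the number of pairwise majority wins.

    votes: one dict per evaluator, mapping an (a, b) pair to the
    candidate that evaluator preferred (assumed format).
    """
    wins = Counter({c: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        tally = Counter(v[(a, b)] for v in votes if (a, b) in v)
        if tally[a] > tally[b]:
            wins[a] += 1
        elif tally[b] > tally[a]:
            wins[b] += 1
        # An exact tie awards no win to either candidate.
    # Ties in win counts are broken by the original candidate order.
    return sorted(candidates, key=lambda c: (-wins[c], candidates.index(c)))


votes = [
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"},
    {("A", "B"): "B", ("A", "C"): "A", ("B", "C"): "C"},
]
result = pairwise_majority_rank(["A", "B", "C"], votes)
print(result)
```

Each evaluator here disagrees with the others on one pair, yet the majority on every pair is consistent, so the vote recovers the ordering A, B, C.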

