GitHub - CyberAgentAILab/regularized-bon: Code of "Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment" (2024).

Regularized Best-of-N

Implementation of Regularized Best-of-N (RBoN).

The code is tested on Ubuntu 20.04 using Python 3.8 and CUDA 11.0 (Docker image nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04).

git clone git@github.com:CyberAgentAILab/regularized-bon
cd regularized-bon
pip install -r requirements.txt

Usage

Running RBoN takes four steps, in this order:

1. Generate a set of responses with sample.sh. The same set of samples is used for all algorithms so that the comparison is fair.
2. Compute the Wasserstein distance and the KL divergence with compute_wd.sh and compute_logprob.sh.
3. Compute the reward of the responses with compute_reward.sh.
4. Run mbr/compute_rbon.py to compute RBoN-WD and RBoN-KL.

The results are written to a CSV file in the results/ directory.
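
The steps above can be chained from a small driver script. The following is a minimal sketch, not part of the repository: the function names build_pipeline and run_pipeline are hypothetical, the sample count of 8 is an arbitrary placeholder, and the flags are copied from the commands listed in the sections below. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Reward models taken from the "Computing the reward of the samples" commands.
REWARD_MODELS = [
    "stanfordnlp/SteamSHP-flan-t5-large",
    "OpenAssistant/reward-model-deberta-v3-large-v2",
]


def build_pipeline(dataset, n_samples):
    """Return the README commands, in order, as argument lists."""
    n = str(n_samples)
    common = ["-d", dataset, "-s", n]
    steps = [
        ["./experiments/sample.sh", *common],
        ["./experiments/compute_wd.sh", *common],
        ["./experiments/compute_logprob.sh", *common],
    ]
    for model in REWARD_MODELS:
        steps.append(["./experiments/compute_reward.sh", *common, "-i", model])
    steps.append(
        ["python3", "mbr/compute_rbon.py", "--dataset", dataset, "--ncandidates", n]
    )
    return steps


def run_pipeline(dataset, n_samples, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_pipeline(dataset, n_samples):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("alpaca", 8)  # 8 is a placeholder sample count
```

Call run_pipeline("alpaca", 8, dry_run=False) from the repository root to actually execute the steps.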

Sampling candidates

By default, sampling uses openai-community/gpt2. Pass -m [MODEL NAME IN HUGGINGFACE HUB] to use a different language model.

./experiments/sample.sh -d alpaca -s [NUMBER OF SAMPLES]

Computing Wasserstein distance

./experiments/compute_wd.sh -d alpaca -s [NUMBER OF SAMPLES]
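
As background for this step: the Wasserstein distance between a point mass on one candidate and the empirical distribution over all N samples reduces to the mean cost from that candidate to every sample, because the only possible coupling moves all mass from that single point. The sketch below illustrates this identity only. The cosine distance over embedding vectors and the function names are assumptions made for illustration; compute_wd.sh may use a different cost function and implementation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def point_mass_wd(embeddings, cost=cosine_distance):
    """For each candidate i, return the mean cost from i to all N samples."""
    n = len(embeddings)
    return [
        sum(cost(embeddings[i], embeddings[j]) for j in range(n)) / n
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy 2-D "embeddings" (assumed): two identical samples and one outlier.
    emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(point_mass_wd(emb))
```

In the toy input the outlier receives the largest distance, so a regularizer based on this quantity would favor candidates that sit close to the bulk of the samples.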

Computing log probability

./experiments/compute_logprob.sh -d alpaca -s [NUMBER OF SAMPLES]
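
The log probability matters for the KL term because the KL divergence from a point mass on a response y to a reference distribution equals the negative log probability of y under that reference. A minimal sketch of how a sequence log probability is obtained from per-token logits follows. It is a stand-alone illustration in pure Python, not the code behind compute_logprob.sh; the function names and the toy logits are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_logprob(step_logits, token_ids):
    """Sum of log p(token_t | prefix) over the response tokens.

    step_logits[t] holds the model's logits over the vocabulary at step t,
    and token_ids[t] is the token that was actually generated at that step.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per response token")
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )


if __name__ == "__main__":
    # Toy vocabulary of size 2 with uniform logits at both steps.
    print(sequence_logprob([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```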

Computing the reward of the samples

./experiments/compute_reward.sh -d alpaca -s [NUMBER OF SAMPLES] -i stanfordnlp/SteamSHP-flan-t5-large
./experiments/compute_reward.sh -d alpaca -s [NUMBER OF SAMPLES] -i OpenAssistant/reward-model-deberta-v3-large-v2

Computing RBoN

python3 mbr/compute_rbon.py --dataset alpaca --ncandidates [NUMBER OF SAMPLES]
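
Conceptually, this step combines the outputs of the previous three: each candidate's reward is offset by a proximity penalty scaled by a regularization strength. The sketch below shows one reading of that selection rule, with plain Best-of-N as the special case of zero regularization. It is not taken from mbr/compute_rbon.py; the function names, the parameter beta, and the toy numbers are assumptions for illustration.

```python
def best_of_n(rewards):
    """Plain Best-of-N: index of the highest-reward candidate."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


def rbon_kl(rewards, logprobs, beta):
    """argmax_i reward_i + beta * log pi_ref(y_i | x).

    The KL divergence from a point mass on y_i to pi_ref is -log pi_ref(y_i | x),
    so subtracting beta * KL adds beta * logprob.
    """
    return max(range(len(rewards)), key=lambda i: rewards[i] + beta * logprobs[i])


def rbon_wd(rewards, wds, beta):
    """argmax_i reward_i - beta * wd_i, where wd_i is the Wasserstein term."""
    return max(range(len(rewards)), key=lambda i: rewards[i] - beta * wds[i])


if __name__ == "__main__":
    # Toy values (assumed): candidate 0 has the top reward but is unlikely
    # under the reference model and far from the other samples.
    rewards = [1.0, 0.9, 0.2]
    logprobs = [-30.0, -10.0, -12.0]
    wds = [0.8, 0.3, 0.4]
    print(best_of_n(rewards))
    print(rbon_kl(rewards, logprobs, beta=0.1))
    print(rbon_wd(rewards, wds, beta=1.0))
```

With these toy numbers, Best-of-N picks candidate 0, while both regularized variants pick candidate 1, whose reward is nearly as high but which stays closer to the reference model.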

Reference

Jinnai, Y., Morimura, T., Ariu, K., and Abe, K. Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment. ICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024.

Bibtex:

@inproceedings{
jinnai2024regularized,
title={Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment},
author={Yuu Jinnai and Tetsuro Morimura and Kaito Ariu and Kenshi Abe},
booktitle={ICML 2024 Workshop on Models of Human Feedback for AI Alignment},
year={2024},
url={https://openreview.net/forum?id=ewRlZPAReR}
}

Contact

For any questions, feel free to raise an issue or contact me at [email protected].
