Add ORPO within README.md files (#154)
* Add `ORPO` within `scripts/README.md`

* Fix typo in `ModelArguments.base_model_revision`

* Add `ORPO` within `README.md`

* Add Zephyr 141B in "News" section
alvarobartt authored Apr 25, 2024
1 parent 70769f9 commit cf1975a
Showing 3 changed files with 14 additions and 7 deletions.
README.md: 4 changes (3 additions, 1 deletion)
@@ -19,6 +19,7 @@ However, we know from the [InstructGPT](https://huggingface.co/papers/2203.02155
The Alignment Handbook aims to fill that gap by providing the community with a series of robust training recipes that span the whole pipeline.

## News 🗞️
* **April 12, 2024:** We release Zephyr 141B (A35B), in collaboration with Argilla and KAIST AI, along with the recipe to fine-tune Mixtral 8x22B with ORPO 🪁
* **March 12, 2024:** We release StarChat2 15B, along with the recipe to train capable coding assistants 🌟
* **March 1, 2024:** We release Zephyr 7B Gemma, which is a new recipe to align Gemma 7B with RLAIF 🔥
* **February 1, 2024:** We release a recipe to align open LLMs with Constitutional AI 📜! See the [recipe](https://github.com/huggingface/alignment-handbook/tree/main/recipes/constitutional-ai) and the [blog post](https://huggingface.co/blog/constitutional_ai) for details.
@@ -33,7 +34,7 @@ The Alignment Handbook aims to fill that gap by providing the community with a s

This project is simple by design and mostly consists of:

* [`scripts`](./scripts/) to train and evaluate models. Three steps are included: continued pretraining, supervised-finetuning (SFT) for chat, and preference alignment with DPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
* [`scripts`](./scripts/) to train and evaluate models. Four steps are included: continued pretraining, supervised-finetuning (SFT) for chat, preference alignment with DPO, and supervised-finetuning combined with preference alignment via ORPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
* [`recipes`](./recipes/) to reproduce models like Zephyr 7B. Each recipe takes the form of a YAML file which contains all the parameters associated with a single training run. A `gpt2-nl` recipe is also given to illustrate how this handbook can be used for language or domain adaptation, e.g. by continuing to pretrain on a different language, and then SFT and DPO tuning the result.

We are also working on a series of guides to explain how methods like direct preference optimization (DPO) work, along with lessons learned from gathering human preferences in practice. To get started, we recommend the following:
@@ -53,6 +54,7 @@ The initial release of the handbook will focus on the following techniques:
* **Reward modeling:** teach language models to distinguish model responses according to human or AI preferences.
* **Rejection sampling:** a simple, but powerful technique to boost the performance of your SFT model.
* **Direct preference optimisation (DPO):** a powerful and promising alternative to PPO.
* **Odds ratio preference optimisation (ORPO):** a technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage (see the sketch below).
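
To make the single-stage idea concrete, below is a minimal sketch of the ORPO objective in PyTorch. It is an illustration only, not the handbook's implementation (training itself is handled by `scripts/run_orpo.py`), and it assumes you already have length-normalised sequence log-probabilities for the chosen and rejected responses, plus the usual token-level cross-entropy on the chosen response.

```python
# Minimal ORPO sketch: SFT cross-entropy on the chosen response plus an
# odds-ratio term that favours the chosen response over the rejected one.
import torch
import torch.nn.functional as F


def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    # log odds = log(p) - log(1 - p), computed from log-probs for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # penalise cases where the rejected response is more likely than the chosen one
    ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return chosen_nll + lam * ratio_term.mean()


# Toy usage with made-up numbers (sequence log-probs are negative):
loss = orpo_loss(
    chosen_logps=torch.tensor([-0.8, -1.1]),
    rejected_logps=torch.tensor([-1.6, -2.0]),
    chosen_nll=torch.tensor(0.9),
)
print(loss)
```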

## Installation instructions

scripts/README.md: 15 changes (10 additions, 5 deletions)
@@ -1,4 +1,3 @@

# Scripts to Train and Evaluate Chat Models

## Fine-tuning
@@ -25,7 +24,13 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_qlora.yaml --load_in_4bit=false
```

Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported: continued pretraining `cpt`, supervised finetuning `sft`, and direct preference optimisation `dpo`. Note that `cpt` is only present in the `gpt-nl` example recipe. {model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:
Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported:
* continued pretraining `cpt` (note that `cpt` is only present in the `gpt2-nl` example recipe)
* supervised finetuning `sft`
* direct preference optimisation `dpo`
* odds ratio preference optimisation `orpo`

`{model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:

```shell
# Step 1 - train SFT policy
@@ -85,14 +90,14 @@ dataset_splits:
- test_xxx # The test splits to mix
```
If you want to fine-tune on your datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, etc), requires a different format, we assume the datasets have the following columns:
If you want to fine-tune on your own datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, ORPO, etc.) requires a different format, we assume the datasets have the following columns:
**SFT**
* `messages`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}`.
* See [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) for an example.
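
As a toy illustration of that layout (the conversation content here is made up), a single SFT row looks like this:

```python
# Hypothetical SFT row: one training example is a list of chat turns that the
# chat template is applied to during tokenisation.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does the handbook cover?"},
        {"role": "assistant", "content": "Recipes for continued pretraining, SFT, DPO and ORPO."},
    ]
}
```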

**DPO**
**DPO and ORPO**

* `chosen`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the preferred dialogue.
* `rejected`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the dispreferred dialogue.
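
A toy preference row (again with made-up content) pairs the two dialogues like this:

```python
# Hypothetical DPO/ORPO row: `chosen` and `rejected` are full dialogues that
# share the same prompt and differ in the final assistant response.
preference_example = {
    "chosen": [
        {"role": "user", "content": "Summarise DPO in one sentence."},
        {"role": "assistant", "content": "DPO fine-tunes a model directly on preference pairs, without a separate reward model."},
    ],
    "rejected": [
        {"role": "user", "content": "Summarise DPO in one sentence."},
        {"role": "assistant", "content": "DPO is a hardware accelerator."},
    ],
}
```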
@@ -130,4 +135,4 @@ For both benchmarks, we have added support for the [Zephyr chat template](https:

Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the rankings exhibit various biases, including a preference for models distilled from GPTs. For that reason, we also recommend submitting your best models for human evaluation in:

* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
src/alignment/configs.py: 2 changes (1 addition, 1 deletion)
@@ -112,7 +112,7 @@ class ModelArguments:

base_model_revision: Optional[str] = field(
default=None,
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adatpers.")},
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adapters.")},
)
model_name_or_path: Optional[str] = field(
default=None,
