LLaMa 2 Fine-tuning data #2

kibitzing commented Jun 17, 2024

SFT data

  1. Started the SFT stage with publicly available instruction tuning data (Chung et al., 2022)
  2. Fewer but higher-quality examples > millions of low-quality examples

Setting aside millions of examples from third-party datasets in favor of fewer but higher-quality examples from our own vendor-based annotation efforts notably improved results.

We found that SFT annotations on the order of tens of thousands were enough to achieve a high-quality result (SFT annotation was stopped after 27,540 annotations).

  • Note that we do not include any Meta user data.

SFT data quality check

  • To validate our data quality, we manually examined a set of 180 examples, comparing the human-written annotations with samples generated by the model.

  • Sometimes, model output quality > human-handwritten output quality

Surprisingly, we found that the outputs sampled from the resulting SFT model were often competitive with SFT data handwritten by human annotators, suggesting that we could reprioritize and devote more annotation effort to preference-based annotation for RLHF.

kibitzing commented Jun 17, 2024

RLHF data

Reward modeling

  • We chose a binary comparison protocol over other schemes, mainly because it enables us to maximize the diversity of collected prompts
    • other strategies are worth considering, which we leave for future work
    • human annotators select which of two model outputs they prefer

Annotation procedure

Annotators...

  1. Write a prompt
  2. Choose between two sampled model responses based on the provided criteria
    • To maximize diversity, the two responses to a given prompt are sampled from two different model variants, with varying temperature hyper-parameters.
  3. Label the degree to which they prefer their chosen response over the alternative (4-point scale)
    • significantly better, better, slightly better, or negligibly better / unsure (this degree drives the margin in the loss sketch below)
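
These 4-point preference degrees feed into reward-model training: the Llama 2 paper trains the reward model with a binary ranking loss that can include a margin term growing with the preference strength. A minimal sketch in plain Python; the record fields and the concrete margin values are illustrative assumptions, not the paper's exact schema:

```python
# Sketch of one binary-comparison record and a margin-based ranking loss:
# L = -log(sigmoid(r_chosen - r_rejected - margin)), where the margin grows
# with the annotated preference degree. Field names and margin values are
# illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # the alternative response
    degree: str    # "significantly_better" | "better" | "slightly_better" | "negligibly_better"

# Hypothetical margins: a larger gap is enforced for stronger preferences.
MARGIN = {
    "significantly_better": 1.0,
    "better": 2 / 3,
    "slightly_better": 1 / 3,
    "negligibly_better": 0.0,
}

def ranking_loss(reward_chosen: float, reward_rejected: float, degree: str) -> float:
    """Binary ranking loss with a preference-dependent margin."""
    logit = reward_chosen - reward_rejected - MARGIN[degree]
    return -math.log(1.0 / (1.0 + math.exp(-logit)))  # -log(sigmoid(logit))

# Example: the reward model barely separates the pair, so a strong preference is penalized more.
pair = PreferencePair("How do I bake bread?", "detailed recipe ...", "no idea", "significantly_better")
print(ranking_loss(reward_chosen=0.4, reward_rejected=0.1, degree=pair.degree))
```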

Two criteria

  • helpfulness
    • Helpfulness refers to how well Llama 2-Chat responses fulfill users’ requests and provide requested information
  • safety
    • Safety refers to whether Llama 2-Chat’s responses are unsafe

In addition to the preference label, we also annotate "absolute safety", which falls into one of the following categories:

  1. the preferred response is safe and the other response is not (18%)
  2. both responses are safe (47%)
  3. both responses are unsafe (35%)
  4. the preferred response is unsafe and the other response is safe (0%)

We do not include any examples from category 4, as we believe safer responses will also be better/preferred by humans.
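
A tiny sketch of that filtering rule, with field names (chosen_is_safe, rejected_is_safe) assumed for illustration:

```python
# Keep categories 1-3; drop category 4 (preferred response unsafe, alternative safe).
def keep_pair(chosen_is_safe: bool, rejected_is_safe: bool) -> bool:
    return chosen_is_safe or not rejected_is_safe

pairs = [
    {"chosen_is_safe": True,  "rejected_is_safe": False},  # category 1 -> keep
    {"chosen_is_safe": True,  "rejected_is_safe": True},   # category 2 -> keep
    {"chosen_is_safe": False, "rejected_is_safe": False},  # category 3 -> keep
    {"chosen_is_safe": False, "rejected_is_safe": True},   # category 4 -> drop
]
filtered = [p for p in pairs if keep_pair(p["chosen_is_safe"], p["rejected_is_safe"])]
```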

Human annotation collection process

  • Human annotations were collected in weekly batches
  • As we collected more preference data, our reward models improved, and we were able to train progressively better versions for Llama 2-Chat.
    • more data -> better model -> better output -> better data
    • Llama 2-Chat improvement also shifted the model’s data distribution
  • It is important to gather new preference data using the latest Llama 2-Chat iterations before starting a new tuning iteration.
    • This step helps keep the reward model on-distribution and maintain an accurate reward for the latest model
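
A minimal sketch of this weekly collect-train-tune loop; all helper functions here are hypothetical placeholders, not Meta's pipeline:

```python
# Each round gathers fresh preference data from the *latest* chat model so the
# reward model stays on-distribution for the next tuning iteration.

def collect_preferences(chat_model, prompts):
    """Sample two responses per prompt from the latest model; annotators pick the better one."""
    return [{"prompt": p, "chosen": chat_model(p), "rejected": chat_model(p)} for p in prompts]

def train_reward_model(all_preference_data):
    """Fit a reward model on everything gathered so far (placeholder scorer)."""
    return lambda prompt, response: 0.0

def rlhf_finetune(chat_model, reward_model):
    """Tune the chat model against the current reward model (placeholder: returns it unchanged)."""
    return chat_model

chat_model = lambda prompt: f"response to: {prompt}"
preference_data = []
for weekly_prompts in [["p1", "p2"], ["p3", "p4"]]:
    # 1. gather a new batch with the latest model, keeping the reward model on-distribution
    preference_data += collect_preferences(chat_model, weekly_prompts)
    # 2. retrain the reward model on the growing dataset
    reward_model = train_reward_model(preference_data)
    # 3. produce the next, better chat model
    chat_model = rlhf_finetune(chat_model, reward_model)
```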

Comparisons with other datasets

(Screenshot: table comparing the Meta human preference data with other preference datasets)
  • We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines
  • Note that the number of tokens in prompts and answers differs depending on the text domain.
    • Summarization and online forum data generally have longer prompts
    • Dialogue-style prompts are usually shorter
  • The Meta preference data has more conversation turns and is longer on average.
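
As a rough illustration of how such per-dataset statistics (turns, tokens per example) could be computed, here is a small sketch; the whitespace tokenizer and the example record are assumptions, not the paper's tokenizer or data:

```python
from statistics import mean

def dataset_stats(examples, count_tokens=lambda text: len(text.split())):
    """examples: list of dicts with a 'turns' key holding the dialogue as a list of strings."""
    return {
        "avg_turns": mean(len(ex["turns"]) for ex in examples),
        "avg_tokens_per_example": mean(sum(count_tokens(t) for t in ex["turns"]) for ex in examples),
    }

meta_examples = [{"turns": ["How do I bake bread?", "Here is a simple recipe ...", "Thanks!", "You're welcome."]}]
print(dataset_stats(meta_examples))
```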

Data Composition

  1. Open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data.
  2. In our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking
  3. Experimented with different mixing recipes for both the Helpfulness and Safety reward models; the best settings so far (see the sketch after this list):
  • Helpfulness reward model:
    • All Meta Helpfulness data (50%)
    • Remaining data: uniformly sampled from Meta Safety and from the open-source datasets (50%)
    • Here is the original text for reference, since the wording is somewhat ambiguous.

      Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets.

  • Safety reward model:
    • All Meta Safety and Anthropic Harmless data (90%)
    • Meta Helpfulness and open-source helpfulness data (10%)
    • We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe. (Category 2 in terms of "absolute safety" above)
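
A minimal sketch of how these two mixtures might be assembled, following the notes' 50/50 reading of the ambiguous helpfulness wording; the dataset variables and helpers are illustrative placeholders, not Meta's actual pipeline:

```python
import random

def sample(data, n):
    """Uniformly sample n items (or all, if fewer are available)."""
    return random.sample(data, min(n, len(data)))

def helpfulness_mix(meta_help, meta_safety, open_source):
    # all Meta Helpfulness data ...
    mix = list(meta_help)
    # ... plus an equal amount drawn uniformly from Meta Safety and open-source data
    mix += sample(meta_safety + open_source, len(meta_help))
    return mix

def safety_mix(meta_safety, anthropic_harmless, meta_help, open_source_help):
    # 90% Meta Safety + Anthropic Harmless, 10% helpfulness data
    safety_part = meta_safety + anthropic_harmless
    help_budget = len(safety_part) // 9  # helpfulness ends up at ~10% of the total
    return safety_part + sample(meta_help + open_source_help, help_budget)
```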
