LLaMa 2 Fine-tuning data #2

kibitzing commented Jun 17, 2024

SFT data

  1. Started the SFT stage with publicly available instruction tuning data (Chung et al., 2022)
  2. Fewer but higher-quality examples > millions of low-quality examples

Setting aside millions of examples from third-party datasets in favor of fewer but higher-quality examples from our own vendor-based annotation efforts notably improved results.

We found that SFT annotations on the order of tens of thousands were enough to achieve a high-quality result (SFT annotation was stopped after 27,540 annotations).

  • Note that we do not include any Meta user data.

SFT data quality check

  • To validate our data quality, we manually examined a set of 180 examples, comparing the human-written annotations with samples generated by the model.

  • Sometimes, model output quality > human-handwritten output quality

Surprisingly, we found that the outputs sampled from the resulting SFT model were often competitive with SFT data handwritten by human annotators, suggesting that we could reprioritize and devote more annotation effort to preference-based annotation for RLHF.

kibitzing commented Jun 17, 2024

RLHF data

Reward modeling

  • We chose a binary comparison protocol over other schemes, mainly because it enables us to maximize the diversity of collected prompts
    • other strategies are worth considering, which we leave for future work
    • human annotators select which of two model outputs they prefer

Annotation procedure

Annotators...

  1. Write a prompt
  2. Choose between two sampled model responses based on the provided criteria
    • To maximize diversity, the two responses to a given prompt are sampled from two different model variants, with varying temperature hyper-parameters.
  3. Label the degree to which they prefer their chosen response over the alternative (4-point scale)
    • significantly better, better, slightly better, or negligibly better / unsure (this degree drives the margin in the loss sketch below)
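
These 4-point preference degrees feed into reward-model training: the Llama 2 paper trains the reward model with a binary ranking loss that can include a margin term growing with the preference strength. A minimal sketch in plain Python; the record fields and the concrete margin values are illustrative assumptions, not the paper's exact schema:

```python
# Sketch of one binary-comparison record and a margin-based ranking loss:
# L = -log(sigmoid(r_chosen - r_rejected - margin)), where the margin grows
# with the annotated preference degree. Field names and margin values are
# illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # the alternative response
    degree: str    # "significantly_better" | "better" | "slightly_better" | "negligibly_better"

# Hypothetical margins: a larger gap is enforced for stronger preferences.
MARGIN = {
    "significantly_better": 1.0,
    "better": 2 / 3,
    "slightly_better": 1 / 3,
    "negligibly_better": 0.0,
}

def ranking_loss(reward_chosen: float, reward_rejected: float, degree: str) -> float:
    """Binary ranking loss with a preference-dependent margin."""
    logit = reward_chosen - reward_rejected - MARGIN[degree]
    return -math.log(1.0 / (1.0 + math.exp(-logit)))  # -log(sigmoid(logit))

# Example: the reward model barely separates the pair, so a strong preference is penalized more.
pair = PreferencePair("How do I bake bread?", "detailed recipe ...", "no idea", "significantly_better")
print(ranking_loss(reward_chosen=0.4, reward_rejected=0.1, degree=pair.degree))
```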

Two criteria

  • helpfulness
    • Helpfulness refers to how well Llama 2-Chat responses fulfill users’ requests and provide requested information
  • safety
    • Safety refers to whether Llama 2-Chat’s responses are unsafe

In addition to the preference label, we also annotate "absolute safety", which falls into one of the following categories:

  1. the preferred response is safe and the other response is not (18%)
  2. both responses are safe (47%)
  3. both responses are unsafe (35%)
  4. the preferred response is unsafe and the other response is safe (0%)

We do not include any examples from category 4, as we believe safer responses will also be better/preferred by humans.
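
A tiny sketch of that filtering rule, with field names (chosen_is_safe, rejected_is_safe) assumed for illustration:

```python
# Keep categories 1-3; drop category 4 (preferred response unsafe, alternative safe).
def keep_pair(chosen_is_safe: bool, rejected_is_safe: bool) -> bool:
    return chosen_is_safe or not rejected_is_safe

pairs = [
    {"chosen_is_safe": True,  "rejected_is_safe": False},  # category 1 -> keep
    {"chosen_is_safe": True,  "rejected_is_safe": True},   # category 2 -> keep
    {"chosen_is_safe": False, "rejected_is_safe": False},  # category 3 -> keep
    {"chosen_is_safe": False, "rejected_is_safe": True},   # category 4 -> drop
]
filtered = [p for p in pairs if keep_pair(p["chosen_is_safe"], p["rejected_is_safe"])]
```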

Human annotation collection process

  • Human annotations were collected in weekly batches
  • As we collected more preference data, our reward models improved, and we were able to train progressively better versions for Llama 2-Chat.
    • more data -> better model -> better output -> better data
    • Llama 2-Chat improvement also shifted the model’s data distribution
  • It is important to gather new preference data using the latest Llama 2-Chat iterations before starting a new tuning iteration.
    • This step helps keep the reward model on-distribution and maintain an accurate reward for the latest model
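
A minimal sketch of this weekly collect-train-tune loop; all helper functions here are hypothetical placeholders, not Meta's pipeline:

```python
# Each round gathers fresh preference data from the *latest* chat model so the
# reward model stays on-distribution for the next tuning iteration.

def collect_preferences(chat_model, prompts):
    """Sample two responses per prompt from the latest model; annotators pick the better one."""
    return [{"prompt": p, "chosen": chat_model(p), "rejected": chat_model(p)} for p in prompts]

def train_reward_model(all_preference_data):
    """Fit a reward model on everything gathered so far (placeholder scorer)."""
    return lambda prompt, response: 0.0

def rlhf_finetune(chat_model, reward_model):
    """Tune the chat model against the current reward model (placeholder: returns it unchanged)."""
    return chat_model

chat_model = lambda prompt: f"response to: {prompt}"
preference_data = []
for weekly_prompts in [["p1", "p2"], ["p3", "p4"]]:
    # 1. gather a new batch with the latest model, keeping the reward model on-distribution
    preference_data += collect_preferences(chat_model, weekly_prompts)
    # 2. retrain the reward model on the growing dataset
    reward_model = train_reward_model(preference_data)
    # 3. produce the next, better chat model
    chat_model = rlhf_finetune(chat_model, reward_model)
```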

Comparisons with other datasets

(Screenshot: table comparing the Meta human preference data with other preference datasets)
  • We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines
  • Note that the number of tokens in prompts and answers differs depending on the text domain.
    • Summarization and online forum data generally have longer prompts
    • Dialogue-style prompts are usually shorter
  • The Meta preference data has more conversation turns and is longer on average.
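
As a rough illustration of how such per-dataset statistics (turns, tokens per example) could be computed, here is a small sketch; the whitespace tokenizer and the example record are assumptions, not the paper's tokenizer or data:

```python
from statistics import mean

def dataset_stats(examples, count_tokens=lambda text: len(text.split())):
    """examples: list of dicts with a 'turns' key holding the dialogue as a list of strings."""
    return {
        "avg_turns": mean(len(ex["turns"]) for ex in examples),
        "avg_tokens_per_example": mean(sum(count_tokens(t) for t in ex["turns"]) for ex in examples),
    }

meta_examples = [{"turns": ["How do I bake bread?", "Here is a simple recipe ...", "Thanks!", "You're welcome."]}]
print(dataset_stats(meta_examples))
```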

Data Composition

  1. Open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data.
  2. In our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking
  3. Experimented with different mixing recipes for both the Helpfulness and Safety reward models; the best settings so far (see the sketch after this list):
  • Helpfulness reward model:
    • All Meta Helpfulness data (50%)
    • Remaining data: uniformly sampled from Meta Safety and from the open-source datasets (50%)
    • Here is the original text for reference, since the wording is somewhat ambiguous.

      Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets.

  • Safety reward model:
    • All Meta Safety and Anthropic Harmless data (90%)
    • Meta Helpfulness and open-source helpfulness data (10%)
    • We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe. (Category 2 in terms of "absolute safety" above)
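
A minimal sketch of how these two mixtures might be assembled, following the notes' 50/50 reading of the ambiguous helpfulness wording; the dataset variables and helpers are illustrative placeholders, not Meta's actual pipeline:

```python
import random

def sample(data, n):
    """Uniformly sample n items (or all, if fewer are available)."""
    return random.sample(data, min(n, len(data)))

def helpfulness_mix(meta_help, meta_safety, open_source):
    # all Meta Helpfulness data ...
    mix = list(meta_help)
    # ... plus an equal amount drawn uniformly from Meta Safety and open-source data
    mix += sample(meta_safety + open_source, len(meta_help))
    return mix

def safety_mix(meta_safety, anthropic_harmless, meta_help, open_source_help):
    # 90% Meta Safety + Anthropic Harmless, 10% helpfulness data
    safety_part = meta_safety + anthropic_harmless
    help_budget = len(safety_part) // 9  # helpfulness ends up at ~10% of the total
    return safety_part + sample(meta_help + open_source_help, help_budget)
```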
