[CHAPTER] New chapter on supervised fine tuning based on smol course #777

Merged · 31 commits · Feb 17, 2025

Commits:
- `7bc134b`: initial copy from smol-course (burtenshaw, Jan 29, 2025)
- `995493b`: convert smol course material into nlp course style (burtenshaw, Jan 30, 2025)
- `beec8b5`: review text and read through (burtenshaw, Jan 30, 2025)
- `4cb5f93`: add links to colab (burtenshaw, Feb 5, 2025)
- `f7fc25d`: add quiz app (burtenshaw, Feb 5, 2025)
- `564d9ec`: add toc (burtenshaw, Feb 5, 2025)
- `edcf049`: format code blocks (burtenshaw, Feb 5, 2025)
- `267c171`: combine pages together and add extra guidance (burtenshaw, Feb 5, 2025)
- `a9847d0`: update toc and format snippets (burtenshaw, Feb 5, 2025)
- `82b1d4a`: update structure (burtenshaw, Feb 5, 2025)
- `549612b`: followinf readthrough: simplify and add more tips (burtenshaw, Feb 6, 2025)
- `881865e`: format code blocks (burtenshaw, Feb 6, 2025)
- `a386bbf`: suggestions in intro page (burtenshaw, Feb 11, 2025)
- `6cefbc9`: respond to suggestions on chat templates page (burtenshaw, Feb 11, 2025)
- `3b7cc5a`: Update chapters/en/chapter11/3.mdx (burtenshaw, Feb 11, 2025)
- `c6800f1`: Update chapters/en/chapter11/5.mdx (burtenshaw, Feb 11, 2025)
- `844d715`: Update chapters/en/chapter11/5.mdx (burtenshaw, Feb 11, 2025)
- `7d0519c`: Merge branch 'add-supervised-finetuning' of https://github.com/burten… (burtenshaw, Feb 11, 2025)
- `3f9815c`: respond to suggestions in SFT page (burtenshaw, Feb 11, 2025)
- `c47a5a5`: improve loss illustrations on sft page (burtenshaw, Feb 11, 2025)
- `d66fa86`: respond to feedback in chat template (burtenshaw, Feb 12, 2025)
- `21c8dd1`: respond to feedback on sft section (burtenshaw, Feb 12, 2025)
- `5d4025d`: respond to feedback on lora section (burtenshaw, Feb 12, 2025)
- `e0ecc8c`: respond to feedback in unit 5 (burtenshaw, Feb 12, 2025)
- `2c30171`: update toc with new tag and subtitle (burtenshaw, Feb 14, 2025)
- `cc7ddee`: improve intro congruency with previous chapters (burtenshaw, Feb 14, 2025)
- `f040b6c`: make chat templates more about structure (burtenshaw, Feb 14, 2025)
- `6d2a54c`: add packing and references to the sft section (burtenshaw, Feb 14, 2025)
- `3a2ee3c`: fix qlora mistake in lora page (burtenshaw, Feb 14, 2025)
- `a02b2d2`: add more benchmarks to evaluation section (burtenshaw, Feb 14, 2025)
- `c74ebd3`: add final quizzes to quiz section (burtenshaw, Feb 17, 2025)
chapters/en/chapter11/1.mdx (new file, 33 additions)
# Supervised Fine-Tuning

This chapter will introduce fine-tuning generative language models with supervised fine-tuning (SFT). SFT involves adapting pre-trained models to specific tasks by further training them on task-specific datasets. This process helps models improve their performance on targeted tasks. We will separate this chapter into four sections:

## 1️⃣ Chat Templates

Chat templates structure interactions between users and AI models, ensuring consistent and contextually appropriate responses. They include components like system prompts and role-based messages.

## 2️⃣ Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) is a critical process for adapting pre-trained language models to specific tasks. It involves training the model on a task-specific dataset with labeled examples. The SFT section of this chapter provides a detailed guide, including key steps and best practices.

## 3️⃣ Low Rank Adaptation (LoRA)

Low Rank Adaptation (LoRA) is a technique for fine-tuning language models by adding low-rank matrices to the model's layers. This allows for efficient fine-tuning while preserving the model's pre-trained knowledge.
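The core idea can be sketched in a few lines of NumPy (illustrative sizes and scaling; real fine-tuning would use a library such as 🤗 PEFT rather than this hand-rolled version):

```python
import numpy as np

d, r = 768, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))  # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank "down" projection
B = np.zeros((d, r))  # trainable "up" projection, zero-initialized

alpha = 16  # scaling factor
W_effective = W + (alpha / r) * (B @ A)  # weight used during fine-tuning

# B starts at zero, so the adapted model initially matches the base model,
# and only the 2 * r * d low-rank parameters are trained instead of d * d.
print(f"trainable fraction: {(A.size + B.size) / W.size:.2%}")  # → 2.08%
```

Since `B @ A` has rank at most `r`, the update is cheap to store and can be merged back into `W` after training, which is why LoRA preserves the model's pre-trained knowledge while staying efficient.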

## 4️⃣ Evaluation

Evaluation is a crucial step in the fine-tuning process. It allows us to measure the performance of the model on a task-specific dataset.

<Tip>
⚠️ In order to benefit from all features available with the Model Hub and 🤗 Transformers, we recommend <a href="https://huggingface.co/join">creating an account</a>.
</Tip>

## References

- [Transformers documentation on chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating)
- [Script for Supervised Fine-Tuning in TRL](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py)
- [`SFTTrainer` in TRL](https://huggingface.co/docs/trl/main/en/sft_trainer)
- [Direct Preference Optimization Paper](https://arxiv.org/abs/2305.18290)
- [Supervised Fine-Tuning with TRL](https://huggingface.co/docs/trl/main/en/tutorials/supervised_finetuning)
- [How to fine-tune Google Gemma with ChatML and Hugging Face TRL](https://www.philschmid.de/fine-tune-google-gemma)
- [Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format](https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format)
chapters/en/chapter11/10.mdx (new file, 56 additions)
# Implementing Evaluation

In this section, we will implement evaluation for our finetuned model. We can use `lighteval` to evaluate it on standard benchmarks; the library has a wide range of tasks built in. We just need to define the tasks we want to evaluate and the parameters for the evaluation.

LightEval tasks are defined using a specific format:

```
{suite}|{task}|{num_few_shot}|{auto_reduce}
```

| Parameter | Description |
|-----------|-------------|
| `suite` | The benchmark suite (e.g., 'mmlu', 'truthfulqa') |
| `task` | Specific task within the suite (e.g., 'abstract_algebra') |
| `num_few_shot` | Number of examples to include in prompt (0 for zero-shot) |
| `auto_reduce` | Whether to automatically reduce few-shot examples if prompt is too long (0 or 1) |

Example: `"mmlu|abstract_algebra|0|0"` evaluates on MMLU's abstract algebra task with zero-shot inference.
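When evaluating on many tasks, these strings can be assembled programmatically. A small sketch (the helper name below is ours, not part of LightEval):

```python
def lighteval_task(suite, task, num_few_shot=0, auto_reduce=0):
    """Build a task string of the form {suite}|{task}|{num_few_shot}|{auto_reduce}."""
    return f"{suite}|{task}|{num_few_shot}|{auto_reduce}"

tasks = [lighteval_task("mmlu", t) for t in ["anatomy", "professional_medicine"]]
print(tasks)  # ['mmlu|anatomy|0|0', 'mmlu|professional_medicine|0|0']
```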

## Example Evaluation Pipeline

Let's set up an evaluation pipeline for our finetuned model. We will evaluate the model on a set of subtasks that relate to the domain of medicine.

Here's a complete example of evaluating on automatic benchmarks relevant to one specific domain using Lighteval with the VLLM backend:

```bash
lighteval vllm \
"pretrained=your-model-name" \
"mmlu|anatomy|0|0" \
"mmlu|high_school_biology|0|0" \
"mmlu|high_school_chemistry|0|0" \
"mmlu|professional_medicine|0|0" \
--max_samples 40 \
--batch_size 1 \
--output_path "./results" \
--save_generations true
```

Results are displayed in a tabular format showing:

```
| Task |Version|Metric|Value | |Stderr|
|----------------------------------------|------:|------|-----:|---|-----:|
|all | |acc |0.3333|± |0.1169|
|leaderboard:mmlu:_average:5 | |acc |0.3400|± |0.1121|
|leaderboard:mmlu:anatomy:5 | 0|acc |0.4500|± |0.1141|
|leaderboard:mmlu:high_school_biology:5 | 0|acc |0.1500|± |0.0819|
```

Lighteval also includes a Python API for more detailed evaluation tasks, which is useful for manipulating the results in a more flexible way. Check out the [Lighteval documentation](https://huggingface.co/docs/lighteval/using-the-python-api) for more information.

<Tip>

✏️ **Try it out!** Evaluate your finetuned model on a specific task in lighteval.

</Tip>
chapters/en/chapter11/11.mdx (new file, 13 additions)
# Conclusion

In this chapter, we explored the essential components of fine-tuning language models:

1. **Chat Templates** provide structure to model interactions, ensuring consistent and appropriate responses through standardized formatting.

2. **Supervised Fine-Tuning (SFT)** allows adaptation of pre-trained models to specific tasks while maintaining their foundational knowledge.

3. **LoRA** offers an efficient approach to fine-tuning by reducing trainable parameters while preserving model performance.

4. **Evaluation** helps measure and validate the effectiveness of fine-tuning through various metrics and benchmarks.

These techniques, when combined, enable the creation of specialized language models that can excel at specific tasks while remaining computationally efficient. Whether you're building a customer service bot or a domain-specific assistant, understanding these concepts is crucial for successful model adaptation.
chapters/en/chapter11/2.mdx (new file, 66 additions)
# Chat Templates

Chat templates are essential for structuring interactions between language models and users. They provide a consistent format for conversations, ensuring that models understand the context and role of each message while maintaining appropriate response patterns.

## Base Models vs Instruct Models

A base model is trained on raw text data to predict the next token, while an instruct model is fine-tuned specifically to follow instructions and engage in conversations. For example, `SmolLM2-135M` is a base model, while `SmolLM2-135M-Instruct` is its instruction-tuned variant.

To make a base model behave like an instruct model, we need to format our prompts in a consistent way that the model can understand. This is where chat templates come in. ChatML is one such template format that structures conversations with clear role indicators (system, user, assistant).

It's important to note that a base model could be fine-tuned on different chat templates, so when we're using an instruct model we need to make sure we're using the correct chat template.

## Understanding Chat Templates

At their core, chat templates are structured string representations of conversations. They define how conversations should be formatted when communicating with a language model. They include system-level instructions, user messages, and assistant responses in a structured format that the model can understand. This structure helps maintain consistency across interactions and ensures the model responds appropriately to different types of inputs. Below is an example of a chat template:

```sh
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
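To make the structure concrete, here is a minimal, hand-rolled ChatML formatter (a simplified sketch; in practice the tokenizer's chat template does this for you, as described next):

```python
def format_chatml(messages, add_generation_prompt=False):
    """Render a list of {role, content} messages as a ChatML string."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn to cue the model to respond next
        text += "<|im_start|>assistant\n"
    return text

print(format_chatml([{"role": "user", "content": "Hi there!"}], add_generation_prompt=True))
```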

The `transformers` library will take care of chat templates for you in relation to the model's tokenizer. Read more about how transformers builds chat templates [here](https://huggingface.co/docs/transformers/en/chat_templating#how-do-i-use-chat-templates). All we have to do is structure our messages in the correct way and the tokenizer will take care of the rest. Here's a basic example of a conversation:

```python
messages = [
{"role": "system", "content": "You are a helpful assistant focused on technical topics."},
{"role": "user", "content": "Can you explain what a chat template is?"},
{"role": "assistant", "content": "A chat template structures conversations between users and AI models..."}
]
```

Let's break down the above example, and see how it maps to the chat template format.

## System Messages

System messages set the foundation for how the model should behave. They act as persistent instructions that influence all subsequent interactions. For example:

```python
system_message = {
"role": "system",
"content": "You are a professional customer service agent. Always be polite, clear, and helpful."
}
```

## Conversations

Chat templates can maintain context through conversation history, storing previous exchanges between users and the assistant. This allows for more coherent multi-turn conversations:

```python
conversation = [
{"role": "user", "content": "I need help with my order"},
{"role": "assistant", "content": "I'd be happy to help. Could you provide your order number?"},
{"role": "user", "content": "It's ORDER-123"},
]
```

<Tip>

✏️ **Try it out!** Create a chat template for a conversation between a user and an assistant. Then, use the `transformers` library to tokenize the conversation and see how the model responds. You won't need to download the model to do this, as the tokenizer will handle the formatting.

</Tip>
chapters/en/chapter11/3.mdx (new file, 83 additions)
# Implementation with Transformers

Now that we understand how chat templates work, let's see how we can implement them using the `transformers` library. The library provides built-in support for chat templates; we just need to use the `apply_chat_template()` method to format our messages.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to sort a list"},
]

# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
```

This will return a formatted string that can be passed to the model. For the `SmolLM2-135M-Instruct` tokenizer loaded above, and because we passed `add_generation_prompt=True`, the string ends with an opening assistant tag that cues the model to respond:

```sh
<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to sort a list<|im_end|>
<|im_start|>assistant
```

Note that the `im_start` and `im_end` tokens are used to indicate the start and end of a message. The tokenizer will also have corresponding special tokens for the start and end of messages. For a refresher on how these tokens work, see the [Tokenizers](../chapter2/5.mdx) section.

Chat templates can handle multi-turn conversations while maintaining context:

```python
messages = [
{"role": "system", "content": "You are a math tutor."},
{"role": "user", "content": "What is calculus?"},
{"role": "assistant", "content": "Calculus is a branch of mathematics..."},
{"role": "user", "content": "Can you give me an example?"},
]
```

## Working with Chat Templates

When working with chat templates, you have several options for processing the conversation:

1. Apply the template without tokenization to return the raw formatted string
2. Apply the template with tokenization to return the token IDs
3. Add a generation prompt to prepare for model inference

The tokenizer's `apply_chat_template()` method handles all these cases through its parameters:

- `tokenize`: Whether to return token IDs (True) or the formatted string (False)
- `add_generation_prompt`: Whether to add a prompt for the model to generate a response

<Tip>

✏️ **Try it out!** Take a dataset from the Hugging Face hub and process it for Supervised Fine-Tuning (SFT). Convert the `HuggingFaceTB/smoltalk` dataset into chatml format and save it to a new file.

For this exercise, you'll need to:
1. Load the dataset using the Hugging Face datasets library
2. Create a processing function that converts the samples into the correct chat format
3. Apply the chat template using the tokenizer's methods

</Tip>
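One possible shape for step 2, sketched without downloading anything (we assume each sample carries a `messages` list of role/content dicts; in step 3 you would swap the hand-rolled formatting below for `tokenizer.apply_chat_template`):

```python
def to_chatml(sample):
    """Convert one dataset sample with a 'messages' list into a ChatML string."""
    text = ""
    for message in sample["messages"]:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    return {"text": text}

# With 🤗 Datasets you would apply this with: dataset.map(to_chatml)
sample = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I help?"},
    ]
}
print(to_chatml(sample)["text"])
```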

## Conclusion

Chat templates are a crucial component for working with language models, especially when fine-tuning or deploying models for chat applications. They provide structure and consistency to conversations, making it easier for models to understand context and generate appropriate responses.

Understanding how to work with chat templates is essential for:
- Converting datasets for fine-tuning
- Preparing inputs for model inference
- Maintaining conversation context
- Ensuring consistent model behavior

## Resources

- [Hugging Face Chat Templating Guide](https://huggingface.co/docs/transformers/main/en/chat_templating)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Chat Templates Examples Repository](https://github.com/chujiezheng/chat_templates)
chapters/en/chapter11/4.mdx (new file, 125 additions)
# Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) is a critical process for adapting pre-trained language models to specific tasks or domains. While pre-trained models have impressive general capabilities, they often need to be customized to excel at particular use cases. SFT bridges this gap by further training the model on relevant datasets with human-validated examples.

Because of the supervised structure of the task, the model can learn to generate structured outputs, such as the chat templates we created in the previous sections.

## Understanding Supervised Fine-Tuning

Supervised fine-tuning is about teaching a pre-trained model to perform specific tasks, and use specific output structures, through examples of labeled tokens. The process involves showing the model many examples of the desired input-output behavior, allowing it to learn the patterns specific to your use case.

SFT is effective because it uses the foundational knowledge acquired during pre-training while adapting the model's behavior to match your specific needs.
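Concretely, "examples of labeled tokens" usually means that the loss is computed only on the tokens we want the model to learn to produce (the assistant's reply), with prompt positions masked out using the label value `-100`, which PyTorch's cross-entropy loss ignores. A toy illustration with made-up token ids:

```python
prompt_ids = [101, 2023, 2003]      # hypothetical tokenized user prompt
completion_ids = [1037, 2742, 102]  # hypothetical tokenized assistant reply

input_ids = prompt_ids + completion_ids
# Mask the prompt so the loss only rewards reproducing the completion:
labels = [-100] * len(prompt_ids) + completion_ids
print(labels)  # [-100, -100, -100, 1037, 2742, 102]
```

Libraries like TRL handle this masking for you, but keeping the mechanism in mind explains why the quality of the labeled completions matters so much.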

## When to Use Supervised Fine-Tuning

The decision to use SFT often comes down to the gap between your model's current capabilities and your specific requirements. SFT becomes particularly valuable when you need precise control over the model's outputs or when working in specialized domains.

Two core reasons to use SFT are:

1. **Template Control**: SFT allows you to control the output structure of the model, ensuring that it generates outputs in a specific format. For example, you might need the model to consistently follow a particular chat template when generating structured outputs.

2. **Domain-Specific Requirements**: SFT is effective when you need precise control over the model's outputs in specialized domains. For example, if you're developing a customer service application, you might want your model to consistently follow company guidelines and handle technical queries in a standardized way. SFT can help align the model's responses with professional standards and domain expertise.

## Quiz

### 1. What is the primary purpose of Supervised Fine-Tuning (SFT)?

<Question
choices={[
{
text: "To train a language model from scratch",
explain: "SFT builds upon pre-trained models rather than training from scratch."
},
{
text: "To adapt a pre-trained model to specific tasks or domains while maintaining its foundational knowledge",
explain: "Correct! SFT allows models to learn specific tasks while leveraging their pre-trained capabilities.",
correct: true
},
{
text: "To compress a large language model into a smaller one",
explain: "This is more related to model distillation, not SFT."
}
]}
/>

### 2. Which of the following are valid reasons to use SFT?

<Question
choices={[
{
text: "Template Control - ensuring the model generates outputs in a specific format",
explain: "Yes! SFT helps enforce specific output structures through training examples.",
correct: true
},
{
text: "Domain Adaptation - teaching the model domain-specific knowledge and terminology",
explain: "Correct! SFT is excellent for adapting models to specialized domains.",
correct: true
},
{
text: "Model Architecture Changes - modifying the underlying structure of the model",
explain: "SFT doesn't change the model architecture, it only updates the weights."
}
]}
/>

### 3. What is required for effective Supervised Fine-Tuning?

<Question
choices={[
{
text: "A pre-trained language model",
explain: "Yes! SFT starts with a pre-trained model as its foundation.",
correct: true
},
{
text: "Validated examples of desired input-output behavior",
explain: "Correct! Quality training data is crucial for successful SFT.",
correct: true
},
{
text: "A high performing reference model",
explain: "A reference model is needed for preference-optimization methods like DPO, not for SFT."
}
]}
/>

### 4. How does SFT relate to chat templates?

<Question
choices={[
{
text: "SFT can train models to consistently follow specific chat templates",
explain: "Correct! SFT helps models learn to generate responses in the desired template format.",
correct: true
},
{
text: "Chat templates are not compatible with SFT",
explain: "Incorrect! Chat templates are commonly used with SFT for structured outputs."
},
{
text: "SFT automatically creates chat templates",
explain: "SFT doesn't create templates, it trains models to use existing templates."
}
]}
/>

### 5. What distinguishes SFT from pre-training?

<Question
choices={[
{
text: "SFT uses labeled data for specific tasks",
explain: "Yes! SFT requires examples of desired behavior for specific tasks.",
correct: true
},
{
text: "SFT is faster than pre-training",
explain: "The speed difference isn't a defining characteristic; it depends on various factors."
},
{
text: "SFT requires more data than pre-training",
explain: "Actually, SFT typically uses less data than pre-training, focusing on task-specific examples."
}
]}
/>