diff --git a/chapters/en/chapter12/3.mdx b/chapters/en/chapter12/3.mdx
index ecdbce52a..cfa1fb957 100644
--- a/chapters/en/chapter12/3.mdx
+++ b/chapters/en/chapter12/3.mdx
@@ -11,7 +11,7 @@ In the next chapter, we will build on this knowledge and implement GRPO in pract
The initial goal of the paper was to explore whether pure reinforcement learning could develop reasoning capabilities without supervised fine-tuning.
-Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in [chapter 11](/chapters/en/chapter11/1).
+Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in Chapter 11.
## The Breakthrough 'Aha' Moment
@@ -171,27 +171,27 @@ Now that we understand the key components of GRPO, let's look at the algorithm i
```
Input:
-- initial_policy: Starting model to be trained
+- current_policy: The model to be trained
- reward_function: Function that evaluates outputs
- training_prompts: Set of training examples
- group_size: Number of outputs per prompt (typically 4-16)
Algorithm GRPO:
1. For each training iteration:
- a. Set reference_policy = initial_policy (snapshot current policy)
+ a. Set reference_policy = current_policy (snapshot BEFORE updates)
b. For each prompt in batch:
- i. Generate group_size different outputs using initial_policy
+ i. Generate group_size different outputs using reference_policy
ii. Compute rewards for each output using reward_function
iii. Normalize rewards within group:
normalized_advantage = (reward - mean(rewards)) / std(rewards)
- iv. Update policy by maximizing the clipped ratio:
+ iv. Update current_policy by maximizing:
min(prob_ratio * normalized_advantage,
- clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
- - kl_weight * KL(initial_policy || reference_policy)
+ clip(prob_ratio, 1-ε, 1+ε) * normalized_advantage)
+ - β * KL(current_policy || reference_policy)
- where prob_ratio is current_prob / reference_prob
+ where prob_ratio is current_policy_prob / reference_policy_prob, ε is the clipping range, and β is the KL weight
-Output: Optimized policy model
+Output: Optimized current_policy model
```
This algorithm shows how GRPO combines group-based advantage estimation with policy optimization while maintaining stability through clipping and KL divergence constraints.
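+
+To make the update step concrete, here is a minimal PyTorch-style sketch of the loss for a single prompt's group of sampled outputs, treating each output's total log-probability as one scalar for simplicity. It follows the pseudocode above, but the function name, tensor names (`current_logprobs`, `reference_logprobs`, `rewards`), and default hyperparameters are illustrative assumptions rather than a reference implementation.
+
+```python
+import torch
+
+def grpo_group_loss(current_logprobs, reference_logprobs, rewards, epsilon=0.2, beta=0.04):
+    """Simplified GRPO loss for one group of sampled outputs.
+
+    current_logprobs:   log-probs of the sampled outputs under the policy being updated
+    reference_logprobs: log-probs of the same outputs under the frozen snapshot
+    rewards:            one scalar reward per output in the group
+    """
+    # Step iii: normalize rewards within the group to get advantages
+    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
+
+    # prob_ratio = current_policy_prob / reference_policy_prob
+    prob_ratio = torch.exp(current_logprobs - reference_logprobs)
+
+    # Step iv: clipped surrogate objective (take the pessimistic minimum)
+    unclipped = prob_ratio * advantages
+    clipped = torch.clamp(prob_ratio, 1 - epsilon, 1 + epsilon) * advantages
+    surrogate = torch.min(unclipped, clipped)
+
+    # Per-sample KL penalty keeping current_policy close to reference_policy,
+    # using the ratio - log(ratio) - 1 form of the estimator
+    log_ratio = reference_logprobs - current_logprobs
+    kl = torch.exp(log_ratio) - log_ratio - 1
+
+    # We maximize the objective, so the loss is its negation
+    return -(surrogate - beta * kl).mean()
+```
+
+In a full training loop, you would sample `group_size` completions per prompt from the frozen snapshot, score them with the reward function, compute log-probabilities under both models, and backpropagate this loss to update `current_policy`.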
@@ -235,15 +235,15 @@ In the next section, we'll explore practical implementations of these concepts,