Zichen Liu*β , Changyu Chen*, Wenjun Li*, Penghui Qi*
Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
*Core Contributors, β Project Lead
π Updates β’ π Links β’ π TL;DR
- 21/03/2025: π We release our paper, models and codebase. Our R1-Zero training is implemented with πΎ Oat, a highly modular, research-friendly and efficient LLM RL framework.
-
Understanding R1-Zero-Like Training
-
There May Not Be Aha Moment in R1-Zero-like Training β A Pilot Study
-
OAT: A research-friendly framework for LLM online alignment
- π» Codebase
To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.
- DeepSeek-V3-Base already exhibit "Aha moment".
- As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates: the average benchmark scores improve by ~60% (compared to the traditional 4-shot prompting)!
- GRPO leads to biased optimization! We propose a simple fix that improves token efficiency while maintaining reasoning performance, termed as Dr. GRPO (GRPO Done Right).
- In R1-Zero-like training, the template and the question set perform a duet to affect the RL dynamics
- (Left Plot) For Qwen2.5-Math-1.5B, a mismatched template (e.g., R1 template) in fact destructs the reasoning capabilities before RL reconstructing it. This makes the improvement impressive on the surface.
- (Middle Plot) However, if a template does not deviate from the pretraining distribution too far, even a small and completely o.o.d. question set (e.g., GSM8K) could induce the reasoning ability equally well, by reinforcing correct reasoning behaviors instead of infusing new knowledge.
- Beyond Qwen, Llama can also be RL-tuned from base models. In this case, domain-specific pretraining will improves RL ceiling.
- (Right Plot) GRPO can even make Llama with math knowledge "Aha" by increasing the output length; however, it is likely due to its length bias, which can be removed by Dr. GRPO.
Our analysis suggests a minimalist recipe for R1-Zero-like training:
We RL-tune Qwen2.5- Math-7B using the (unbiased) Dr. GRPO algorithm on MATH level 3-5 questions with the Qwen-Math template, and achieve state-of-the-art performance with only 27 hours compute on 8Γ A100 GPUs.
If you are interested in more details, please check out our paper!
We recommend a clean python==3.10
environment for development.
# Install vllm & oat, the LLM RL framework we developed r1-zero training on.
pip install vllm==0.7.2 && pip install oat-llm==0.0.9
# Install this package locally to use the math grader.
git clone [email protected]:sail-sg/understand-r1-zero.git && cd understand-r1-zero
pip install -e .
We implement R1-Zero training by extending Oat's Learner and Actor components. Please see train_zero_math.py for a step-by-step guide.
# Patch LD_LIBRARY_PATH to avoid dependency errors:
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH
# Run the experiment (tested on 8 x A100-40G) with Dr. GRPO:
# (change to `--critic_type grpo` for running GRPO)
python train_zero_math.py \
--critic_type drgrpo \
--gpus 8 \
--enable_prefix_caching \
--collocate \
--vllm_sleep \
--vllm_gpu_ratio 0.35 \
--gradient-checkpointing \
--flash-attn \
--bf16 \
--rnd-seed \
--learning_rate 0.000001 \
--lr_scheduler constant \
--num_ppo_epochs 1 \
--beta 0 \
--oracle_type reward \
--oracle math \
--pretrain Qwen/Qwen2.5-Math-1.5B \
--prompt_template r1 \
--zero-stage 2 \
--ref_offload \
--prompt_data ./datasets/train/math_12k \
--train_split train \
--input_key problem \
--output_key answer \
--max-train 9999999 \
--num_prompt_epoch 20 \
--prompt_max_length 1024 \
--num_samples 8 \
--temperature 1 \
--top_p 1 \
--generate_max_length 3000 \
--save_steps -1 \
--train_batch_size 128 \
--train_batch_size_per_device 1 \
--mini_train_batch_size_per_device 1 \
--rollout_batch_size 128 \
--rollout_batch_size_per_device 16 \
--pi_buffer_maxlen_per_device 128 \
--eval_batch_size 200 \
--eval_steps 16 \
--eval_temperature 0 \
--eval_generate_max_length 3000 \
--eval_data ./datasets/evaluation_suite \
--eval_input_key input \
--use-wb \
--wb-run-name qwen2.5-Math-1.5b-r1-zero \
--wb_project oat-zero
Please see here for more example scripts.
# Evaluate our models:
python evaluate_model.py --model_name sail/Qwen2.5-Math-7B-Oat-Zero
python evaluate_model.py --model_name sail/Qwen2.5-Math-1.5B-Oat-Zero
python evaluate_model.py --model_name sail/Llama-3.2-3B-Oat-Zero --template r1
# Evaluate baseline models:
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-1.5B
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-7B
python evaluate_model.py --model_name hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero
python evaluate_model.py --model_name PRIME-RL/Eurus-2-7B-PRIME-Zero
python evaluate_model.py --model_name Open-Reasoner-Zero/Open-Reasoner-Zero-7B
We provide a script to serve DeepSeek-V3-Base and DeepSeek-R1-Zero on k8s cluster.
# prerequisites:
# 1. download the model weights
# 2. starting a k8s job with sglang docker image "lmsysorg/sglang:v0.4.3.post2-cu125"
# start the server:
bash deploy_dpsk/serving.sh <model_name> <num_nodes>
Example of API call:
from openai import OpenAI
# MASTER_ADDR is the environment variable set by the k8s job
api_base = "http://{MASTER_ADDR}:30000/v1"
api_key = "EMPTY"
client = OpenAI(
api_key=api_key,
base_url=api_base,
)
# send requests to the server ...
Notes:
- Your k8s container should have environment variable
MASTER_ADDR
andMASTER_PORT
set. - Hardware requirements:
2 x 8 x H100/800/20
for FP8 and4 x 8 x A100/A800
for BF16. - Please refer to sglang's official tutorial for more details.
If you find our works useful for your research, please consider citing:
-
This paper:
@article{liu2025understanding, title={Understanding R1-Zero-Like Training: A Critical Perspective}, author={Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin}, journal={arXiv preprint arXiv:2503.20783}, year={2025} }
-
Our blog that conducted the first investigation on the "Aha moment":
@misc{liu2025there, title={There May Not be Aha Moment in R1-Zero-like Training β A Pilot Study}, author={Zichen Liu and Changyu Chen and Wenjun Li and Tianyu Pang and Chao Du and Min Lin}, year={2025}, howpublished={\url{https://oatllm.notion.site/oat-zero}}, note={Notion Blog}, }
-
The training framework:
@misc{liu2025oat, title={OAT: A research-friendly framework for LLM online alignment}, author={Zichen Liu and Changyu Chen and Chao Du and Wee Sun Lee and Min Lin}, year={2025} howpublished={\url{https://github.com/sail-sg/oat}}, }
- This work is supported by Sea AI Lab for computing resources.
- The training codes are built on Oat, which employs vLLM, DeepSpeed and launchpad. We serve DeepSeek models using SGLang.
- The base models are from Qwen2.5-Math, Llama, and DeepSeek.
- We thank Qingfeng Lan for his time in thoroughly reviewing our code.