Code-R1: Reproducing R1 for Code with Reliable Rewards

This repository includes implementations to reproduce the R1 pipeline for code generation:

  • Result: 2K Code-R1 samples + Qwen2.5-7B-Instruct-1M beats Qwen2.5-Coder-7B-Instruct (even better w/ 12K samples).
  • Finding: Quality of rewards matters. False positives in datasets and execution confuse the model.
  • Implementation: A reliable, scalable, and sandboxed pipeline to minimize reward false positives in datasets and execution.

More results and findings to come...

Setup

Environment

# For training
pip install -e .
pip install vllm==0.7.3
pip install flash-attn --no-build-isolation
pip install wandb IPython matplotlib gpustat # utility

Sandboxing

I tried multiple approaches to sandboxing, including calling code execution servers, running dockerized Python, and calling paid services. firejail is the approach I found that meets all three of the following:

  1. Reliability -- false positives arise when "autograders" hit concurrency issues (timeouts), violate OS limits (OSError), etc.
  2. Scalability -- e.g., dockerized Python runs are generally reliable but too slow (e.g., 20 samples/s on 192 cores).
  3. Security -- ... otherwise the school IT will email you and shut down your server...
sudo add-apt-repository ppa:deki/firejail
sudo apt-get update
sudo apt-get install firejail firejail-profiles
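
As a rough illustration of how such a sandboxed execution call can look (a minimal sketch only, not the exact wrapper or flags used in this repo), one can run generated code under firejail with networking disabled, a throwaway private home directory, and a caller-side timeout:

# Illustration only -- hypothetical helper, not this repo's actual implementation.
import subprocess
import tempfile

def run_sandboxed(code: str, stdin: str = "", timeout: float = 10.0):
    # Write the untrusted program to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    cmd = [
        "firejail", "--quiet",   # suppress firejail banners in stdout/stderr
        "--net=none",            # no network access
        "--private",             # throwaway private home directory
        "python3", path,
    ]
    try:
        proc = subprocess.run(cmd, input=stdin, capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "", "TIMEOUT"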

Datasets

The current version has 12K RL samples (prompt + tests) at 🤗 ganler/code-r1-12k:

  • 2K LeetCode samples, where the tests are generally reliable
  • 10K verified samples filtered from 26K TACO samples

In general, it's suggested to test the data and the sandbox for every dataset and environment before training code RL. Directly using noisy data and mismatched environments can lead to reward false positives, confusing the model. This noise can come from (i) wrong tests, (ii) unsolvable prompts (e.g., image tags), and (iii) execution environment mismatches.

To produce locally validated RL data:

python examples/data_preprocess/coder1.py
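
Conceptually, the validation step boils down to checking each (prompt, tests) sample against the same sandbox that later computes rewards. Below is a minimal sketch reusing the run_sandboxed helper above and assuming a hypothetical record layout with a reference_solution field and stdin/expected-output test pairs; this is not the actual schema of ganler/code-r1-12k or of coder1.py:

# Illustration only -- field names are hypothetical.
def is_valid_sample(sample: dict, timeout: float = 10.0) -> bool:
    # Keep a sample only if its reference solution passes every test inside
    # the sandbox, so the data matches the execution environment.
    solution = sample["reference_solution"]
    for test in sample["tests"]:  # each test: {"input": ..., "expected": ...}
        rc, out, _ = run_sandboxed(solution, stdin=test["input"], timeout=timeout)
        if rc != 0 or out.strip() != test["expected"].strip():
            return False  # wrong test, unsolvable prompt, or env mismatch
    return True

# validated = [s for s in raw_samples if is_valid_sample(s)]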

Run!

bash main_grpo.sh

Note

The script was optimized for a single-node 8x H200 setup. You might need to customize the settings for your own workstation.
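
During training, the reward for each rollout comes from executing the generated code against the sample's tests in the sandbox. The idea reduces to a strict pass/fail signal, roughly as in the sketch below (a generic illustration reusing the run_sandboxed helper above, not verl's or this repo's exact reward interface):

# Illustration only -- not the repo's actual reward implementation.
import re

def code_reward(completion: str, tests: list, timeout: float = 10.0) -> float:
    # Extract the first fenced Python code block from the model completion.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return 0.0  # no runnable code block found
    code = match.group(1)
    for test in tests:
        rc, out, _ = run_sandboxed(code, stdin=test["input"], timeout=timeout)
        if rc != 0 or out.strip() != test["expected"].strip():
            return 0.0  # any failure yields zero reward (avoid false positives)
    return 1.0  # all tests pass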

Code-R1 Zero based on 7B models

We trained two models based on Qwen2.5-7B-Instruct-1M with pure R1-Zero-style RL:

Model                                LCB (v5)   HumanEval+   MBPP+   Average
Qwen2.5-7B-Instruct-1M                 24.0        80.5       66.7     57.1
+ Code-R1-Zero (2k - 1088s GRPO)       28.6        84.8       70.1     61.2
+ Code-R1-Zero (12k - 832s GRPO)       29.7        83.5       74.3   🌟62.5
  • 2K LeetCode training samples already show promising results without any additional SFT or distillation.
  • Extending to 12K samples (10K more verified data from TACO) further improves performance.

Some existing models:

Model                        LCB (v5)   HumanEval+   MBPP+   Average
Qwen2.5-Coder-7B-Instruct      31.1        82.3       69.6     61.0
Eurus-2-7B-PRIME               23.8        65.9       29.9     39.9
Sky-T1-7B                      21.3        54.3       50.8     42.1
  • Qwen2.5-Coder-7B-Instruct, despite being released months ago, is still the strongest baseline, but we don't know where its improvement comes from.
  • Eurus-2-7B-PRIME starts from Qwen2.5-Math-7B-Instruct and is RL-only. Its training data includes extensive (unfiltered) coding datasets, including APPS, CodeContests, TACO, and Codeforces. Code-R1-Zero outperforms it significantly despite using less data, likely because we use validated datasets and sandboxes.
  • Sky-T1-7B uses a combination of RL and SFT/distillation steps. Its RL stage partially uses PRIME, but its training data does not seem to include coding datasets.

Citation

If you find this work helpful, please consider citing:

@misc{code-r1,
  title={Code-R1: Reproducing R1 for Code with Reliable Rewards},
  author={Liu, Jiawei and Zhang, Lingming},
  howpublished={\url{https://github.com/ganler/code-r1}},
  year={2025}
}

Acknowledgements

License

Apache-2.0. See LICENSE.code-r1 for more details.
