News • Method • Results
Getting Started • Citation • Acknowledgement
We propose GenPRM, a strong generative process reward model with the following features:
- performing explicit CoT reasoning and code verification before providing the process judgment;
- improving Monte Carlo estimation and hard labels with Relative Progress Estimation (RPE);
- supporting GenPRM test-time scaling in a parallel manner with majority voting (see the sketch after this list);
- supporting policy model test-time scaling with GenPRM as verifiers or critics.
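To make the parallel test-time scaling idea concrete, the snippet below is a minimal sketch of majority voting over several sampled GenPRM judgments for a single step. The function and variable names are illustrative and not part of the released code:

```python
from collections import Counter

def majority_vote(step_judgments: list[bool]) -> bool:
    """Aggregate N sampled GenPRM judgments for one step by majority voting.

    Illustrative only: in practice the judgments come from N independent
    GenPRM generations (parallel test-time scaling of the reward model).
    """
    counts = Counter(step_judgments)
    return counts[True] >= counts[False]

# Hypothetical example: 5 sampled judgments for the same solution step.
sampled_judgments = [True, True, False, True, False]
print(majority_vote(sampled_judgments))  # True -> the step is accepted
```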
We will release all code, models, and data, including:
- GenPRM models with 1.5B, 7B, 14B, 32B, and 70B parameters (ongoing);
- 23K training samples derived from the MATH dataset;
- all implementation details, including solution generation, Monte Carlo estimation, RPE, and model training and inference (ongoing).
- [2025-04-24] The full data synthesis code is available.
- [2025-04-14] GenPRM is covered by Synced (机器之心)!
- [2025-04-06] The evaluation code and GenPRM-32B are available.
- [2025-04-05] The inference code is available.
- [2025-04-03] Our models (GenPRM-1.5B & GenPRM-7B) and training data are released on Hugging Face.
- [2025-04-01] Our paper is released on arXiv.
Our framework:
Clone the repository:
git clone https://github.com/RyanLiu112/GenPRM.git
cd GenPRM/src
Create a new conda environment and install the dependencies:
conda create -n GenPRM python=3.10
conda activate GenPRM
pip install -r requirements.txt
Try GenPRM in action with:
- Interactive Jupyter Notebook: demo.ipynb (quick start of GenPRM inference)
- Process Supervision Cases: Case 1 | Case 2
For a quick start, you can use the genprm_inference module to run model inference:
from prm_evaluation.genprm_inference import GenPRM, CodeExecutor
genprm = GenPRM('GenPRM/GenPRM-7B')
messages = [
{"role": "system", "content": "You are a math teacher. Your task is to review and critique the paragraphs in solution step by step."},
{"role": "user", "content": "Question: Jo adds up all the positive integers from 1 to 100. Kate does a similar thing with the first 100 positive integers; however, she first rounds every integer to its nearest multiple of 10 (rounding 5s up) and then adds the 100 values. What is the positive difference between Jo's sum and Kate's sum?\n\nFirst, we need to calculate Jo's sum, which is the sum of all positive integers from 1 to 100. This can be directly computed using the formula for the sum of the first \\(n\\) positive integers, which is \\(\\frac{n(n+1)}{2}\\). For \\(n = 100\\), Jo's sum is \\(\\frac{100 \\cdot 101}{2} = 5050\\)."},
]
code_executor = CodeExecutor()
# Score the first solution step (cur_step=1); returns the model's critique(s) and the process reward
output, reward = genprm.inference(messages, cur_step=1, code_executor=code_executor)
print("Model output for the first solution step: " + output[0])
print(reward)
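Beyond scoring a single step, the per-step rewards can be aggregated into a solution-level score and used for policy test-time scaling, e.g., Best-of-N selection with GenPRM as the verifier. The snippet below is a minimal, hypothetical sketch of that aggregation; it is not part of the released API:

```python
def solution_score(step_rewards: list[float]) -> float:
    """Aggregate per-step process rewards into a single solution score.

    A common choice is the minimum step reward (a solution is only as good
    as its weakest step); the mean or the last-step reward are alternatives.
    """
    return min(step_rewards)

# Hypothetical Best-of-N selection over candidate solutions.
candidates = {
    "solution_a": [0.9, 0.8, 0.7],
    "solution_b": [0.9, 0.2, 0.8],
}
best = max(candidates, key=lambda name: solution_score(candidates[name]))
print(best)  # solution_a
```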
Generate policy steps
# example of math
bash reward_generation/steps_generate.sh \
--LM models--Qwen--Qwen2.5-7B-Instruct \
--round 0 \
--bs 4 \
--mt 6000 \
--n_gpus 1 \
--task math \
--loop 1
Generate Monte Carlo scores
# example of math
bash reward_generation/mt_score_generate.sh \
--LM models--Qwen--Qwen2.5-Math-7B-Instruct \
--ORIGIN models--Qwen--Qwen2.5-7B-Instruct \
--round 0 \
--bs 4 \
--mt 6000 \
--n_gpus 1 \
--task math \
--loop 1
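The Monte Carlo scores produced above are turned into process labels, with Relative Progress Estimation (RPE) refining the usual hard labels. As a rough illustration of the idea (not necessarily the repository's exact rule), a step can be labeled by comparing its Monte Carlo estimate with that of the previous step:

```python
def rpe_labels(mc_scores: list[float], threshold: float = 0.8) -> list[int]:
    """Illustrative Relative Progress Estimation (RPE) labeling.

    Assumption (not necessarily the repo's exact rule): a step is labeled
    correct (1) when it still has a nonzero chance of reaching the answer
    and its Monte Carlo estimate keeps at least `threshold` of the previous
    step's estimate, i.e. the step preserves enough relative progress.
    """
    labels, prev = [], 1.0  # the question itself is treated as fully solvable
    for score in mc_scores:
        labels.append(1 if score > 0 and score >= threshold * prev else 0)
        prev = score
    return labels

# Hypothetical Monte Carlo estimates for a 4-step solution.
print(rpe_labels([0.9, 0.85, 0.3, 0.0]))  # -> [1, 1, 0, 0]
```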
Generate reasoning data
# example of math
python rationale_generation/process.py \
--model_path "Qwen/QwQ-32B" \
--data_path _output/monte_carlo_processed/math_train_Qwen2.5-Math-7B-Instruct_monte_carlo \
--save_path _output/reasoning_output/math_train_QwQ_reasoning \
--num_gpu_per 1 \
--majority_of_N 1
Execute policy refinement based on GenPRM's split output
python prm_evaluation/policy_refine.py \
--model_path "Qwen/Qwen2.5-7B-Instruct" \
--data_path "_output/split_output/..."\
--split_out "_output/split_refine/..."
Note
Our mathematical expression evaluation code is based on Qwen2.5-Math. For a more powerful evaluator, please refer to this repository: Math-Verify.
If you find this work helpful, please kindly cite our paper:
@article{zhao2025genprm,
title = {GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning},
author = {Jian Zhao and Runze Liu and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
journal = {arXiv preprint arXiv:2504.00891},
year = {2025}
}
Our collection of PRMs in Awesome-Process-Reward-Models:
@misc{Awesome-Process-Reward-Models,
title = {Awesome Process Reward Models},
author = {Runze Liu and Jian Zhao and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
howpublished = {\url{https://github.com/RyanLiu112/Awesome-Process-Reward-Models}},
note = {GitHub repository},
year = {2025}
}
Our recent work on LLM test-time scaling with PRMs:
@article{liu2025can,
title = {Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling},
author = {Runze Liu and Junqi Gao and Jian Zhao and Kaiyan Zhang and Xiu Li and Biqing Qi and Wanli Ouyang and Bowen Zhou},
journal = {arXiv preprint arXiv:2502.06703},
year = {2025}
}
The model training is based on axolotl and RLHFlow. The mathematical evaluation code is based on Qwen2.5-Math.