📖 Report | 📊 MMK12 Datasets & Benchmark | 🤗 MM-Eureka-Qwen-7B | 🤗 MM-Eureka-Qwen-32B
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
We present MM-Eureka-Qwen-7B and MM-Eureka-Qwen-32B, two powerful multimodal reasoning models that successfully extend large-scale rule-based reinforcement learning (RL) to multimodal reasoning. Compared with the previous InternVL-based version of MM-EUREKA, we have made improvements in model architecture, algorithms, and data. For instance, MM-Eureka-Qwen-7B achieves 66.1 on the MMK12 evaluation sets, only 0.2 points below InternVL2.5-78B. On MathVista (testmini), it reaches 73.0, even surpassing InternVL2.5-78B. MM-Eureka-Qwen-32B demonstrates stronger performance, scoring 72.3 on the MMK12 evaluation sets, which exceeds both Qwen2.5-VL-72B's 70.3 and closed-source models such as Gemini2-Flash, ranking second only to o1's 73.9. On commonly used multimodal mathematical reasoning benchmarks, MM-Eureka-Qwen-32B achieves 73.4 on WeMath, outperforming all open-source models and most closed-source models, including Claude3.7-Sonnet. On MathVista, it reaches 74.8, surpassing all open-source and closed-source models. Both variants demonstrate significant improvements in multidisciplinary K12 and mathematical reasoning performance, outperforming most open-source models of similar size.
Core Improvements:
- We further iterated the codebase to support algorithms including Online Filter, ADORA, and DAPO.
- We open-source our self-collected MMK12 dataset, which contains 15k diverse, high-quality training samples, plus 2k multiple-choice questions covering Math, Physics, Chemistry, and Biology for evaluation.
- We train MM-Eureka-Qwen-7B and MM-Eureka-Qwen-32B, which are among the top performers in multimodal reasoning among open-source models of similar size, especially on multidisciplinary K12 tasks.
🔥 We open-source our complete pipeline to foster further research in this area. We release all our code, models, and data at MM-EUREKA-Qwen.
- [2025/05/19] We released MM-PRM.
  - 📖 Report: MM-PRM-Report
  - 🤗 Model: MM-PRM
  - 💻 Code: MM-PRM-Code
- [2025/05/19] We proposed a novel RL algorithm called Clipped Policy Gradient Optimization with Policy Drift (CPGD), which is based on a policy gradient loss with a clipping mechanism and a policy drift regularizer. In our experiments, we found that it is more stable and performs better than GRPO.
  - 📖 Report: CPGD-Report, CPGD-arxiv
  - 🤗 Model: MM-Eureka-CPGD-Qwen-7B
  - 💻 Code: MM-Eureka-Qwen-Code
- [2025/04/15] We released MM-Eureka-Qwen-7B, MM-Eureka-Qwen-32B, and MMK12.
  - 📖 Report: MM-Eureka-Qwen-Report, MM-Eureka-Qwen-arxiv
  - 🤗 Model: MM-Eureka-Qwen-7B
  - 🤗 Model: MM-Eureka-Qwen-32B
  - 📊 Dataset: MMK12
  - 💻 Code: MM-Eureka-Qwen-Code
- [2025/03/27] We released MM-Eureka-Qwen.
  - 📖 Report: MM-Eureka-Qwen-Report
  - 🤗 Model: MM-Eureka-Qwen-7B
  - 📊 Dataset: MM-Eureka-Dataset
  - 💻 Code: MM-Eureka-Qwen-Code
- [2025/03/07] We released MM-Eureka.
  - 📖 Paper: MM-Eureka-paper
  - 🤗 Model: MM-Eureka-8B & MM-Eureka-Zero-38B
  - 📊 Dataset: MM-Eureka-Dataset
  - 💻 Code: MM-Eureka-Code
This repository is built upon OpenRLHF, introducing several key enhancements:
- Multimodal RFT Support: Extends OpenRLHF to incorporate vision-language models (VLMs), currently supporting InternVL, enabling multimodal reasoning capabilities.
- Currently supports RLOO, REINFORCE++, and GRPO training using Ray.
- vLLM integration and distributed training.
- Supports the hybrid engine (`--colocate_all_models`, `--vllm_enable_sleep`).
- Better Rule-based Reward support: improved training visualization for rule-based rewards (e.g., Format Reward, Accuracy Reward, Repetition Penalty).
- Enhanced Online Filtering: filters out experiences based on the Accuracy Reward during training, as in PRIME (see the sketch after this list).
  - Use `--enable_accuracy_filter`, `--freezing_filter_steps`, `--accuracy_lower_bound`, and `--accuracy_upper_bound` to control the behavior of the online accuracy filter.
- ADORA: Enable Adaptive Online Rollout Adjustment by using `--use_adora` and `--adora_lamda`, as in ADORA.
- DAPO: Use `--use_dapo` to enable the DAPO loss during training, as in DAPO.
- CPGD: Use `--use_cpg_loss` and `--use_policy_drift` to enable the CPGD loss during training, as in CPGD. Additionally, `--policy_drift_coef` controls the weight of the policy drift regularizer, and `--policy_drift_clip_eps` controls the clipping range in policy drift. `--use_clip_filter_like_weight` enables the clip-filter-like weight proposed in CPGD, and `--clip_filter_like_weight_clip_eps` controls the clipping range in the clip-filter-like weight.
  - Example scripts are provided in `MM-EUREKA/examples/scripts/train_cpgd_qwen_7b_single_node.sh` and `MM-EUREKA/examples/scripts/train_cpgd_qwen_7b_multi_node.sh`.
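The snippet below is a minimal sketch of the online accuracy filtering idea referenced above: rollout groups whose mean accuracy reward falls outside the configured bounds are dropped from the training batch. The function name `filter_by_accuracy` and the dictionary layout are illustrative assumptions, not the repository's actual API; the bounds correspond conceptually to `--accuracy_lower_bound` and `--accuracy_upper_bound`.

```python
from typing import List

def filter_by_accuracy(
    groups: List[dict],
    lower_bound: float = 0.0,
    upper_bound: float = 1.0,
) -> List[dict]:
    """Illustrative online accuracy filter (names are hypothetical).

    Each `group` is assumed to hold the rollouts sampled for one prompt, with
    group["accuracy_rewards"] being a list of 0/1 accuracy rewards. Groups whose
    mean accuracy falls outside (lower_bound, upper_bound) are discarded, so
    prompts that are always right or always wrong contribute no gradient signal,
    similar in spirit to PRIME-style filtering.
    """
    kept = []
    for group in groups:
        rewards = group["accuracy_rewards"]
        mean_acc = sum(rewards) / max(len(rewards), 1)
        if lower_bound < mean_acc < upper_bound:
            kept.append(group)
    return kept

# Example: with bounds (0.0, 1.0), only prompts with mixed outcomes survive.
batch = [
    {"prompt_id": 0, "accuracy_rewards": [1, 1, 1, 1]},  # always correct -> dropped
    {"prompt_id": 1, "accuracy_rewards": [0, 1, 0, 1]},  # mixed -> kept
    {"prompt_id": 2, "accuracy_rewards": [0, 0, 0, 0]},  # always wrong -> dropped
]
print([g["prompt_id"] for g in filter_by_accuracy(batch)])  # [1]
```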
Based on the key factors identified by https://github.com/ModalMinds/MM-EUREKA for achieving stable training, we enhanced the model, dataset, and algorithmic modules. Specifically, we maintained the strategy of omitting the KL divergence term and applying data filtering, while implementing the following critical modifications:
- The base model was upgraded from InternVL2.5-8B-Instruct to the more powerful Qwen2.5-VL-7B-Instruct.
- The Vision Transformer (ViT) module was frozen during training.
- The underlying RL algorithm was switched from the previously used RLOO to GRPO (see the sketch after this list).
- The data filtering strategy was transitioned from an offline approach to an online approach.
- Additional data from the K12 dataset was collected, expanding the total dataset size to 15,000 samples.
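For readers unfamiliar with GRPO, the sketch below illustrates its core group-relative advantage computation: rewards for rollouts of the same prompt are normalized by the group's mean and standard deviation, removing the need for a learned critic. This is a generic illustration under standard GRPO assumptions, not the code used in this repository.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages as used by GRPO (generic illustration).

    `rewards` has shape (num_prompts, group_size): one row per prompt, one
    column per sampled rollout. Each reward is normalized against the
    statistics of its own group.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 rollouts each, rule-based rewards in [0, 1].
rewards = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
print(grpo_advantages(rewards))
```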
| Model | MathVista | MathVerse | MathVision | OlympiadBench | WeMath | MMK12 |
|---|---|---|---|---|---|---|
| Claude3.7-Sonnet | 66.8 | 52.0 | 41.3 | 48.9 | 72.6 | 55.3 |
| GPT-4o | 63.8 | 50.2 | 30.4 | 35.0 | 68.8 | 49.9 |
| o1 | 73.9 | 57.0 | 60.3 | 68.0 | 98.7 | 73.9 |
| Gemini2-flash | 70.4 | 59.3 | 41.3 | 51.0 | 71.4 | 65.2 |
| Qwen-2.5-VL-7B | 68.2 | 47.9 | 25.4 | 20.2 | 62.1 | 53.6 |
| Qwen-2.5-VL-32B | 74.7/71.7 | 49.9 | 40.1 | 30.0 | 69.1 | 66.8 |
| Qwen-2.5-VL-72B | 74.8 | 57.6 | 38.1 | 40.4 | 72.4 | 70.5 |
| InternVL2.5-78B | 72.3 | 51.7 | 32.2 | 31.1 | 66.3 | 61.6 |
| QVQ-72B-Preview | 71.4 | 48.2 | 35.9 | 33.2 | 65.4 | 61.5 |
| Adora-7B | 73.5 | 50.1 | 23.0 | 20.1 | 64.2 | 58.1 |
| R1-Onevision-7B | 64.1 | 47.1 | 29.9/23.5 | 17.3 | 61.8 | 39.8 |
| MM-Eureka-Qwen-7B | 73.0 | 50.3 | 26.9 | 20.1 | 66.1 | 64.5 |
| MM-Eureka-Qwen-32B | 74.8 | 56.5 | 34.4 | 35.9 | 73.4 | 72.2 |
| MM-Eureka-CPGD-Qwen-7B | 74.0 | 50.6 | 28.3 | 21.4 | 68.3 | 65.3 |
- 🤗 MM-Eureka-Qwen-7B
- 🤗 MM-Eureka-Qwen-32B
- 🤗 MM-Eureka-CPGD-Qwen-7B
```shell
git clone https://github.com/ModalMinds/MM-EUREKA.git
cd MM-EUREKA
git checkout qwen
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
```
You can download our training data from MMK12.
Once downloaded, refer to the section below for additional data formatting.
For custom datasets, format your data into a JSONL file, where each entry is a dictionary organized in the following format:
```json
{
  "id": "0",
  "message": "[{\"role\": \"user\", \"content\": [{\"type\": \"image\", \"image\": \"file:///path/to/your/image.jpg\"}, {\"type\": \"text\", \"text\": \"How many cats in the image?\"}]}]",
  "answer": "gt that could be parsed and verified by math_verify"
}
```
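Below is a minimal Python sketch of how such a JSONL file could be generated. The image path, question, answer, and output filename are placeholders; the key detail is that the `message` field is stored as a JSON-encoded string rather than a nested object, and `answer` should hold a ground truth that math_verify can parse and verify.

```python
import json

# Placeholder example entry; adapt the image path, question, and answer.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/your/image.jpg"},
        {"type": "text", "text": "How many cats are in the image?"},
    ],
}]

entry = {
    "id": "0",
    # The "message" field is stored as a JSON string, not a nested object.
    "message": json.dumps(messages),
    "answer": "3",  # ground truth parsable/verifiable by math_verify
}

with open("custom_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```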
Before starting your own training, ensure that the paths in the provided training scripts are correctly set and that environment variables like `$MASTER_ADDR` and `$NODE_RANK` are properly configured.
Start MM-Eureka-Qwen-7B training:

- For single node:

  ```shell
  sh examples/scripts/train_mm_eureka_qwen_7b_single_node.sh
  ```

- For multiple nodes:

  ```shell
  sh examples/scripts/train_mm_eureka_qwen_7b_multi_node.sh
  ```
MM-Eureka is still under active development. If you want to contribute, please feel free to open a pull request or create an issue.
Please refer to CONTRIBUTING.md before you dive in!
If you have any questions or would like to engage with our community, feel free to scan the QR code below to join our WeChat group.
We acknowledge the outstanding open-source contributions from OpenRLHF, LMM-R1 and vLLM. We also extend our gratitude to DeepSeek-R1, InternVL and QwenVL for their open-source techniques and base models, which have enabled us to further our exploration.
@article{meng2025mmeureka,
title={MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning},
author={Fanqing Meng and Lingxiao Du and Zongkai Liu and Zhixiang Zhou and Quanfeng Lu and Daocheng Fu and Tiancheng Han and Botian Shi and Wenhai Wang and Junjun He and Kaipeng Zhang and Ping Luo and Yu Qiao and Qiaosheng Zhang and Wenqi Shao},
year={2025},
journal={arXiv preprint arXiv:2503.07365},
}
@article{du2025mmprm,
title={MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision},
author={Lingxiao Du and Fanqing Meng and Zongkai Liu and Zhixiang Zhou and Ping Luo and Qiaosheng Zhang and Wenqi Shao},
year={2025},
journal={arXiv preprint arXiv:2505.13427},
}
@article{liu2025cpgd,
title={CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models},
author={Zongkai Liu and Fanqing Meng and Lingxiao Du and Zhixiang Zhou and Chao Yu and Wenqi Shao and Qiaosheng Zhang},
year={2025},
journal={arXiv preprint arXiv:2505.12504},
}