The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Meta-rater introduces a groundbreaking multi-dimensional data selection framework that doubles convergence speed and improves downstream task performance by 3.23% compared to random selection.
- 2x Faster Convergence: Meta-rater achieves equivalent performance using only 15B tokens compared to 30B tokens with random selection
- 3.23% Performance Gain: Significant improvement over random sampling on downstream tasks
- Multi-dimensional Quality Assessment: Novel PRRC framework (Professionalism, Readability, Reasoning, Cleanliness)
- Scalable Framework: Benefits persist and increase from 1.3B to 7.2B parameter models
- Comprehensive Dataset: First fully annotated 627B-token SlimPajama with 25 quality metrics
We introduce four novel evaluation dimensions to comprehensively assess data quality:
| Dimension | Description | F1 Score |
|---|---|---|
| Professionalism | Degree of expertise and technical knowledge required | 91.57% |
| Readability | Ease of understanding and text clarity | 87.47% |
| Reasoning | Complexity of logical thinking and analysis | 89.59% |
| Cleanliness | Format quality and noise-free content | 87.88% |
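For orientation, here is a minimal sketch of how a fine-tuned PRRC rater could be applied to score a single document. It assumes a standard Hugging Face sequence-classification checkpoint with a single regression head and uses a hypothetical checkpoint path; see `src/test_singletask.py` for the repository's actual evaluation code.

```python
# Minimal sketch of scoring one document with a fine-tuned PRRC rater.
# Assumptions (not taken from this repo): the rater is a standard Hugging Face
# sequence-classification checkpoint with a single regression head, and
# "path/to/professionalism-rater" is a hypothetical local checkpoint path.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "path/to/professionalism-rater"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
model.eval()

text = "Transformers compute attention weights via scaled dot products ..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # higher = more professional
print(f"Professionalism score: {score:.2f}")
```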
Our framework integrates 25 quality scores across three categories:
- Natural Language Quality Signals (11): Rule-based measures from RedPajama
- Data Importance Scores (3): DSIR similarity to Books, Wikipedia, and AutoMathText
- Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC
```python
# Simplified Meta-rater workflow (pseudocode)
all_weights, validation_losses = [], []
for i in range(N_proxy_models):
    weights = generate_random_weights(25)            # random combination of the 25 quality scores
    selected_data = select_top_k(data, weights @ quality_scores)
    proxy_model = train_small_model(selected_data)   # cheap proxy model
    validation_losses.append(evaluate(proxy_model, validation_set))
    all_weights.append(weights)

regression_model = fit_regression(all_weights, validation_losses)  # weights -> predicted val loss
optimal_weights = find_minimum(regression_model)                   # weights minimizing predicted loss
final_data = select_top_k(data, optimal_weights @ quality_scores)
```
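To make the regression-and-search step at the end of the pseudocode concrete, the sketch below fits a ridge regression over the proxy results and runs a random search over candidate weight vectors. The estimator, the Dirichlet sampling, and the placeholder arrays are illustrative assumptions, not the paper's exact choices.

```python
# Illustrative regression + weight search (not the paper's exact estimator).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Placeholders standing in for the real proxy-model results collected above:
# one 25-dim weight vector and one validation loss per proxy run.
proxy_weights = rng.dirichlet(np.ones(25), size=256)
proxy_losses = rng.normal(3.0, 0.1, size=256)

# Fit a simple surrogate mapping weight vectors to predicted validation loss.
surrogate = Ridge(alpha=1.0).fit(proxy_weights, proxy_losses)

# Random search: keep the candidate weights with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(25), size=100_000)
optimal_weights = candidates[np.argmin(surrogate.predict(candidates))]
print(optimal_weights.round(3))
```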
| Method | General Knowledge | Commonsense Reasoning | Reading Comprehension | Average |
|---|---|---|---|---|
| Random Baseline | 52.79 | 43.94 | 30.02 | 43.78 |
| QuRating-Educational Value | 57.66 | 46.72 | 28.10 | 46.16 |
| Meta-rater (All 25) | 58.90 | 45.41 | 31.55 | 47.01 |
| Model Size | Method | Average Performance | Improvement |
|---|---|---|---|
| 3.3B | Random | 52.98% | - |
| 3.3B | Meta-rater | 54.71% | +1.73% |
| 7.2B | Random | 52.12% | - |
| 7.2B | Meta-rater | 55.24% | +3.12% |
💡 Key Insight: Meta-rater's benefits increase with model scale, demonstrating that quality-based data selection becomes more valuable for larger models.
- 627B tokens with 25 quality scores per document
- First fully annotated large-scale pre-training dataset
- Ready for research and production use
- Top 30B-token SlimPajama subset selected by the Professionalism rater
- Top 30B-token SlimPajama subset selected by the Readability rater
- Top 30B-token SlimPajama subset selected by the Reasoning rater
- Top 30B-token SlimPajama subset selected by the Cleanliness rater
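As a usage illustration, subsets like the ones above can be reproduced locally by ranking documents on a single quality score. The sketch below assumes the annotations are stored as JSON-lines with a `text` field and one field per quality score; the file name and field names are hypothetical placeholders, not the dataset's actual schema.

```python
# Minimal sketch of selecting a high-quality subset from the annotated data.
# The file name and field names below are hypothetical placeholders.
import json
import heapq

def top_k_by_score(path, score_field, k):
    """Return the k documents with the highest value of `score_field`."""
    with open(path, "r", encoding="utf-8") as f:
        docs = (json.loads(line) for line in f)
        return heapq.nlargest(k, docs, key=lambda d: d[score_field])

# e.g. keep the 10,000 most professional documents from one shard
subset = top_k_by_score("slimpajama_annotated.jsonl", "professionalism", 10_000)
```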
All prompts for rating the PRRC dimensions (Professionalism, Readability, Reasoning, Cleanliness) are provided in the `prompts/` directory:
- `prompts/professionalism.txt`
- `prompts/readability.txt`
- `prompts/reason.txt`
- `prompts/cleanliness.txt`
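The sketch below shows one way these prompt files could be wired into an annotation loop. The `{document}` placeholder convention and the `rate_with_llm` call are assumptions for illustration, not part of this repository.

```python
# Minimal sketch of prompt-based PRRC annotation. Assumptions: each prompt file
# contains a template with a placeholder for the document text (the exact
# placeholder token is hypothetical), and `rate_with_llm` stands in for
# whichever LLM API you use to obtain the rating.
from pathlib import Path

def build_rating_prompt(dimension: str, document: str) -> str:
    template = Path(f"prompts/{dimension}.txt").read_text(encoding="utf-8")
    # Hypothetical placeholder convention; adapt to the actual prompt format.
    return template.replace("{document}", document)

prompt = build_rating_prompt("readability", "Photosynthesis converts light ...")
# score = rate_with_llm(prompt)   # call your preferred LLM endpoint here
```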
The `scripts/` directory contains shell scripts for training and evaluating PRRC raters:
- `scripts/prrc.sh`: Training script for PRRC raters
- `scripts/evaluation.sh`: Evaluation script for PRRC raters
The `src/` directory contains Python code for training and evaluating PRRC raters:
- `src/train_singletask.py`: Training script for a single PRRC dimension
- `src/test_singletask.py`: Evaluation script for a single PRRC dimension
- `src/utils.py`: Utility functions for data processing and model management
These resources enable full reproducibility of PRRC rater training, evaluation, and prompt-based annotation.
Meta-rater is designed for efficiency:
| Process | FLOPs (×10¹⁹) | Percentage of 1.3B Training |
|---|---|---|
| Quality Score Rating | 33.0 | 141% |
| Meta-rater Construction | 0.18 | 0.8% |
| Total Overhead | 33.2 | 142% |
💡 Note: Quality scores are computed once and reused across multiple experiments. For larger models (3.3B+), the relative overhead shrinks sharply (17% for 3.3B training).
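Because the quality scores are rated once and then reused, the overhead amortizes across runs; the sketch below redoes that arithmetic from the table's own figures (the implied training budgets are back-calculated from the quoted percentages, not independently measured).

```python
# Amortization of the one-time overhead, using only the figures quoted above.
overhead = 33.2e19                 # FLOPs: rating (33.0e19) + Meta-rater construction (0.18e19)
train_1p3b = overhead / 1.42       # implied cost of the 1.3B training run
train_3p3b = overhead / 0.17       # implied cost of the 3.3B training run

for n_runs in (1, 5, 10):          # number of experiments reusing the same scores
    per_run = overhead / n_runs
    print(f"{n_runs:>2} run(s): {per_run / train_1p3b:.0%} of a 1.3B run, "
          f"{per_run / train_3p3b:.0%} of a 3.3B run")
```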
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use Meta-rater in your research, please cite our paper:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
- Shanghai Artificial Intelligence Laboratory for computational resources
- InternTrain Team for pre-training infrastructure support
- Community contributors for valuable feedback and improvements
- Project Lead: Ren Ma ([email protected])
- Corresponding Author: Conghui He ([email protected])
- Issues: Please use GitHub Issues for bug reports and feature requests
⭐ Star us on GitHub if you find Meta-rater useful! ⭐
Made with ❤️ by the OpenDataLab team