
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Official implementation of the ACL 2025 paper.

Advancing LLM pre-training efficiency through multi-dimensional data quality assessment

🎯 Overview

The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Meta-rater introduces a groundbreaking multi-dimensional data selection framework that doubles convergence speed and improves downstream task performance by 3.23% compared to random selection.

๐Ÿ† Key Achievements

- 📈 2x Faster Convergence: Meta-rater achieves equivalent performance using only 15B tokens, compared to 30B tokens with random selection
- 🎯 3.23% Performance Gain: Significant improvement over random sampling on downstream tasks
- 🔍 Multi-dimensional Quality Assessment: Novel PRRC framework (Professionalism, Readability, Reasoning, Cleanliness)
- 📊 Scalable Framework: Benefits persist and increase from 1.3B to 7.2B parameter models
- 🏗️ Comprehensive Dataset: First fully annotated 627B-token SlimPajama, with 25 quality metrics per document

🧠 PRRC Framework

We introduce four novel evaluation dimensions to comprehensively assess data quality:

| Dimension | Description | F1 Score |
| --- | --- | --- |
| 🎓 Professionalism | Degree of expertise and technical knowledge required | 91.57% |
| 📖 Readability | Ease of understanding and text clarity | 87.47% |
| 🧮 Reasoning | Complexity of logical thinking and analysis | 89.59% |
| ✨ Cleanliness | Format quality and noise-free content | 87.88% |
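
To make the raters concrete, here is a minimal sketch of scoring a single document with one PRRC rater via Hugging Face `transformers`. The model id, head type, and score range are assumptions; check the model cards linked under "PRRC Rating Models" below for the released checkpoints.

```python
# Minimal sketch: score one document with a PRRC rater.
# NOTE: the model id below is a placeholder, and the head type
# (classification over score buckets vs. regression) depends on the
# released checkpoint -- consult the model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "opendatalab/prrc-professionalism"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Quantum error correction protects logical qubits from decoherence."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# For a classification head, the predicted bucket is the score;
# a regression head would instead emit a single scalar.
print("Professionalism score:", logits.argmax(dim=-1).item())
```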

🔬 Meta-rater Methodology

Our framework integrates 25 quality scores across three categories:

  1. Natural Language Quality Signals (11): Rule-based measures from RedPajama
  2. Data Importance Scores (3): DSIR similarity to Books, Wikipedia, and AutoMathText
  3. Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC

Algorithm Overview

```python
# Simplified Meta-rater workflow
weight_samples, val_losses = [], []
for _ in range(N_proxy_models):
    weights = generate_random_weights(25)  # random combination weights over the 25 scores
    selected_data = select_top_k(data, weights @ quality_scores)
    proxy_model = train_small_model(selected_data)
    weight_samples.append(weights)
    val_losses.append(evaluate(proxy_model, validation_set))

# Regress validation loss on the weight vectors, then pick the
# weights predicted to minimize it.
regression_model = fit_regression(weight_samples, val_losses)
optimal_weights = find_minimum(regression_model)
final_data = select_top_k(data, optimal_weights @ quality_scores)
```
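
The `fit_regression` / `find_minimum` steps can be sketched with off-the-shelf tools. Below is an illustrative version using a quadratic ridge surrogate and SLSQP over the weight simplex; the paper's actual estimator and search procedure may differ, and the data here are random stand-ins.

```python
# Illustrative regression-and-minimize step (not the paper's exact estimator).
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(25), size=256)  # stand-in for sampled weight vectors
val_losses = rng.random(256)              # stand-in for proxy validation losses

# Fit a quadratic surrogate of validation loss as a function of the weights.
surrogate = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
surrogate.fit(W, val_losses)

# Minimize the predicted loss over the probability simplex.
result = minimize(
    lambda w: surrogate.predict(w.reshape(1, -1))[0],
    x0=np.full(25, 1 / 25),
    bounds=[(0.0, 1.0)] * 25,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    method="SLSQP",
)
optimal_weights = result.x
```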

📊 Results

Main Results (1.3B Models, 30B Tokens)

| Method | General Knowledge | Commonsense Reasoning | Reading Comprehension | Average |
| --- | --- | --- | --- | --- |
| Random Baseline | 52.79 | 43.94 | 30.02 | 43.78 |
| QuRating-Educational Value | 57.66 | 46.72 | 28.10 | 46.16 |
| Meta-rater (All 25) | 58.90 | 45.41 | 31.55 | 47.01 |

Scaling Results

| Model Size | Method | Average Performance | Improvement |
| --- | --- | --- | --- |
| 3.3B | Random | 52.98% | - |
| 3.3B | Meta-rater | 54.71% | +1.73% |
| 7.2B | Random | 52.12% | - |
| 7.2B | Meta-rater | 55.24% | +3.12% |

💡 Key Insight: Meta-rater benefits increase with model scale, demonstrating that quality data selection becomes more valuable for larger models.

📦 Available Resources

🤖 Pre-trained Models

| Model | Size | Training Tokens | Selection Method | Downstream Task Avg. | HF Link |
| --- | --- | --- | --- | --- | --- |
| Meta-rater-1.3B | 1.3B | 30B | All 25 scores | 47.01% | Model |
| Meta-rater-3.3B | 3.3B | 100B | All 25 scores | 54.71% | Model |
| Meta-rater-7.2B | 7.2B | 150B | All 25 scores | 55.24% | Model |

🎯 PRRC Rating Models

| Model | Dimension | F1 Score (Test Set) | HF Link |
| --- | --- | --- | --- |
| Professionalism | Expertise assessment | 91.57% | Model |
| Readability | Text clarity rating | 87.47% | Model |
| Reasoning | Logic complexity assessment | 89.59% | Model |
| Cleanliness | Format quality evaluation | 87.88% | Model |

📊 Datasets

- Annotated SlimPajama-627B: Dataset
  - 627B tokens with 25 quality scores per document
  - First fully annotated large-scale pre-training dataset
  - Ready for research and production use
  - A minimal loading sketch follows this list
- Top 30B-token SlimPajama subset selected by the Professionalism rater: Dataset
- Top 30B-token SlimPajama subset selected by the Readability rater: Dataset
- Top 30B-token SlimPajama subset selected by the Reasoning rater: Dataset
- Top 30B-token SlimPajama subset selected by the Cleanliness rater: Dataset
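
A minimal sketch of streaming the annotated dataset with the `datasets` library; the repo id and field names here are placeholders, so consult the dataset card linked above for the actual identifiers and schema.

```python
# Stream the annotated SlimPajama rather than downloading 627B tokens.
# NOTE: the repo id below is a placeholder -- use the dataset link above.
from datasets import load_dataset

ds = load_dataset(
    "opendatalab/annotated-slimpajama-627b",  # placeholder repo id
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example.keys())  # expect the text plus 25 quality-score fields
```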

๐Ÿ“ PRRC Rating Prompts

All prompts for rating PRRC dimensions (Professionalism, Readability, Reasoning, Cleanliness) are provided in the prompts/ directory:

- `prompts/professionalism.txt`
- `prompts/readability.txt`
- `prompts/reason.txt`
- `prompts/cleanliness.txt`
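
For prompt-based annotation, here is a minimal sketch of assembling one rating prompt. How the document text is spliced into the template is an assumption; check the prompt files for their actual placeholder convention.

```python
# Assemble a Professionalism rating prompt for an external LLM.
# ASSUMPTION: the document is appended after the prompt instructions;
# the actual files may define a different placeholder convention.
from pathlib import Path

template = Path("prompts/professionalism.txt").read_text()
document = "An introduction to Hilbert spaces and bounded operators."
full_prompt = f"{template}\n\n{document}"
# full_prompt can now be sent to any chat/completions API and the
# returned rating parsed out of the model's reply.
```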

๐Ÿ› ๏ธ Training & Evaluation Scripts

The scripts/ directory contains shell scripts for training and evaluating PRRC raters:

- `scripts/prrc.sh`: Training script for PRRC raters
- `scripts/evaluation.sh`: Evaluation script for PRRC raters

๐Ÿง‘โ€๐Ÿ’ป Source Code for PRRC Raters

The src/ directory contains Python code for training and evaluating PRRC raters:

- `src/train_singletask.py`: Training script for a single PRRC dimension
- `src/test_singletask.py`: Evaluation script for a single PRRC dimension
- `src/utils.py`: Utility functions for data processing and model management

These resources enable full reproducibility of PRRC rater training, evaluation, and prompt-based annotation.

📈 Computational Efficiency

Meta-rater is designed for efficiency:

| Process | FLOPs (×10¹⁹) | Percentage of 1.3B Training |
| --- | --- | --- |
| Quality Score Rating | 33.0 | 141% |
| Meta-rater Construction | 0.18 | 0.8% |
| Total Overhead | 33.2 | 142% |

💡 Note: Quality scores are computed once and reused across multiple experiments. For larger models (3.3B+), the relative overhead drops sharply (17% of a 3.3B training run).
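
The percentages in the table are consistent with the standard `FLOPs ≈ 6 × parameters × tokens` training-cost approximation, as this quick back-of-the-envelope check shows:

```python
# Reproduce the overhead percentages, assuming the common
# 6 * N_params * N_tokens approximation for training FLOPs.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

overhead = 33.2e19  # total Meta-rater overhead from the table above

print(f"vs 1.3B / 30B-token run:  {overhead / train_flops(1.3e9, 30e9):.0%}")   # ~142%
print(f"vs 3.3B / 100B-token run: {overhead / train_flops(3.3e9, 100e9):.0%}")  # ~17%
```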

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📚 Citation

If you use Meta-rater in your research, please cite our paper:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

๐Ÿค Acknowledgments

- Shanghai Artificial Intelligence Laboratory for computational resources
- InternTrain Team for pre-training infrastructure support
- Community contributors for valuable feedback and improvements

📞 Contact


โญ Star us on GitHub if you find Meta-rater useful! โญ

Made with โค๏ธ by the OpenDataLab team
