
DS2: Improving Data Efficiency via Curating LLM-Driven Rating Systems

Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.

REAL Lab, University of California, Santa Cruz


🎉🎉 News

  • [2025.03.13] 🔥🔥 Released the LLM-generated raw scores (GPT-4o-mini, Llama-3.1-8b-inst, Mistral-7b-v0.3-inst) for TULU_300k on Hugging Face!
  • [2025.01.22] 👍👍 Accepted by ICLR 2025.
  • [2024.11.10] 📢📢 Released the curated dataset on Hugging Face.
  • [2024.10.08] 🚀🚀 Released the code of DS2.

Brief Introduction

This project is motivated by the frequent and widespread errors in LLM-generated raw rating scores, which can vary significantly across different models. These score errors can be visualized with a score transition matrix: larger values on the matrix's diagonal indicate smaller score errors.

Score Transition Matrix
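For intuition, here is a minimal sketch of how such a transition matrix can be estimated from paired integer scores (e.g., a model's raw scores against reference or consensus scores). The 0-5 scale, the pairing, and the function name are illustrative assumptions, not this repository's exact implementation.

import numpy as np

def score_transition_matrix(ref_scores, raw_scores, num_levels=6):
    """Estimate P(raw score = j | reference score = i) from paired integer scores.

    Assumes scores are integers in [0, num_levels - 1]; the 0-5 scale and the
    pairing are illustrative assumptions, not the repo's exact setup.
    """
    counts = np.zeros((num_levels, num_levels))
    for i, j in zip(ref_scores, raw_scores):
        counts[i, j] += 1
    # Row-normalize: row i gives the distribution of raw scores when the reference score is i.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy usage: a heavier diagonal means the rater agrees more often with the reference.
ref = [5, 4, 4, 3, 2, 5, 1, 0]
raw = [5, 4, 3, 3, 2, 4, 1, 1]
print(score_transition_matrix(ref, raw).round(2))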

In response, we introduce DS2, a diversity-aware score curation approach to enhance data selection.

Overview of the Data Selection Pipeline

  • Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
  • Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
  • Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters, which tend to be more distinct.
  • Final Data Selection: We prioritize data by first sorting on the curated scores and then on the long-tail scores. This dual sorting strategy helps remove poor-quality outliers while ensuring a diverse, high-quality dataset (see the sketch after this list).
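The sketch below illustrates the last two bullets under simplifying assumptions: a long-tail diversity score measured as the distance to the k-th nearest neighbor in embedding space, followed by a dual sort on (curated score, diversity score). All function and variable names are hypothetical, not the repository's actual API.

import numpy as np

def longtail_diversity_scores(embeddings, k=10):
    """Distance to the k-th nearest neighbor; larger = more isolated / long-tail.
    Brute-force version for illustration only (O(n^2) memory)."""
    emb = np.asarray(embeddings, dtype=np.float32)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, k]  # column 0 is the self-distance (0)

def select_subset(samples, curated_scores, diversity_scores, budget):
    """Dual sort: curated quality score first, long-tail diversity score as tie-breaker."""
    order = sorted(range(len(samples)),
                   key=lambda i: (curated_scores[i], diversity_scores[i]),
                   reverse=True)
    return [samples[i] for i in order[:budget]]

# Toy usage with random embeddings and scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
quality = rng.integers(0, 6, size=100)
diversity = longtail_diversity_scores(emb, k=5)
subset = select_subset(list(range(100)), quality, diversity, budget=20)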

Dataset preparation

One can download the evaluation and training data by:

# eval data
bash model_finetune/prepare_eval_data.sh

# train data
bash model_finetune/prepare_train_data.sh

🚀🚀 Quick Start

🧩 Step 1. LLM-prompt-based rating

In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. One can obtain the LLM-generated rating scores by:

# Open-source LLMs
cd LLM_scoring && bash scoring.sh

# API call
cd LLM_scoring && bash scoring_api.sh
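For intuition about what these scripts do, here is a minimal hedged sketch of prompt-based rating through an OpenAI-compatible client; the prompt wording, scale, and parsing are illustrative assumptions and differ from the actual scoring scripts above.

# Minimal sketch of prompt-based rating; the prompt and parsing are illustrative
# assumptions, not the exact logic of scoring.sh / scoring_api.sh.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_sample(instruction: str, response: str, model: str = "gpt-4o-mini") -> int:
    prompt = (
        "Rate the quality of the following instruction-response pair on an "
        "integer scale from 0 (worst) to 5 (best). Reply with the number only.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    match = re.search(r"\d+", out.choices[0].message.content)
    return int(match.group()) if match else -1  # -1 marks an unparsable rating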

🧩 Step 2. Score curation

One can run score curation by:

cd score_curation && bash diagnose.sh

The corresponding curation report files can be found under score_curation_results/.
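As a rough, heavily simplified illustration of the idea behind score curation (the actual method in this repository is based on the score transition matrix), one can flag a sample whose LLM score disagrees with the consensus of its nearest neighbors in embedding space and replace it with that consensus. Everything below is a toy stand-in with hypothetical names.

import numpy as np

def curate_scores(embeddings, raw_scores, k=10):
    """Toy neighbor-consensus correction, illustrative only.

    For each sample, take the majority score among its k nearest neighbors
    (by Euclidean distance) and use it when it disagrees with the raw score.
    This is not the repository's actual curation algorithm.
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    scores = np.asarray(raw_scores)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    curated = scores.copy()
    for i in range(len(scores)):
        neighbors = np.argsort(dists[i])[1:k + 1]  # skip self at index 0
        values, counts = np.unique(scores[neighbors], return_counts=True)
        consensus = values[np.argmax(counts)]
        if consensus != scores[i]:
            curated[i] = consensus
    return curated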


🧩 Step 3. Data selection

Given the generated score curation reports, one can directly generate the high-quality subsets by:

python subset_generation.py

The generated subsets can then be used for LLM instruction tuning in the following step.


🧩 Step 4. Finetuning & Evaluation

The generated subsets under the selected_data path can be used for LLM instruction tuning. For easy reproduction, one can directly finetune the models with the TULU codebase by:

cd model_finetune && bash run_pipeline.sh

Citation

If you use this repository, please cite our work:

@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}