Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen
This is an official PyTorch implementation of the paper Dataset Distillation via Committee Voting. In this work, we:
- We propose a novel framework, Committee Voting for Dataset Distillation (CV-DD), which integrates multiple model perspectives to synthesize a distilled dataset that encapsulates rich features and produces high-quality soft labels via batch-specific normalization.
- By integrating recent advancements and refining the framework design and optimization techniques, we establish a strong baseline within the CV-DD framework that already achieves state-of-the-art performance in dataset distillation.
- Through experiments across multiple datasets, we demonstrate that CV-DD improves generalization, mitigates overfitting, and outperforms prior methods in various data-limited scenarios, highlighting its effectiveness as a scalable and reliable solution for dataset distillation.
Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models, our method captures a broader range of data features, mitigates model-specific biases, and enhances generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, highlighting its potential for efficient and accurate dataset distillation.
The distilled images for different datasets can be found here. We provide IPC values of 1, 10, and 50 for CV-DD, and IPC50 for
```python
import os
import shutil

def sample_images(source_dir, target_dir, num_images=10):
    os.makedirs(target_dir, exist_ok=True)
    for subdir in sorted(os.listdir(source_dir)):
        sub_path = os.path.join(source_dir, subdir)
        if os.path.isdir(sub_path):
            images = sorted([f for f in os.listdir(sub_path) if f.endswith(('.jpg', '.png', '.jpeg'))])
            if len(images) >= num_images:
                sampled_images = images[:num_images]
                target_subdir = os.path.join(target_dir, subdir)
                os.makedirs(target_subdir, exist_ok=True)
                for image in sampled_images:
                    src_image_path = os.path.join(sub_path, image)
                    dest_image_path = os.path.join(target_subdir, image)
                    shutil.copy(src_image_path, dest_image_path)
            else:
                print(f"Sub-directory {subdir} does not contain {num_images} images, skipping...")

source_directory = ""  # Remember to change me to the desired source directory
target_directory = ""  # Remember to change me to the desired target directory
sample_number = 0      # The number of images to sample per class
sample_images(source_directory, target_directory, num_images=sample_number)
```
To run the code, please download the required materials from the Google Drive link and store them in a folder of your choice. The folder itself may be named anything, but it must contain the following sub-folders:
- patches/
- offline_models/
- test_data/
Please ensure the names match exactly.
We expect the following format for storing the required data:
```
CV_DD_data/
├── offline_models/
│   ├── cifar10/
│   ├── cifar100/
│   ├── imagenet-nette/
│   └── tiny_imagenet/
├── patches/
│   ├── cifar10/
│   │   └── medium/
│   ├── cifar100/
│   │   └── medium/
│   ├── imagenet-nette/
│   │   └── medium/
│   └── tiny_imagenet/
│       └── medium/
└── test_data/
    ├── cifar10/
    ├── cifar100/
    ├── imagenet-nette/
    └── tiny_imagenet/
```
Note: Each folder under offline_models must contain 5 pretrained models: ResNet18, ResNet50, DenseNet121, ShuffleNetV2, and MobileNetV2.
After downloading the required files and organizing them into the format above, update Main_Data_path in the config.sh file to the absolute path of the data directory you created. For the layout above, this would be the absolute path of CV_DD_data.
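If you want to double-check your setup before running anything, a small optional sanity check along the lines below (not part of the repository) can confirm that the three required sub-folders exist and count the pretrained checkpoints per dataset; the checkpoint file extensions used here are an assumption.

```python
# Optional sanity check (not part of the repository): verify the layout above.
import os

def check_layout(main_data_path):
    # The three required sub-folders from the expected format.
    for sub in ("patches", "offline_models", "test_data"):
        assert os.path.isdir(os.path.join(main_data_path, sub)), f"missing {sub}/"
    # Each dataset folder under offline_models/ should hold five pretrained models.
    offline = os.path.join(main_data_path, "offline_models")
    for dataset in sorted(os.listdir(offline)):
        ckpts = [f for f in os.listdir(os.path.join(offline, dataset))
                 if f.endswith((".pth", ".pt", ".ckpt"))]  # extensions are an assumption
        print(f"{dataset}: {len(ckpts)} checkpoint(s) found (expected 5)")

check_layout("/absolute/path/to/CV_DD_data")  # replace with your Main_Data_path
```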
This process compresses the information of the original training data into various models. We provide scripts for compressing different models on different datasets. More details can be found in squeeze/README.md.
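For intuition only, the sketch below shows what "compressing" a dataset into a model amounts to: standard supervised training followed by saving the checkpoint under offline_models/<dataset>/. The model choice, class count, file name, and hyperparameters are illustrative assumptions; use the scripts documented in squeeze/README.md for the actual procedure.

```python
# Conceptual sketch of the squeeze step (assumptions: ResNet18, 10 classes, SGD).
import os
import torch
import torch.nn.functional as F
from torchvision import models

def squeeze(train_loader, dataset_name, data_root, epochs=1, device="cuda"):
    model = models.resnet18(num_classes=10).to(device)  # class count is illustrative
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    # Store the "compressed" knowledge as a checkpoint in the expected layout.
    save_dir = os.path.join(data_root, "offline_models", dataset_name)
    os.makedirs(save_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(save_dir, "resnet18.pth"))
```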
This process generates the distilled data using two models' distributions and predictions, together with prior information. We provide scripts for distilling multiple datasets under different IPC settings. More details can be found in recover/README.md.
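As a rough illustration of the committee idea only (this is not the repository's recover code, and it omits the distribution-matching terms and the prior information used by the actual pipeline), the sketch below optimizes synthetic images against the averaged loss of two pretrained members so that no single model's biases dominate. Model choices, image size, and hyperparameters are assumptions.

```python
# Minimal two-member committee sketch: optimize synthetic images against the
# averaged cross-entropy of both models (assumed architectures and sizes).
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# In practice these would be the checkpoints stored under offline_models/<dataset>/.
committee = [models.resnet18(num_classes=10), models.mobilenet_v2(num_classes=10)]
for m in committee:
    m.eval().to(device)
    for p in m.parameters():
        p.requires_grad_(False)  # only the synthetic images are optimized

# Synthetic images and their target labels (IPC = 1 for 10 classes, as an example).
x_syn = torch.randn(10, 3, 32, 32, device=device, requires_grad=True)
y_syn = torch.arange(10, device=device)
optimizer = torch.optim.Adam([x_syn], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    # Committee voting: average the members' losses so no single model dominates.
    loss = sum(F.cross_entropy(m(x_syn), y_syn) for m in committee) / len(committee)
    loss.backward()
    optimizer.step()
```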
This process generates high-quality soft labels using the BSSL technique. We provide scripts for generating soft labels for different datasets. More details can be found in relabel/README.md.
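Below is a minimal sketch of the batch-specific idea, under the assumption that soft labels are produced with the teacher's BatchNorm layers kept in per-batch mode so that each batch is scored under its own normalization statistics; the actual BSSL procedure and scripts are described in relabel/README.md.

```python
# Illustrative batch-specific soft-labeling sketch (function name and details are assumptions).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_soft_labels(teacher, loader, device="cuda"):
    teacher.to(device)
    teacher.train()           # keep BatchNorm in per-batch (batch-specific) mode
    soft_labels = []
    for images, _ in loader:  # fixed batch order keeps labels aligned with images
        logits = teacher(images.to(device))
        soft_labels.append(F.softmax(logits, dim=1).cpu())
    return soft_labels
```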
This process validates the quality of the distilled data and the soft labels. More details can be found in validate/README.md.
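Conceptually, validation boils down to training a fresh student on the distilled images with the precomputed soft labels and then reporting Top-1 accuracy on the real test split. The sketch below assumes a KL-divergence objective; the actual schedule and hyperparameters are in validate/README.md.

```python
# Illustrative validation sketch (assumed objective and helper names).
import torch
import torch.nn.functional as F

def kd_step(student, images, soft_targets, optimizer):
    # One update on distilled images using the stored soft labels as targets.
    optimizer.zero_grad()
    log_p = F.log_softmax(student(images), dim=1)
    loss = F.kl_div(log_p, soft_targets, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def top1_accuracy(model, test_loader, device="cuda"):
    # Evaluate the trained student on the real test data (test_data/<dataset>/).
    model.eval().to(device)
    correct = total = 0
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total
```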
Our Top-1 accuracy (%) under different IPC settings across various datasets, compared with different state-of-the-art (SOTA) methods, is summarized in the table below:
If you find this repository helpful for your project, please consider citing our work:
```bibtex
@misc{cui2025datasetdistillationcommitteevoting,
  title={Dataset Distillation via Committee Voting},
  author={Jiacheng Cui and Zhaoyi Li and Xiaochen Ma and Xinyue Bi and Yaxin Luo and Zhiqiang Shen},
  year={2025},
  eprint={2501.07575},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.07575},
}
```