Dataset Distillation via Committee Voting (CV-DD)

Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen

arXiv | BibTeX

This is an official PyTorch implementation of the paper Dataset Distillation via Committee Voting. In this work, we:

  • Propose a novel framework, Committee Voting for Dataset Distillation (CV-DD), which integrates multiple model perspectives to synthesize a distilled dataset that encapsulates rich features and produces high-quality soft labels via batch-specific normalization.

  • Establish a strong baseline within the CV-DD framework by integrating recent advancements and refining the framework design and optimization techniques; this baseline already achieves state-of-the-art performance in dataset distillation.

  • Demonstrate, through experiments across multiple datasets, that CV-DD improves generalization, mitigates overfitting, and outperforms prior methods in various data-limited scenarios, highlighting its effectiveness as a scalable and reliable solution for dataset distillation.

Abstract

Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models, our method captures a broader range of data features, mitigates model-specific biases, and enhances generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation.

Distilled Images

The distilled images for different datasets can be found here. We provide IPC values of 1, 10, and 50 for CV-DD, and IPC50 for $\text{SRe}^2\text{L}^{++}$. If you need IPC=1 or 10, you can simply sample them from the IPC50 dataset using the following code:

import os
import shutil

def sample_images(source_dir, target_dir, num_images=10):
    os.makedirs(target_dir, exist_ok=True)

    for subdir in sorted(os.listdir(source_dir)):
        sub_path = os.path.join(source_dir, subdir)
        if os.path.isdir(sub_path):
            images = sorted([f for f in os.listdir(sub_path) if f.endswith(('.jpg', '.png', '.jpeg'))])
            
            if len(images) >= num_images:
                sampled_images = images[:num_images]
                
                target_subdir = os.path.join(target_dir, subdir)
                os.makedirs(target_subdir, exist_ok=True)
                
                for image in sampled_images:
                    src_image_path = os.path.join(sub_path, image)
                    dest_image_path = os.path.join(target_subdir, image)
                    shutil.copy(src_image_path, dest_image_path)
            else:
                print(f"sub-directory {subdir} don't contain {num_images} images, skipping...")


source_directory = ""  # path to the IPC50 source directory (change me)
target_directory = ""  # path where the sampled subset will be written (change me)
sample_number = 0      # number of images to sample per class, e.g., 1 or 10 (change me)
sample_images(source_directory, target_directory, num_images=sample_number)
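For example, pointing source_directory at the IPC50 folder and setting sample_number to 10 produces an IPC10 subset that keeps the same per-class sub-directory layout as the source.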

Overall Configuration

Before running the code, please download the required materials from the Google Drive Link and store them in a single data folder. The folder itself can have any name, but it must contain the following sub-folders:

  • patches/
  • offline_models/
  • test_data/

Please ensure the names match exactly.

We expect the following format for storing the required data:

CV_DD_data/
├── offline_models/
│   ├── cifar10/
│   ├── cifar100/
│   ├── imagenet-nette/
│   └── tiny_imagenet/
├── patches/
│   ├── cifar10/
│   │   └── medium/
│   ├── cifar100/
│   │   └── medium/
│   ├── imagenet-nette/
│   │   └── medium/
│   └── tiny_imagenet/
│       └── medium/
└── test_data/
    ├── cifar10/
    ├── cifar100/
    ├── imagenet-nette/
    └── tiny_imagenet/

Note: Each folder under offline_models must contain five pretrained models: ResNet18, ResNet50, DenseNet121, ShuffleNetV2, and MobileNetV2.

After downloading the required files and organizing them into the format above, update Main_Data_path in the config.sh file to the absolute path of the data directory you created. For the layout above, this would be the absolute path of CV_DD_data.
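For reference, here is a minimal sketch of how such a committee of offline models might be assembled in PyTorch. The checkpoint filenames, the use of stock torchvision architectures, and the num_classes value are assumptions for illustration only; the repo's own model definitions and loaders may differ.

import os
import torch
from torchvision import models

# Minimal sketch only: build the five committee architectures and load
# hypothetical checkpoints from offline_models/<dataset>/.
def load_committee(offline_models_dir, dataset="cifar10", num_classes=10):
    builders = {
        "resnet18": models.resnet18,
        "resnet50": models.resnet50,
        "densenet121": models.densenet121,
        "shufflenetv2": models.shufflenet_v2_x1_0,
        "mobilenetv2": models.mobilenet_v2,
    }
    committee = {}
    for name, builder in builders.items():
        model = builder(num_classes=num_classes)
        ckpt_path = os.path.join(offline_models_dir, dataset, f"{name}.pth")  # hypothetical filename
        model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
        committee[name] = model.eval()
    return committee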

Squeeze

This process compresses (squeezes) the information of the original training data into various models. We provide scripts for squeezing different datasets into different models. More details can be found in squeeze/README.md.

Recover

This process generates the distilled data using the distributions and predictions of two models together with prior information. We provide scripts for distilling multiple datasets under different IPC settings. More details can be found in recover/README.md.

Relabel

This process generates high-quality soft labels using the BSSL technique. We provide scripts for generating soft labels for different datasets. More details can be found in relabel/README.md.
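As a purely conceptual sketch (not the repo's BSSL implementation, and omitting its batch-specific normalization), committee-style soft labels can be thought of as the averaged softened predictions of the committee models over each batch of distilled images; the temperature parameter below is an illustrative assumption:

import torch
import torch.nn.functional as F

# Conceptual sketch only: average the committee's softened predictions
# over a batch of distilled images to obtain soft labels.
@torch.no_grad()
def committee_soft_labels(committee, images, temperature=1.0):
    probs = []
    for model in committee.values():
        logits = model(images)                                 # [batch, num_classes]
        probs.append(F.softmax(logits / temperature, dim=1))
    return torch.stack(probs, dim=0).mean(dim=0)               # committee average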

Validate

This process validates the quality of the distilled data and the soft labels. More details can be found in validate/README.md.

Results

Our Top-1 accuracy (%) under different IPC settings across various datasets, compared with different state-of-the-art (SOTA) methods, is summarized in the table below:

(Results overview table)

Bibliography

If you find this repository helpful for your project, please consider citing our work:

@misc{cui2025datasetdistillationcommitteevoting,
      title={Dataset Distillation via Committee Voting}, 
      author={Jiacheng Cui and Zhaoyi Li and Xiaochen Ma and Xinyue Bi and Yaxin Luo and Zhiqiang Shen},
      year={2025},
      eprint={2501.07575},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07575}, 
}
