ali-vilab/CDT

Conditioned Diffusion-based Video Tokenizer (CDT)

Official implementation for our paper:

Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

Author List: Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan*

Equal contribution; * Corresponding author

Folder Specification

├── evaluate.py  # script for evaluating the performance of CDT on reconstruction task
├── model # directory of CDT model
│   └── cdt.py # definition of CDT model
├── opensora_evaluate # scripts for metrics calculation
│   ├── cal_lpips.py # calculate LPIPS
│   ├── cal_psnr.py # calculate PSNR
│   └── cal_ssim.py # calculate SSIM
├── pretrained # directory of pretrained models (create this folder yourself)
│   ├── cdt_base.ckpt # CDT-base
│   └── cdt_small.ckpt # CDT-small
├── README.md # README
├── rec_image_eval.py # script for evaluating the performance of CDT on image reconstruction task
├── rec_video_eval.py # script for evaluating the performance of CDT on video reconstruction task
├── requirements.txt # dependencies
└── utils.py # utility functions
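As a reminder of what the PSNR score reported by the metric scripts measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). It is not the code in `opensora_evaluate/cal_psnr.py`; the function name `psnr`, the flat-list input format, and the 8-bit `max_value` default are assumptions made for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized flat pixel sequences.

    Hypothetical helper for illustration; assumes pixel values in [0, max_value].
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([100, 100], [110, 90]))
```

Higher PSNR means the reconstruction is closer to the original; identical inputs give an infinite score.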

Preparation

Environment Setup

Create a new environment and install the dependencies by running the following commands:

conda create -n cdt python=3.10
conda activate cdt
pip install -r requirements.txt

Download Pre-trained Models

We provide the pre-trained models, i.e., CDT-base and CDT-small, on Hugging Face. You can download them and put them in the pretrained folder.
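Before running the evaluation, it can help to confirm that both checkpoints are where the scripts expect them. The sketch below assumes the file names from the folder listing (`cdt_base.ckpt`, `cdt_small.ckpt`) and that it runs from the repository root; `missing_checkpoints` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# File names taken from the folder listing in this README.
EXPECTED_CHECKPOINTS = ("cdt_base.ckpt", "cdt_small.ckpt")


def missing_checkpoints(pretrained_dir="pretrained"):
    """Return the expected checkpoint files that are absent (hypothetical helper)."""
    root = Path(pretrained_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```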

Prepare Data

In our paper, we use two datasets for benchmarking the reconstruction performance:

  • COCO2017-val for image reconstruction
  • WebVid-val for video reconstruction

Download both datasets and put them in the 'data' folder, then set the paths to the two datasets in the 'evaluate.py' file.

Evaluation

Evaluate the performance of CDT-base on image reconstruction:

python evaluate.py --method CDT-base --dataset coco17 --mode image

Evaluate the performance of CDT-base on video reconstruction at 256x256 resolution:

python evaluate.py --method CDT-base --dataset webvid --mode video

Evaluate the performance of CDT-base on video reconstruction at 720x720 resolution:

python evaluate.py --method CDT-base --dataset webvid --mode video --size 720

Evaluate the performance of CDT-small on image reconstruction:

python evaluate.py --method CDT-small --dataset coco17 --mode image

Evaluate the performance of CDT-small on video reconstruction at 256x256 resolution:

python evaluate.py --method CDT-small --dataset webvid --mode video

Evaluate the performance of CDT-small on video reconstruction at 720x720 resolution:

python evaluate.py --method CDT-small --dataset webvid --mode video --size 720

The reconstructed images or videos and the evaluation results will be saved in the reconstructed_results folder.
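The six commands above differ only in the method, dataset, mode, and optional `--size` flag, so they can be generated in one place. The following is a minimal sketch that only builds and prints the command lines; `build_eval_commands` is a hypothetical helper, and the flag values are copied from the commands listed above.

```python
import sys

METHODS = ("CDT-base", "CDT-small")
# (dataset, mode, size); size None means the 256x256 run, which passes no --size flag.
SETTINGS = (
    ("coco17", "image", None),
    ("webvid", "video", None),
    ("webvid", "video", 720),
)


def build_eval_commands(python=sys.executable):
    """Return one argument list per evaluation run (hypothetical helper)."""
    commands = []
    for method in METHODS:
        for dataset, mode, size in SETTINGS:
            cmd = [python, "evaluate.py", "--method", method,
                   "--dataset", dataset, "--mode", mode]
            if size is not None:
                cmd += ["--size", str(size)]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_eval_commands():
        print(" ".join(cmd))
```

To actually run the evaluations, each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root.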

Citation

If you find this work useful in your research, please consider citing:

@article{yang2025rethinking,
  title={Rethinking Video Tokenization: A Conditioned Diffusion-based Approach},
  author={Yang, Nianzu and Li, Pandeng and Zhao, Liming and Li, Yang and Xie, Chen-Wei and Tang, Yehui and Lu, Xudong and Liu, Zhihang and Zheng, Yun and Liu, Yu and Yan, Junchi},
  journal={arXiv preprint arXiv:2503.03708},
  year={2025}
}

Acknowledgement

We would like to thank the authors of Open-Sora-Plan for their excellent work, which provides the code for the evaluation metrics.

Contact

Feel free to contact us at [email protected] with any questions.
