Official implementation of our research:
IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis [ICLR 2025 🚀]
Project page:
Authors:
Shitong Shao, Zikai Zhou, Lichen Bai, Haoyi Xiong, and Zeke Xie*
HKUST (Guangzhou) and Baidu Inc.
*: Corresponding author
Abstract:
The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI's Strawberry in enhancing performance by increasing the inference computational cost. Ample prior studies have demonstrated that correctly scaling up computation in the sampling process can successfully lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are barely perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on four benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, close to the 223.1 achieved by the closed-source Pika-2.0.
Motivation:
UMTScore (↑) vs. UMT-FVD (↓) with Animatediff on Chronomagic-Bench-150. In the legend, "R", "I", and "V" denote score-function estimation using random Gaussian noise, the IDM, and the VDM, respectively. The letters before the hyphen "-" describe the noise-injection (forward) steps, while the letters after it describe the denoising steps. For instance, "RR-II" stands for two steps of adding Gaussian noise followed by two denoising steps performed with the IDM.
We enhance the quality of each frame at every denoising step by performing the following additional operations: 1) first adding Gaussian noise and 2) then denoising with IDMs. Unfortunately, as illustrated in the figure above, the "R-[.]" approaches, which use Gaussian noise to perform the forward diffusion process, result in significantly lower quality of the synthesized video compared to the standard DDIM process (i.e., "Origin" in the figure). This phenomenon arises because "R-[.]" introduces too much invalid information (i.e., Gaussian noise) into the synthesized video during denoising. Given this, we instead adopt the more robust deterministic sampling paradigm to integrate the video denoising process with the image denoising process, as this paradigm is stable and effectively reduces the truncation error of practical discrete sampling.
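The snippet below is a minimal, self-contained sketch of this idea, not the repository's implementation: a deterministic DDIM inversion step and a per-frame image-model denoising step are wrapped around the usual video-model denoising step. All names (`ddim_step`, `iv_mixed_step`, `eps_image`, `eps_video`) and the toy noise schedule are hypothetical stand-ins; refer to the released code for the exact IV-Mixed Sampler.

import torch

def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM transition from noise level alpha_from to alpha_to."""
    x0_pred = (x - (1.0 - alpha_from).sqrt() * eps) / alpha_from.sqrt()
    return alpha_to.sqrt() * x0_pred + (1.0 - alpha_to).sqrt() * eps

def iv_mixed_step(x_t, t, t_prev, alphas, eps_image, eps_video):
    """One mixed denoising step on a latent video x_t of shape (frames, C, H, W).

    eps_image(x, t): per-frame noise prediction from an image diffusion model
                     (the frame axis is treated as a batch axis).
    eps_video(x, t): noise prediction from a video diffusion model.
    t_prev < t, i.e., the step moves toward the clean video.
    """
    a_t, a_next, a_prev = alphas[t], alphas[t + 1], alphas[t_prev]

    # 1) Deterministic re-noising (DDIM inversion) with the image model, instead of
    #    injecting fresh Gaussian noise as in the "R-[.]" variants of the figure.
    x_up = ddim_step(x_t, eps_image(x_t, t), a_t, a_next)

    # 2) Per-frame denoising with the image model to sharpen every frame.
    x_img = ddim_step(x_up, eps_image(x_up, t + 1), a_next, a_t)

    # 3) Regular video-model denoising step to keep the frames temporally coherent.
    return ddim_step(x_img, eps_video(x_img, t), a_t, a_prev)

# Toy smoke test with stand-in noise predictors (real IDM/VDM calls go here).
if __name__ == "__main__":
    alphas = torch.linspace(0.9999, 1e-4, 1001)   # toy cumulative-alpha schedule
    dummy_eps = lambda x, t: torch.zeros_like(x)  # placeholder predictors
    video = torch.randn(16, 4, 64, 64)            # 16 latent frames
    out = iv_mixed_step(video, t=500, t_prev=480, alphas=alphas,
                        eps_image=dummy_eps, eps_video=dummy_eps)
    print(out.shape)                              # torch.Size([16, 4, 64, 64])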
- All experiments are conducted on a single RTX 4090 GPU (24 GB).
Install the dependencies:
conda create -n ivs python=3.10.14
conda activate ivs
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers einops wandb accelerate pandas imageio imageio-ffmpeg
Run the following command to generate a video:
python inference.py --prompt "a beautiful girl" --pipeline Animatediff # pipeline: Animatediff, ModelScope, VideoCrafter
- The hyperparameters `--interval_begin` and `--interval_end` specify the start and end timesteps of the video generation process (an example command that sets all of these flags is shown after this list).
- The hyperparameter `--zz` controls the balance between the temporal coherence and the visual quality of the synthesized video: the larger `--zz` is, the more the generation process emphasizes temporal coherence; the smaller it is, the more it emphasizes visual quality.
- The hyperparameters `--i_sigma_begin` and `--i_sigma_end` specify the start and end CFG scales for the image diffusion model, while `--v_sigma_begin` and `--v_sigma_end` specify the start and end CFG scales for the video diffusion model. `--rho` controls the concavity/convexity of the scheduling curve between them.
- The hyperparameter `--lora` is only used in the `Animatediff` pipeline. The choices for `--lora` are `None`, `amechass`, and `beauty`; you can pick a LoRA model to introduce the desired semantic information into the video generation process. Currently, all LoRA weights are placed in the `lora` folder. You need to convert them to the diffusers format via the script `./lora/convert_lora_safetensor_to_diffusers.py`, for example:
python ./lora/convert_lora_safetensor_to_diffusers.py --checkpoint_path "./lora/amechass.safetensors" --dump_path "./lora/amechass/" --base_model_path "/path/to/stable-diffusion-v1-5"
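For reference, a full invocation that sets these flags could look like the command below. The numeric values are arbitrary placeholders meant only to illustrate the syntax, not recommended settings:
python inference.py --prompt "a beautiful girl" --pipeline Animatediff --interval_begin 0 --interval_end 1000 --zz 1.0 --i_sigma_begin 7.5 --i_sigma_end 9.0 --v_sigma_begin 7.5 --v_sigma_end 9.0 --rho 1.0 --lora beauty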
- Baseline Model with 512x512 resolution:
@inproceedings{
shao2025ivmixed,
title={{IV}-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis},
author={Shitong Shao and Zikai Zhou and Lichen Bai and Haoyi Xiong and Zeke Xie},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=ImpeMDJfVL}
}
Our code is built upon ViCo and diffusers. We thank these repositories for their valuable code.