Official implementation of our research:
IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis [ICLR 2025 🚀]
Project page:
Authors:
Shitong Shao, Zikai Zhou, Lichen Bai, Haoyi Xiong, and Zeke Xie*
HKUST (Guangzhou) and Baidu Inc.
*: Corresponding author
Abstract:
The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI's Strawberry in enhancing performance by increasing the inference computational cost. Ample prior studies have demonstrated that correctly scaling up computation in the sampling process can successfully lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are barely perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on four benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, close to the 223.1 achieved by the closed-source Pika-2.0.
Motivation:
UMTScore (↑) vs. UMT-FVD (↓) with Animatediff on Chronomagic-Bench-150. In the legend, "R", "I", and "V" denote score-function estimation using random Gaussian noise, the IDM, and the VDM, respectively. The letters before the hyphen "-" describe the noise-injection (forward) steps, while the letters after it describe the denoising steps. For instance, "RR-II" stands for two steps of adding Gaussian noise followed by two denoising steps performed with the IDM.
We enhance the quality of each frame at every denoising step by performing the following additional operations: 1) first adding Gaussian noise and 2) then denoising with IDMs. Unfortunately, as illustrated in the figure above, the "R-[.]" approaches, which use Gaussian noise to perform the forward diffusion process, result in significantly lower quality of the synthesized video compared to the standard DDIM process (i.e., "Origin" in the figure). This phenomenon arises because "R-[.]" introduces too much invalid information (i.e., Gaussian noise) into the synthesized video during denoising. Given this, we instead adopt the more robust deterministic sampling paradigm to integrate the video denoising process with the image denoising process, as this paradigm is stable and effectively reduces the truncation error of practical discrete sampling.
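The snippet below is a minimal, self-contained sketch of this idea, not the repository's implementation: a deterministic DDIM inversion step and a per-frame image-model denoising step are wrapped around the usual video-model denoising step. All names (`ddim_step`, `iv_mixed_step`, `eps_image`, `eps_video`) and the toy noise schedule are hypothetical stand-ins; refer to the released code for the exact IV-Mixed Sampler.

import torch

def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM transition from noise level alpha_from to alpha_to."""
    x0_pred = (x - (1.0 - alpha_from).sqrt() * eps) / alpha_from.sqrt()
    return alpha_to.sqrt() * x0_pred + (1.0 - alpha_to).sqrt() * eps

def iv_mixed_step(x_t, t, t_prev, alphas, eps_image, eps_video):
    """One mixed denoising step on a latent video x_t of shape (frames, C, H, W).

    eps_image(x, t): per-frame noise prediction from an image diffusion model
                     (the frame axis is treated as a batch axis).
    eps_video(x, t): noise prediction from a video diffusion model.
    t_prev < t, i.e., the step moves toward the clean video.
    """
    a_t, a_next, a_prev = alphas[t], alphas[t + 1], alphas[t_prev]

    # 1) Deterministic re-noising (DDIM inversion) with the image model, instead of
    #    injecting fresh Gaussian noise as in the "R-[.]" variants of the figure.
    x_up = ddim_step(x_t, eps_image(x_t, t), a_t, a_next)

    # 2) Per-frame denoising with the image model to sharpen every frame.
    x_img = ddim_step(x_up, eps_image(x_up, t + 1), a_next, a_t)

    # 3) Regular video-model denoising step to keep the frames temporally coherent.
    return ddim_step(x_img, eps_video(x_img, t), a_t, a_prev)

# Toy smoke test with stand-in noise predictors (real IDM/VDM calls go here).
if __name__ == "__main__":
    alphas = torch.linspace(0.9999, 1e-4, 1001)   # toy cumulative-alpha schedule
    dummy_eps = lambda x, t: torch.zeros_like(x)  # placeholder predictors
    video = torch.randn(16, 4, 64, 64)            # 16 latent frames
    out = iv_mixed_step(video, t=500, t_prev=480, alphas=alphas,
                        eps_image=dummy_eps, eps_video=dummy_eps)
    print(out.shape)                              # torch.Size([16, 4, 64, 64])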
- All experiments are conducted on a single RTX 4090 GPU (24 GB).
Install the dependencies:
conda create -n ivs python=3.10.14
conda activate ivs
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers einops wandb accelerate pandas imageio imageio-ffmpeg
Run the following command to generate a video:
python inference.py --prompt "a beautiful girl" --pipeline Animatediff # pipeline: Animatediff, ModelScope, VideoCrafter
- The hyperparameters `--interval_begin` and `--interval_end` specify the start and end timesteps of the video generation process (an example command that sets all of these flags is shown after this list).
- The hyperparameter `--zz` controls the balance between the temporal coherence and the visual quality of the synthesized video: the larger `--zz` is, the more the generation process emphasizes temporal coherence; the smaller it is, the more it emphasizes visual quality.
- The hyperparameters `--i_sigma_begin` and `--i_sigma_end` specify the start and end CFG scales for the image diffusion model, while `--v_sigma_begin` and `--v_sigma_end` specify the start and end CFG scales for the video diffusion model. `--rho` controls the concavity/convexity of the scheduling curve between them.
- The hyperparameter `--lora` is only used in the `Animatediff` pipeline. The choices for `--lora` are `None`, `amechass`, and `beauty`; you can pick a LoRA model to introduce the desired semantic information into the video generation process. Currently, all LoRA weights are placed in the `lora` folder. You need to convert them to the diffusers format via the script `./lora/convert_lora_safetensor_to_diffusers.py`, for example:
python ./lora/convert_lora_safetensor_to_diffusers.py --checkpoint_path "./lora/amechass.safetensors" --dump_path "./lora/amechass/" --base_model_path "/path/to/stable-diffusion-v1-5"
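For reference, a full invocation that sets these flags could look like the command below. The numeric values are arbitrary placeholders meant only to illustrate the syntax, not recommended settings:
python inference.py --prompt "a beautiful girl" --pipeline Animatediff --interval_begin 0 --interval_end 1000 --zz 1.0 --i_sigma_begin 7.5 --i_sigma_end 9.0 --v_sigma_begin 7.5 --v_sigma_end 9.0 --rho 1.0 --lora beauty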
- Baseline Model with 512x512 resolution:
@inproceedings{
shao2025ivmixed,
title={{IV}-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis},
author={Shitong Shao and Zikai Zhou and Lichen Bai and Haoyi Xiong and Zeke Xie},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=ImpeMDJfVL}
}
Our code is built upon ViCo and diffusers. We thank these repositories for their valuable code.