- [2024/12/10] 🔥 RCDMs is accepted by AAAI 2025.
- [2024/08/08] 🔥 We release the train and test code of RCDMs.
- [2024/07/02] 🔥 We release the paper of RCDMs for story generation.
Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which predominantly generate stories in an autoregressive and excessively caption-dependent manner, often underrate the contextual consistency and relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a novel two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, unlike autoregressive models, RCDMs can generate a consistent story in a single forward inference. Qualitative and quantitative results demonstrate that the proposed RCDMs outperform prior methods in challenging scenarios.
Story visualization aims to depict a continuous narrative through multiple captions and reference clips. It has profound applications in game development and comic drawing. Due to the technological leaps in generative models, text-to-image synthesis methods can now generate visually faithful images through text descriptions. However, generating a continuous story with style and temporal consistency still poses significant challenges. Our proposed Rich-contextual Conditional Diffusion Models (RCDMs) tackle these issues by introducing a two-stage diffusion model framework that incorporates rich contextual information at both the image and feature levels.
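To make the two-stage design above concrete, below is a minimal PyTorch sketch of the idea. It is not the repository's actual implementation: module names, dimensions, the learnable-query design, and the simple concatenation/bias fusion are illustrative assumptions (the real stage-2 model is a conditional latent diffusion denoiser).

```python
# Minimal sketch of the two-stage RCDMs idea (illustrative assumptions throughout:
# module names, dimensions, and the fusion scheme are NOT the repository's real code).
import torch
import torch.nn as nn


class FramePriorTransformer(nn.Module):
    """Stage 1: predict the frame semantic embedding of the unknown clip
    from caption embeddings and the known clip's frame embeddings."""

    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query for the unknown frame

    def forward(self, caption_emb, known_frame_emb):
        # caption_emb: (B, T_text, D); known_frame_emb: (B, T_known, D)
        query = self.query.expand(caption_emb.size(0), -1, -1)
        tokens = torch.cat([caption_emb, known_frame_emb, query], dim=1)
        return self.encoder(tokens)[:, -1:, :]  # (B, 1, D) predicted frame embedding


class RichContextDenoiser(nn.Module):
    """Stage 2: denoise the unknown frame latent under rich contextual conditions
    (reference-image latents, the predicted frame embedding, all caption embeddings)."""

    def __init__(self, latent_ch=4, dim=768):
        super().__init__()
        self.unet = nn.Conv2d(latent_ch * 2, latent_ch, 3, padding=1)  # stand-in for a conditional U-Net
        self.to_bias = nn.Linear(dim, latent_ch)                       # feature-level injection

    def forward(self, noisy_latent, ref_latent, frame_prior, caption_emb):
        context = torch.cat([frame_prior, caption_emb], dim=1).mean(dim=1)  # (B, D)
        bias = self.to_bias(context)[:, :, None, None]                      # (B, C, 1, 1)
        x = torch.cat([noisy_latent, ref_latent], dim=1)                    # image-level injection
        return self.unet(x) + bias                                          # predicted noise


# Toy usage with random tensors
prior, denoiser = FramePriorTransformer(), RichContextDenoiser()
frame_prior = prior(torch.randn(2, 77, 768), torch.randn(2, 4, 768))
noise = denoiser(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32),
                 frame_prior, torch.randn(2, 77, 768))
```

Because both stages condition on all captions and the known clip at once, the unknown frames can be generated in a single forward pass rather than one frame at a time.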
Dataset preparation follows the workflow outlined in ARLDM.
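As a quick sanity check after preprocessing, you can inspect the resulting HDF5 file. The snippet below is a sketch that assumes the ARLDM-style layout (one group per split, per-frame image datasets, and `|`-joined captions); the file path and key names are assumptions, so check your generated file for the exact keys.

```python
# Sanity check for an ARLDM-style HDF5 file (key names such as "train",
# "image0", and "text" are assumptions based on the ARLDM preprocessing
# scripts -- inspect your own file for the exact layout).
import h5py

with h5py.File("pororo.h5", "r") as f:           # path is illustrative
    print(list(f.keys()))                         # e.g. ['test', 'train', 'val']
    split = f["train"]
    print(list(split.keys()))                     # e.g. ['image0', ..., 'image4', 'text']
    captions = split["text"][0].decode("utf-8").split("|")  # captions of the first story
    print(len(captions), captions[0])
```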
- Python >= 3.8 (Recommend to use Anaconda or Miniconda)
- PyTorch >= 2.0.0
- CUDA == 11.8
```bash
conda create --name rcdms python=3.8.10
conda activate rcdms
pip install -U pip

# Install requirements
pip install -r requirements.txt
```
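After installation, an optional quick check that the environment matches the versions listed above:

```python
# Optional environment check (only verifies the versions listed above).
import torch

print(torch.__version__)          # expect >= 2.0.0
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # True if a CUDA GPU is visible
```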
Training (choose the script for your dataset):
```bash
# stage1
sh run_stage1_PororoSV.sh          # PororoSV
sh run_stage1_FlintstonesSV.sh     # FlintstonesSV

# stage2
sh run_stage2_PororoSV.sh          # PororoSV
sh run_stage2_FlintstonesSV.sh     # FlintstonesSV
```
Testing:
```bash
# stage1
python3 stage1_batchtest_rcdms_model.py

# stage2
python3 stage2_batchtest_rcdms_model.py
```

If you find RCDMs useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{shen2024boosting,
  title={Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models},
  author={Shen, Fei and Ye, Hu and Liu, Sibo and Zhang, Jun and Wang, Cong and Han, Xiao and Yang, Wei},
  journal={arXiv preprint arXiv:2407.02482},
  year={2024}
}
```
- IMAGEdit: Training-free controllable video editing with consistent object layout. [Controllable multi-object video editing]
- IMAGDressing: Controllable dressing generation. [Controllable dressing generation]
- IMAGGarment: Fine-grained controllable garment generation. [Controllable garment generation]
- IMAGHarmony: Controllable image editing with consistent object layout. [Controllable multi-object image editing]
- IMAGPose: Pose-guided person generation with high fidelity. [Controllable multi-mode person generation]
- RCDMs: Rich-contextual conditional diffusion for story visualization. [Controllable story generation]
- PCDMs: Progressive conditional diffusion for pose-guided image synthesis. [Controllable person generation]
- V-Express: Explores strong and weak conditional relationships for portrait video generation. [Controllable digital human generation]
- FaceShot: Talking-face plugin for any character. [Controllable anime digital human generation]
- CharacterShot: Controllable and consistent 4D character animation framework. [Controllable 4D character generation]
- StyleTailor: An agent for personalized fashion styling. [Personalized fashion agent]
- SignVip: Controllable sign language video generation. [Controllable sign language generation]
If you have any questions, please feel free to contact me at [email protected].



