Initial commit
submission10095 committed Nov 24, 2021
0 parents commit 7573b88
Showing 51 changed files with 6,262 additions and 0 deletions.
Empty file added LICENSE
203 changes: 203 additions & 0 deletions README.md
# DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cvb5xZY7NU6tgkpuYwiLIAOr9KPYJkv_)

<p align="center">
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/main1.png" />
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/main2.png" />
</p>

> **DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation**<br>
> Anonymous
>
>**Abstract**: <br>
> Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts.
> However, their applications to diverse real images are still difficult due to the limited GAN inversion capability.
> Specifically, these approaches often have difficulties in reconstructing images with novel poses, views, and highly variable contents compared to the training data, altering object identity, or producing unwanted image artifacts.
> To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models.
> Based on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains.
> Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation.
> Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to the existing baselines.

## Description

This repo includes the official PyTorch implementation of DiffusionCLIP, a CLIP-based text-guided image manipulation method for diffusion models.
DiffusionCLIP leverages [DDIM](https://arxiv.org/abs/2010.02502) sampling and its deterministic reversal for inversion,
which not only accelerates the manipulation but also enables nearly **perfect inversion**.
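
For reference, here is a sketch of the deterministic DDIM update that the inversion relies on (using the DDIM paper's notation, where $\alpha_t$ is the cumulative noise-schedule product and $\epsilon_\theta$ is the trained noise predictor):

$$
x_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(x_t, t)
$$

Because this update has no random noise term, applying it with $t$ increasing instead of decreasing maps a real image $x_0$ to a latent that reconstructs it almost exactly, which is what enables the near-perfect inversion mentioned above.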

DiffusionCLIP can perform the following tasks:

* Manipulation of Images in Trained Domain & to Unseen Domain
* Image Translation from Unseen Domain into Another Unseen Domain
* Generation of Images in Unseen Domain from Strokes
* Multiple attribute changes

Existing works often require **a combination of multiple models, tricky task-specific loss designs, or dataset preparation with large manual effort**. In contrast, our method is **free
of such effort** and enables these applications in a natural way with the original pretrained and fine-tuned
DiffusionCLIP models.

The training process is illustrated in the following figure:

![](imgs/method.png)


## Getting Started

### Installation
We recommend running our code using:

- NVIDIA GPU + CUDA, CuDNN
- Python 3, Anaconda

To install our implementation, clone our repository and run the following commands to install the necessary packages:
```shell script
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=<CUDA_VERSION>
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```
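
For example, with CUDA 11.0 (illustrative only — pick the `cudatoolkit` build that matches your local CUDA driver):

```shell script
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
```
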
### Resources
- For the original fine-tuning, 24 GB+ of VRAM is required for 256×256 images.
- For the GPU-efficient fine-tuning, 12 GB+ of VRAM is required for 256×256 images and 24 GB+ for 512×512 images.
- For inference, 6 GB+ of VRAM is required for 256×256 images and 9 GB+ for 512×512 images (see the command below to check your GPU's memory).
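
To verify that your GPU meets these requirements, you can query its memory with `nvidia-smi` (a standard NVIDIA utility, not part of this repository):

```
# Prints each GPU's name, total VRAM, and currently used VRAM
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```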

### Pretrained Models for DiffusionCLIP Fine-tuning

To manipulate source images into images in a CLIP-guided domain, the **pretrained diffusion models** are required.

| Image Type to Edit |Size| Pretrained Model | Dataset | Reference Repo.
|---|---|---|---|---
| Human face |256×256| Diffusion (Auto), [IR-SE50](https://drive.google.com/file/d/1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn/view) | [CelebA-HQ](https://arxiv.org/abs/1710.10196) | [SDEdit](https://github.com/ermongroup/SDEdit), [TreB1eN](https://github.com/TreB1eN/InsightFace_Pytorch)
| Church |256×256| Diffusion (Auto) | [LSUN-Church](https://www.yf.io/p/lsun) | [SDEdit](https://github.com/ermongroup/SDEdit)
| Bedroom |256×256| Diffusion (Auto) | [LSUN-Bedroom](https://www.yf.io/p/lsun) | [SDEdit](https://github.com/ermongroup/SDEdit)
| Dog face |256×256| [Diffusion](https://drive.google.com/file/d/14OG_o3aa8Hxmfu36IIRyOgRwEP6ngLdo/view) | [AFHQ-Dog](https://arxiv.org/abs/1912.01865) | [ILVR](https://github.com/jychoi118/ilvr_adm)
| ImageNet |512×512| [Diffusion](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/512x512_diffusion.pt) | [ImageNet](https://image-net.org/index.php) | [Guided Diffusion](https://github.com/openai/guided-diffusion)
- The pretrained diffusion models on 256×256 images from [CelebA-HQ](https://arxiv.org/abs/1710.10196), [LSUN-Church](https://www.yf.io/p/lsun), and [LSUN-Bedroom](https://www.yf.io/p/lsun) are downloaded automatically by the code.
- In contrast, you need to download the models pretrained on [AFHQ-Dog-256](https://arxiv.org/abs/1912.01865) or [ImageNet-512](https://image-net.org/index.php) from the links in the table above and put them in the `./pretrained` directory.
- In addition, to use the ID loss for preserving human face identity, you need to download the pretrained [IR-SE50](https://drive.google.com/file/d/1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn/view) model from [TreB1eN](https://github.com/TreB1eN/InsightFace_Pytorch) and put it in the `./pretrained` directory (see the download sketch below).
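
A possible way to set this up from the command line is sketched below. It assumes `gdown` is installed for Google Drive downloads (`pip install gdown`), and the output filenames are only illustrative — check the repository code for the exact filenames it expects:

```
mkdir -p pretrained
# Diffusion model pretrained on AFHQ-Dog-256 (Google Drive link from the table above)
gdown https://drive.google.com/uc?id=14OG_o3aa8Hxmfu36IIRyOgRwEP6ngLdo -O pretrained/afhq_dog_256.pt
# IR-SE50 model used by the ID loss
gdown https://drive.google.com/uc?id=1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn -O pretrained/model_ir_se50.pth
# ImageNet-512 model from Guided Diffusion
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/512x512_diffusion.pt -P pretrained/
```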


### Datasets
To precompute latents and fine-tune the diffusion models, you need about 30+ images in the source domain. You can use either **sampled images** from the pretrained models or **real source images** from the pretraining dataset.
If you want to use the [CelebA-HQ](https://drive.google.com/drive/folders/0B4qLcYyJmiz0TXY1NG02bzZVRGs?resourcekey=0-arAVTUfW9KRhN-irJchVKQ), [LSUN-Church](https://www.yf.io/p/lsun), [LSUN-Bedroom](https://www.yf.io/p/lsun), [AFHQ-Dog](https://github.com/clovaai/stargan-v2), or [ImageNet](https://image-net.org/index.php) datasets directly, you can download them and fill in their paths in `./configs/paths_config.py`.

### Colab Notebook [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cvb5xZY7NU6tgkpuYwiLIAOr9KPYJkv_)
We provide a Colab notebook for you to play with DiffusionCLIP! Due to the 12 GB VRAM limit in Colab, we only provide code for inference & applications with the fine-tuned DiffusionCLIP models, not the fine-tuning code.
We provide a wide range of edit types, but you can also upload models you fine-tuned following the instructions below and test them on Colab.



## DiffusionCLIP Fine-tuning


To fine-tune the pretrained Diffusion model guided by CLIP, run the following commands:

```
python main.py --clip_finetune \
--config celeba.yml \
--exp ./runs/test \
--edit_attr neanderthal \
--do_train 1 \
--do_test 1 \
--n_train_img 50 \
--n_test_img 10 \
--n_iter 5 \
--t_0 500 \
--n_inv_step 40 \
--n_train_step 6 \
--n_test_step 40 \
--lr_clip_finetune 8e-6 \
--id_loss_w 0 \
--l1_loss_w 1
```
- You can use `--clip_finetune_eff` instead of `--clip_finetune` to save GPU memory.
- `config`: `celeba.yml` for human face, `bedroom.yml` for bedroom, `church.yml` for church, `afhq.yml` for dog face, and `imagenet.yml` for images from ImageNet.
- `exp`: Experiment name.
- `edit_attr`: Attribute to edit. You can use the predefined source–target text pairs in `./utils/text_dic.py` or define a new pair.
  - Alternatively, you can use `--src_txts` and `--trg_txts` (see the example after this list).
- `do_train`, `do_test`: If you want to finish training quickly without checking outputs in the middle of training, you can set `do_test` to 1.
- `n_train_img`, `n_test_img`: Number of images in the trained domain used for training and testing.
- `n_iter`: Number of iterations of the generative process over the `n_train_img` images.
- `t_0`: Return step in [0, 1000). A high `t_0` enables stronger changes but may lose more of the identity or semantics of the original image.
- `n_inv_step`, `n_train_step`, `n_test_step`: Number of steps of the generative process for inversion, training, and testing, respectively. They lie in `[0, t_0]`. We usually use 40, 6, and 40 for `n_inv_step`, `n_train_step`, and `n_test_step`, respectively.
  - We found that the manipulation quality is better when `n_***_step` does not divide `t_0`, so we usually use 301, 401, 500, or 601 for `t_0`.
- `lr_clip_finetune`: Initial learning rate for CLIP-guided fine-tuning.
- `id_loss_w`, `l1_loss_w`: Weights of the ID loss and L1 loss when the CLIP loss weight is 3.
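
As a variant of the command above, the sketch below combines the memory-efficient flag with user-defined text prompts instead of a predefined attribute. The flag values are illustrative, not recommended settings, and the exact argument format of `--src_txts`/`--trg_txts` should be checked in `main.py`:

```
# Sketch: GPU-efficient fine-tuning with a custom source/target text pair
python main.py --clip_finetune_eff \
               --config celeba.yml \
               --exp ./runs/test_eff \
               --src_txts "face" \
               --trg_txts "smiling face" \
               --do_train 1 \
               --do_test 1 \
               --n_train_img 50 \
               --n_test_img 10 \
               --n_iter 5 \
               --t_0 401 \
               --n_inv_step 40 \
               --n_train_step 6 \
               --n_test_step 40 \
               --lr_clip_finetune 8e-6 \
               --id_loss_w 0 \
               --l1_loss_w 1
```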



## Novel Applications

The fine-tuned models through DiffusionCLIP can be leveraged to perform several novel applications.

### Manipulation of Images in Trained Domain & to Unseen Domain
![](imgs/app_1_manipulation.png)

You can edit one image into the CLIP-guided domain by running the following command:
```
python main.py --edit_one_image \
--config celeba.yml \
--exp ./runs/test \
--t_0 500 \
--n_inv_step 40 \
--n_test_step 40 \
--n_iter 1 \
--img_path imgs/celeb1.png \
--model_path checkpoint/neanderthal.pth
```
- `img_path`: Path of the image to edit.
- `model_path`: Path of the fine-tuned model to use.

You can edit multiple images from the dataset into the CLIP-guided domain by running the following command:
```
python main.py --edit_images_from_dataset \
--config celeba.yml \
--exp ./runs/test \
--n_test_img 50 \
--t_0 500 \
--n_inv_step 40 \
--n_test_step 40 \
--model_path checkpoint/neanderthal.pth
```


### Image Translation from Unseen Domain into Another Unseen Domain
![](imgs/app_2_unseen2unseen.png)


### Generation of Images in Unseen Domain from Strokes
![](imgs/app_3_stroke2unseen.png)

You can translate images from an unseen domain into another unseen domain (e.g., stroke or anime portrait ➝ Neanderthal) using the following command:

```
python main.py --unseen2unseen \
--config celeba.yml \
--exp ./runs/test \
--t_0 500 \
--bs_test 4 \
--n_iter 10 \
--n_inv_step 40 \
--n_test_step 40 \
--img_path imgs/stroke1.png \
--model_path checkpoint/neanderthal.pth
```
- `img_path`: Stroke image or source image in an unseen domain (e.g., a portrait).
- `n_iter`: Number of iterations of the stochastic forward and generative processes used to translate an unseen source image into an image in the trained domain. It should be larger than 8.


## Fine-tuned Models Using DiffusionCLIP

We provide a [Google Drive](https://drive.google.com/drive/folders/1Uwvm_gckanyRzQkVTLB6GLQbkSBoZDZF?usp=sharing) folder containing several models fine-tuned using DiffusionCLIP.
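
One way to fetch them from the command line (assuming a `gdown` version that supports folder downloads; the target directory matches the `--model_path checkpoint/...` paths used in the commands above):

```
# Download the shared folder of fine-tuned checkpoints into ./checkpoint
gdown --folder https://drive.google.com/drive/folders/1Uwvm_gckanyRzQkVTLB6GLQbkSBoZDZF -O checkpoint
```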


## Additional Results

Here, we show additional manipulations of real images from diverse datasets using DiffusionCLIP, where the original pretrained models
are trained on [AFHQ-Dog](https://arxiv.org/abs/1912.01865), [LSUN-Bedroom](https://www.yf.io/p/lsun), and [ImageNet](https://image-net.org/index.php), respectively.


<p align="center">
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/more_manipulation1.png" />
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/more_manipulation2.png" />
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/more_manipulation3.png" />
<img src="https://github.com/submission10095/DiffusionCLIP_temp/blob/master/imgs/more_manipulation4.png" />
</p>
35 changes: 35 additions & 0 deletions configs/afhq.yml
data:
    dataset: "AFHQ"
    category: "dog"
    image_size: 256
    channels: 3
    logit_transform: false
    uniform_dequantization: false
    gaussian_dequantization: false
    random_flip: true
    rescaled: true
    num_workers: 0

model:
    type: "simple"
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 1, 2, 2, 4, 4]
    num_res_blocks: 2
    attn_resolutions: [16, ]
    dropout: 0.0
    var_type: fixedsmall
    ema_rate: 0.999
    ema: True
    resamp_with_conv: True

diffusion:
    beta_schedule: linear
    beta_start: 0.0001
    beta_end: 0.02
    num_diffusion_timesteps: 1000

sampling:
    batch_size: 4
    last_only: True
35 changes: 35 additions & 0 deletions configs/bedroom.yml
data:
    dataset: "LSUN"
    category: "bedroom"
    image_size: 256
    channels: 3
    logit_transform: false
    uniform_dequantization: false
    gaussian_dequantization: false
    random_flip: true
    rescaled: true
    num_workers: 0

model:
    type: "simple"
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 1, 2, 2, 4, 4]
    num_res_blocks: 2
    attn_resolutions: [16, ]
    dropout: 0.0
    var_type: fixedsmall
    ema_rate: 0.999
    ema: True
    resamp_with_conv: True

diffusion:
    beta_schedule: linear
    beta_start: 0.0001
    beta_end: 0.02
    num_diffusion_timesteps: 1000

sampling:
    batch_size: 4
    last_only: True
35 changes: 35 additions & 0 deletions configs/celeba.yml
data:
    dataset: "CelebA_HQ"
    category: "CelebA_HQ"
    image_size: 256
    channels: 3
    logit_transform: false
    uniform_dequantization: false
    gaussian_dequantization: false
    random_flip: true
    rescaled: true
    num_workers: 0

model:
    type: "simple"
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 1, 2, 2, 4, 4]
    num_res_blocks: 2
    attn_resolutions: [16, ]
    dropout: 0.0
    var_type: fixedsmall
    ema_rate: 0.999
    ema: True
    resamp_with_conv: True

diffusion:
    beta_schedule: linear
    beta_start: 0.0001
    beta_end: 0.02
    num_diffusion_timesteps: 1000

sampling:
    batch_size: 4
    last_only: True
35 changes: 35 additions & 0 deletions configs/church.yml
data:
    dataset: "LSUN"
    category: "church_outdoor"
    image_size: 256
    channels: 3
    logit_transform: false
    uniform_dequantization: false
    gaussian_dequantization: false
    random_flip: true
    rescaled: true
    num_workers: 0

model:
    type: "simple"
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 1, 2, 2, 4, 4]
    num_res_blocks: 2
    attn_resolutions: [16, ]
    dropout: 0.0
    var_type: fixedsmall
    ema_rate: 0.999
    ema: True
    resamp_with_conv: True

diffusion:
    beta_schedule: linear
    beta_start: 0.0001
    beta_end: 0.02
    num_diffusion_timesteps: 1000

sampling:
    batch_size: 4
    last_only: True
35 changes: 35 additions & 0 deletions configs/imagenet.yml
data:
    dataset: "IMAGENET"
    category: "IMAGENET"
    image_size: 512
    channels: 3
    logit_transform: false
    uniform_dequantization: false
    gaussian_dequantization: false
    random_flip: true
    rescaled: true
    num_workers: 0

model:
    type: "simple"
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 1, 2, 2, 4, 4]
    num_res_blocks: 2
    attn_resolutions: [16, ]
    dropout: 0.0
    var_type: fixedsmall
    ema_rate: 0.999
    ema: True
    resamp_with_conv: True

diffusion:
    beta_schedule: linear
    beta_start: 0.0001
    beta_end: 0.02
    num_diffusion_timesteps: 1000

sampling:
    batch_size: 4
    last_only: True