This project reproduces and extends the paper "InstanceDiffusion: Instance-level Control for Image Generation" by Wang et al. (2024). Please refer to our blogpost for detailed information on the implementation of our reproduction and extension of the InstanceDiffusion model.
- Linux or macOS with Python ≥ 3.8
- PyTorch ≥ 2.0 and a torchvision version that matches the PyTorch installation. Install them together via pytorch.org to ensure compatibility.
- OpenCV ≥ 4.6, required for the demo and visualization.
conda create --name instdiff python=3.8 -y
conda activate instdiff
pip install -r requirements.txt
Run the following script to download the MSCOCO dataset (to `src/lib/instancediffusion/datasets/`):
bash src/scripts/download_coco.sh
To run the InstanceDiffusion inference demos locally, we provide `src/lib/instancediffusion/inference.py` together with multiple JSON files in `src/lib/instancediffusion/demos`, which specify the text prompts and location conditions for generating specific images. Before running these demos, please download the pretrained InstanceDiffusion model from Hugging Face or Google Drive, as well as SD1.5, and place them under the `src/lib/instancediffusion/pretrained` folder. The demos can then be run locally using the bash scripts in `src/scripts/inference_demos/`, in the following format:
bash src/scripts/inference_demos/bbox_demos/demo_cat_dog_robin.sh
bash src/scripts/inference_demos/iterative_generation/demo_iterative_r1.sh
bash src/scripts/inference_demos/point_demos/demo_corgi_kitchen.sh
The demo outputs are saved in the folder `src/data/demo_outputs/`.
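Each demo JSON pairs a global caption with per-instance prompts and locations. The sketch below builds a hypothetical specification of that kind (a caption plus bounding boxes in `[x, y, w, h]` pixel coordinates) and writes it to disk; the exact keys and coordinate conventions used by the bundled demo files may differ, so treat this as an illustration of the idea rather than the actual schema.

```python
import json

# Hypothetical demo specification: a global caption plus one entry per instance.
# The real demo JSONs under src/lib/instancediffusion/demos may use different keys.
demo_spec = {
    "caption": "a cat and a dog sitting on a sofa",
    "width": 512,
    "height": 512,
    "instances": [
        {"caption": "a grey cat", "bbox": [40, 220, 180, 240]},        # [x, y, w, h] in pixels
        {"caption": "a golden retriever", "bbox": [280, 200, 200, 260]},
    ],
}

with open("my_demo.json", "w") as f:
    json.dump(demo_spec, f, indent=2)
```

A file like this could then be referenced from a demo script in the same way the provided JSONs are.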
InstanceDiffusion enables image compositions with a granularity that ranges from entire instances to individual parts and subparts. The placement of these parts and subparts can inherently modify the object's overall pose.
*(Demo video: `instdiff-turnhead-ezgif.com-video-speed.mp4`)*
InstanceDiffusion supports image generation by using points, with each point representing an instance, along with corresponding instance captions.
InstanceDiffusion supports iterative image generation with minimal changes to pre-generated instances and the overall scene. By using the same initial noise and image caption, InstanceDiffusion can selectively introduce new instances, replace existing ones, reposition instances, or adjust their sizes by modifying the bounding boxes.
*(Demo video: `instdiff-iterative.mp4`)*
Our approach automates the generation of image descriptions and bounding boxes using a Large Language Model (LLM). This enhances the efficiency of the InstanceDiffusion model, which supports precise instance-level control and flexible instance location specifications. For a detailed explanation, refer to the section Leveraging LLM's for Modular Efficiency in InstanceDiffusion. The following instructions demonstrate how to set up and use our LLM submodule to generate input data for InstanceDiffusion.
conda deactivate
conda create --name instdiff_llm python=3.8 -y
conda activate instdiff_llm
pip install -r requirements_llm.txt
python src/lib/llm_submodule/chatgpt/chatgpt_generate_input.py
You will be prompted to enter your ChatGPT API key (to request an API key for an existing user account, click here). Next, specify the number of image descriptions you would like to generate. The requested image descriptions are then generated and saved in a timestamped folder under `src/lib/llm_submodule/chatgpt/chatgpt_data`; the folder name is printed to the terminal.
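Under the hood, `chatgpt_generate_input.py` queries the ChatGPT API for scene descriptions together with per-instance bounding boxes. The sketch below shows the general pattern using the `openai` Python client (version ≥ 1.0); the prompt wording, model name, and JSON parsing are illustrative assumptions, not the exact code of the submodule.

```python
import json
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # the script asks for this key interactively

# Illustrative prompt: ask for a caption plus instance-level boxes in JSON form.
prompt = (
    "Generate one image description with 2-4 objects. "
    "Return JSON with keys 'caption' and 'instances', where each instance has "
    "'caption' and 'bbox' as [x, y, w, h] on a 512x512 canvas."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# Assumes the model returned valid JSON; the real script may parse more defensively.
scene = json.loads(response.choices[0].message.content)
print(scene["caption"], scene["instances"])
```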
Next, create the images from the LLM-generated input descriptions.
./src/lib/llm_submodule/chatgpt/create_llm_images.sh
You will be prompted to enter the folder name (timestamp) where your input descriptions were saved. This ensures that the generated images correspond to the correct session. The images will now be generated.
Navigate to the output directory to view the generated images:
cd src/lib/llm_submodule/chatgpt/chatgpt_output
Inside, you will find a folder named with the same timestamp, containing:
- A visualisation of the ChatGPT-defined bounding boxes.
- The initial generated image based on the text prompt and instance-level conditions (bounding boxes), created using the main diffusion model without refinement.
- The initial image after further quality enhancement using the SDXL refiner (a rough sketch of this refinement step follows below).
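The refinement step mentioned in the last item can be reproduced, roughly, with the `diffusers` SDXL img2img refiner pipeline, as sketched below. The model identifier, prompt, and `strength` value are assumptions for illustration; the submodule's actual refinement code may be configured differently.

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image

# Load the SDXL refiner (img2img) pipeline; fp16 assumed for GPU use.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
).to("cuda")

initial = Image.open("initial_image.png").convert("RGB")
refined = refiner(
    prompt="a cat and a dog sitting on a sofa",
    image=initial,
    strength=0.3,  # low strength: keep the generated layout, sharpen details
).images[0]
refined.save("refined_image.png")
```

A low `strength` keeps the instance layout of the initial image largely intact while improving textures and fine detail.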
By following these steps, you will successfully generate and inspect images using the GPT-4 LLM submodule integrated with the InstanceDiffusion Model. To read about a quality assessment by inspection, click here. The next section will proceed to explain the evaluation of the LLM-based images using CogVLM.
To retrieve the CLIP-score between the global input text prompt and the generated image, run the following command in the root directory:
conda activate instdiff
python src/lib/llm_submodule/chatgpt/eval_clip.py
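For reference, the core of such a CLIP-score computation looks roughly like the sketch below, which uses the Hugging Face `transformers` CLIP model to score the cosine similarity between the global prompt and a generated image. The model checkpoint and the absence of any rescaling are assumptions; `eval_clip.py` may differ in its exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image_path: str) -> float:
    """Cosine similarity between the global text prompt and the generated image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).item()

print(clip_score("a cat and a dog sitting on a sofa", "output.png"))
```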
We evaluate the alignment between the generated images and the bounding boxes produced by ChatGPT using CogVLM. The generated images are fed into CogVLM, which detects and outlines the instances within them (the predicted bounding boxes). These predicted boxes represent where the InstanceDiffusion model actually placed the instances, while ChatGPT's bounding boxes, which specify where the instances were intended to be generated, serve as the ground truth.
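Comparing the two sets of boxes boils down to an intersection-over-union (IoU) computation between each predicted box and its ground-truth counterpart. Below is a minimal sketch assuming `[x, y, w, h]` boxes; how the evaluation script matches boxes and aggregates the scores may differ.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [x, y, w, h]."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# Example: CogVLM's predicted box vs. ChatGPT's ground-truth box.
print(iou([40, 220, 180, 240], [50, 210, 170, 250]))
```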
For evaluation, the MSCOCO dataset is used. First make sure the MSCOCO dataset has been downloaded using the script described in the environment installation above, and ensure the data is organized as follows:
coco/
annotations/
instances_val2017.json
images/
val2017/
000000000139.jpg
000000000285.jpg
...
Moreover, the customized `instances_val2017.json` file needs to be downloaded; in it, all images are resized to 512×512 and the corresponding masks/boxes are adjusted accordingly.
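For intuition, resizing an image to 512×512 rescales each COCO-style `[x, y, w, h]` box by the width and height ratios, which is presumably the adjustment baked into the customized annotation file. The snippet below only illustrates that arithmetic; the file itself still needs to be downloaded.

```python
def resize_box(box, orig_w, orig_h, target=512):
    """Scale a COCO [x, y, w, h] box from (orig_w, orig_h) to a target x target image."""
    sx, sy = target / orig_w, target / orig_h
    x, y, w, h = box
    return [x * sx, y * sy, w * sx, h * sy]

# A box in a 640x480 COCO image, rescaled for the 512x512 version.
print(resize_box([100, 50, 200, 150], orig_w=640, orig_h=480))
# -> [80.0, 53.33..., 160.0, 160.0]
```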
To reproduce the results from the paper, please refer to the job files in `src/scripts/jobs/reproduction`. We reproduced three categories of evaluation studies from the paper: different location formats used as input when generating images (eval_mask & eval_box), scribble- and point-based image generation (eval_PiM_point & eval_PiM_scribble), and attribute binding (eval_att_textures & eval_att_colors). All the scripts are found in the reproduction folder and can be run on the Snellius Cluster as described below.
To reproduce the results from the InstanceDiffusion paper, we ran multiple job files located in the folder `src/scripts/jobs` on the Snellius Cluster, provided by the UvA. To reproduce our results in full, run the different scripts in the following manner:
git clone https://github.com/Jellemvdl/InstanceDiffusion-extension.git
cd InstanceDiffusion-extension/
To install the requirements, the COCO dataset (to `src/lib/instancediffusion/datasets/`), and the pretrained models (to `src/lib/instancediffusion/pretrained/`), run the following:
sbatch src/jobs/install_env.job
To replicate the results from the paper, run each evaluation job in `src/scripts/jobs/reproduction` as follows:
sbatch src/scripts/jobs/reproduction/eval_box.job
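For context, the AP/AR numbers reported below follow the standard COCO detection protocol: a detector (YOLO in the paper's evaluation) is run on the generated images, and its detections are scored against the input boxes or masks. A minimal sketch of that final scoring step with `pycocotools` is shown below; the ground-truth and detection file names are placeholders, and the actual job script wraps considerably more logic around this.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detector outputs in COCO format.
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("generated_image_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")  # use "segm" for mask AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AR, etc.
```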
As part of our reproduction study, we successfully replicated the YOLO results achieved by the original authors for different location formats used as input when generating images, namely bounding boxes and instance masks:
| Method | AP<sup>box</sup> | AP<sup>box</sup><sub>50</sub> | AR<sup>box</sup> |
|---|---|---|---|
| InstanceDiffusion | 38.8 | 55.4 | 52.9 |
| Our Reproduction | 49.9 | 66.8 | 68.6 |
| Difference | +11.1 | +11.4 | +15.7 |

Table 1. Evaluating different location formats when generating images of reproduction experiments using bounding boxes as input.
| Method | AP<sup>mask</sup> | AP<sup>mask</sup><sub>50</sub> | AR<sup>mask</sup> |
|---|---|---|---|
| InstanceDiffusion | 27.1 | 50.0 | 38.1 |
| Our Reproduction | 40.8 | 63.5 | 56.0 |
| Difference | +13.7 | +13.5 | +17.9 |

Table 2. Evaluating different location formats when generating images of reproduction experiments using instance masks as input.
In the same manner we reproduced the PiM values for scribble-/point-based image generation:
| Method | Points (PiM) | Scribble (PiM) |
|---|---|---|
| InstanceDiffusion | 81.1 | 72.4 |
| Our Reproduction | 33.66 | 23.56 |
| Difference | -47.44 | -48.84 |

Table 3. Evaluating different location formats as input when generating images of reproduction experiments for points and scribbles.
Moreover, we successfully replicated the attribute binding results for colors and textures:
| Method | Acc<sup>color</sup> | CLIP<sub>local</sub> | Acc<sup>texture</sup> | CLIP<sub>local</sub> |
|---|---|---|---|---|
| InstanceDiffusion | 54.4 | 0.250 | 26.8 | 0.225 |
| Our Reproduction | 53.3 | 0.248 | 26.9 | 0.226 |
| Difference | -1.1 | -0.002 | +0.1 | +0.001 |

Table 4. Attribute binding reproduction results for color and texture.