Controllable Text-to-Image Generation with Customized Guidance on Appearance and Position (Stable Diffusion)
This implementation controls a specific target mentioned in the prompt, focusing on its appearance and position. It lets you transfer the target's appearance from one generated image to a new generated image, and specify where the target appears using box coordinates. By extracting features from the cross-attention layers and using them as guidance, the method works without any model training or fine-tuning, unlike LoRA and ControlNet. It relies entirely on the cross-attention mechanism already present in Stable Diffusion for accurate and efficient feature transfer and positioning.
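For intuition, here is a minimal, hypothetical sketch of how such cross-attention guidance is typically wired into a denoising step; the names `energy_fn`, `attn_store`, and `scale` are illustrative placeholders, not the exact API used in the notebook:

```python
import torch

def guided_step(latents, t, unet, text_emb, energy_fn, attn_store, scale=30.0):
    """One guided denoising step: back-propagate an attention-based energy
    to the noisy latent before the scheduler update (illustrative sketch)."""
    latents = latents.detach().requires_grad_(True)
    # The edited UNet is assumed to record its cross-attention maps in attn_store
    # during this forward pass.
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    energy = energy_fn(attn_store)  # e.g. an appearance or layout loss
    grad = torch.autograd.grad(energy, latents)[0]
    # Nudge the latent against the energy gradient; the scheduler then uses
    # noise_pred and the shifted latent as usual.
    return noise_pred, latents.detach() - scale * grad
```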
Given a single reference image, the method can largely retain the details of the target concept.
The method can also be applied to attributes other than appearance, such as position and size; the only difference is the design of the loss function (energy function).
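As one concrete example, a position (layout) energy can simply measure how much of the target token's cross-attention mass falls inside a user-specified box. The sketch below assumes a normalized box `(x0, y0, x1, y1)` and a single-token attention map; it illustrates the idea rather than reproducing the exact loss in the notebook:

```python
import torch

def box_energy(attn_map: torch.Tensor, box) -> torch.Tensor:
    """Lower energy when most of the token's attention mass falls inside the box.
    attn_map: (H, W) cross-attention map for the target token.
    box: (x0, y0, x1, y1) in normalized [0, 1] coordinates."""
    H, W = attn_map.shape
    x0, y0, x1, y1 = box
    inside = attn_map[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)].sum()
    total = attn_map.sum() + 1e-8
    return (1.0 - inside / total) ** 2
```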

It is also easy to learn multiple reference images for separate targets and combine them into one generated image.
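One way to combine several targets is to sum their per-target energies into a single guidance objective, as sketched below. The helpers `attn_store.get_map` and `appearance_energy` are hypothetical placeholders for the notebook's actual attention accessor and appearance loss:

```python
def combined_energy(attn_store, targets):
    """Sum appearance and layout energies over all targets (illustrative sketch).
    targets: list of dicts with 'token_idx', 'reference_feat', and an optional 'box'."""
    total = 0.0
    for tgt in targets:
        attn = attn_store.get_map(token_idx=tgt["token_idx"])            # hypothetical accessor
        total = total + appearance_energy(attn, tgt["reference_feat"])    # hypothetical loss
        if tgt.get("box") is not None:
            total = total + box_energy(attn, tgt["box"])
    return total
```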

The .ipynb notebook contains the pipeline, and the edited UNet is in the my_model directory. Please load the UNet from my_model rather than from the diffusers package. See Poster.pdf for details.
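Loading the edited UNet with diffusers looks roughly like this; the base checkpoint id and dtype below are illustrative and should match whatever the notebook uses:

```python
import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Load the edited UNet from the local my_model directory, not the stock checkpoint.
unet = UNet2DConditionModel.from_pretrained("my_model", torch_dtype=torch.float16)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative base model
    unet=unet,
    torch_dtype=torch.float16,
).to("cuda")
```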
The code builds on the implementations of the following papers:
Training-Free Layout Control with Cross-Attention Guidance.
Diffusion Self-Guidance for Controllable Image Generation.
Prompt-to-Prompt Image Editing with Cross-Attention Control.