
add PAG support #7944

Open · wants to merge 26 commits into main
Conversation

@yiyixuxu (Collaborator) commented May 14, 2024

Notes on implementation

separate pipeline class

We created a separate pipeline group for PAG so that we can support it (and more such features in the future) while keeping our SD and SDXL pipelines lightweight for the research community.

PAGMixin

PAGMixin abstracts away all PAG-related logic so that the PAG pipelines keep a structure consistent with the rest of the pipelines. This makes them easier to read, and also easier to integrate and maintain.

AutoPipeline API

  • You can pass enable_pag=True to automatically create a pipeline with PAG enabled, based on the task you specified and the checkpoint you provided. Under the hood, it creates the corresponding PAG pipeline. A few examples:
# SDXL + PAG:
AutoPipelineForText2Image.from_pretrained(repo_id, enable_pag=True, ...)

# SDXL + ControlNet + PAG:
AutoPipelineForText2Image.from_pretrained(repo_id, controlnet=controlnet, enable_pag=True, ...)

# SDXL Inpainting + PAG:
AutoPipelineForInpainting.from_pretrained(repo_id, enable_pag=True, ...)
  • The from_pipe API also works, and works intuitively (I hope). A few examples:
# StableDiffusionXLControlNetPipeline to StableDiffusionXLPAGPipeline
AutoPipelineForText2Image.from_pipe(pipe_controlnet, controlnet=None, enable_pag=True)

# StableDiffusionXLPAGPipeline to StableDiffusionXLPipeline:
AutoPipelineForText2Image.from_pipe(pipe_pag, enable_pag=False)

pag_applied_layers

  • you can set pag_applied_layers when you create the pipeline, e.g.
AutoPipelineForText2Image.from_pretrained(repo_id, enable_pag=True, pag_applied_layers=["down.block_1", "up.block_0.attentions_0"])
  • you can use set_pag_applied_layers to update these layers after the pipeline has been created
pipe_pag.set_pag_applied_layers("mid")
  • The accepted value for set_pag_applied_layers is either a single string or a list of strings. You can:
    • apply PAG to all the down, mid, or up blocks with inputs such as "down", "mid", "up"
    • apply PAG to specific down, mid, or up blocks with inputs such as "down.block_0", "up.block_1"
    • apply PAG to specific attention modules within specific blocks with inputs such as "down.block_0.attentions_0"
  • You will get an error if an input is not formatted correctly or the layer you specified does not exist in the model. A short sketch exercising all three forms follows this list.
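For illustration, a minimal sketch of the three granularities (assuming pipe_pag was created with enable_pag=True as in the examples above):

# all mid blocks
pipe_pag.set_pag_applied_layers("mid")
# specific down/up blocks
pipe_pag.set_pag_applied_layers(["down.block_0", "up.block_1"])
# a single attention module within a block
pipe_pag.set_pag_applied_layers("down.block_0.attentions_0")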

other notes:

  • I refactored prepare_ip_adapter_image_embeds a little so that we duplicate inputs for CFG only once, at the end; that's why so many files changed. You only need to look at the pag folder and the auto_pipeline.py file under the pipelines folder when reviewing this PR. A rough sketch of the refactor idea follows.
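For context, a minimal sketch of the refactor idea (not the actual diffusers implementation; encode here is a hypothetical per-image encoder returning positive and negative embeds):

import torch

def prepare_image_embeds_sketch(encode, images, do_classifier_free_guidance):
    # collect positive and negative embeds per IP-Adapter image
    embeds, negative_embeds = [], []
    for image in images:
        pos, neg = encode(image)  # hypothetical encoder
        embeds.append(pos)
        negative_embeds.append(neg)
    if do_classifier_free_guidance:
        # duplicate inputs for CFG only once, at the end
        embeds = [torch.cat([neg, pos]) for neg, pos in zip(negative_embeds, embeds)]
    return embeds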

Usage Examples

SDXL + PAG

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["down.block_2", "up.block_1.attentions_0"],
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()


pag_scales = [0.0, 3.0]
guidance_scales = [0.0, 7.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for guidance_scale in guidance_scales:
        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=guidance_scale,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(guidance_scales)).save("yiyi_test_2_out.png")

[result grid: yiyi_test_2_out.png]

SDXL + PAG + IP-Adapter

Works with IP-Adapter now, thanks to @sunovivid.

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    enable_pag=True,
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")

pag_scales = [0.0, 3.0]
ip_adapter_scales = [0.0, 0.6]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.set_ip_adapter_scale(ip_adapter_scale)
        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("yiyi_test_4_out.png")

[result grid: pag_ip]

SDXL Inpainting + PAG

# pag integration test: inpaint
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    # pag_applied_layers=["down.block_2", "up.block_1.attentions_0"],
    pag_applied_layers="mid",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"


pag_scales = [0.0, 3.0]
guidance_scales = [0.0, 7.5]

grid = []
for pag_scale in pag_scales:
    for guidance_scale in guidance_scales:
        generator = torch.Generator(device="cpu").manual_seed(1)
        images = pipeline(
            prompt=prompt,
            image=init_image,
            mask_image=mask_image,
            strength=0.8,
            num_inference_steps=50,
            guidance_scale=guidance_scale,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(guidance_scales)).save("yiyi_test_4_out.png")

[result grid: yiyi_test_4_out.png]

SDXL + ControlNet + PAG

# pag integration test: controlnet + pag
from diffusers import AutoPipelineForText2Image, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image
import numpy as np
import torch

import cv2
from PIL import Image

# initialize the models and pipeline
controlnet_conditioning_scale = 0.5  # recommended for good generalization
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    enable_pag=True,
    pag_applied_layers = "mid",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
print(f" pipeline: {pipeline.__class__}")


# download an image
image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)
# get canny image
image = np.array(image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"


pag_scales = [0.0, 3.0]
guidance_scales = [0.0, 7.5]


grid = []
for pag_scale in pag_scales:
    for guidance_scale in guidance_scales:
        generator = torch.Generator(device="cpu").manual_seed(1)
        images = pipeline(
            prompt=prompt,
            controlnet_conditioning_scale=controlnet_conditioning_scale,
            image=canny_image,
            num_inference_steps=50,
            guidance_scale=guidance_scale,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(guidance_scales)).save("yiyi_test_5_out.png")

[result grid: yiyi_test_5_out.png]


@yiyixuxu (Collaborator Author)

@asomoza can you test it out?
I tried to make it work with IP-Adapter, but I don't think it works. Do you know if PAG works with IP-Adapter?
What other pipelines should I add this to for testing?

@yiyixuxu (Collaborator Author)

cc @HyoungwonCho for awareness
Also, a question: does PAG work with IP-Adapter?

@yiyixuxu added the contributions-welcome and help wanted labels on May 15, 2024
@asomoza (Member) commented May 15, 2024

I've been doing some tests and I like it a lot.

[comparison images: no PAG vs. PAG + CFG]

I think it makes the robot more coherent and it fixes some of the wrong details, but it makes it less "humanoid" and loses a bit of the cinematic look.

I'm still deciding whether I'd prefer a layer/block naming scheme like the one used for LoRAs and IP-Adapter, or whether pag_applied_layers and pag_applied_layers_index is better. I'll give some examples to evaluate this.

So let's say I want to test it with what I normally use for the pose in LoRAs, which is all the layers in down block 2; with the current system I need to do this:

pag_applied_layers_index = ["d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]`

the equivalent could be this:

pag_applied_layers = {"down": ["block_2"]}

or for example the last attention block which is what we can associate to the composition with IP Adapters:

pag_applied_layers_index = ["d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]`

for this, an equivalent could be:

pag_applied_layers = {"down": "block_2": "attentions_1"}
[comparison images: down_block_2 vs. down_block_2_attentions_1]

I don't know if going as granular as individual layers brings any benefit; even someone like me who likes full control won't go as far as trying to control an image with 70 different layers on top of everything else.

As an example, as an advanced user, I want to use PAG to make the image better but without the robot losing its humanoid form and the cinematic look.

Doing some quick tests, I found that for this particular image, this works really well:

pipeline.enable_pag(
    pag_scale=3.0,
    pag_applied_layers=None,
    pag_applied_layers_index=[
        "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "u0", "u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9",
    ],
)

which in the lora format would be like this:

pag_applied_layers = {"down": "block_2", "up": "block_1": "attentions_0"}

[result image: 20240515025029_925590493]

Hope this example is somewhat clear. We can also see that the choice of layers matters a lot; the image is much better with this.

I'll do tests with the other use cases later, especially with the upscaler.

@HyoungwonCho (Contributor) commented May 15, 2024

@yiyixuxu @asomoza Hello, I was impressed by the various experiments you conducted using PAG!
We are also discussing the use of PAG in various tasks, as well as layer/scale selection.

Since the guidance framework of PAG itself is simple, it seems quite possible to use it in conjunction with other modules like the IP-Adapter you mentioned. However, we have not yet implemented and experimented with it directly, so we have not confirmed whether there is a significant performance improvement when used together. If possible, we will conduct additional experiments in the future.

Thank you for your interest in our research.

@KKIEEK commented May 15, 2024

Thank you for the great work!
However, I encountered the following issue when using StableDiffusionXLControlNetPipeline with CFG and PAG:

  File ".../.env/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 798, in forward
    sample = sample + controlnet_cond
             ~~~~~~~^~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

I solved it by adding a new parameter, do_perturbed_attention_guidance, and appending the following lines to the prepare_image method.

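        # with CFG and PAG enabled together, the latent batch is tripled
        # (negative, positive, perturbed), so the conditioning image must match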
        if do_classifier_free_guidance and do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 3)
        elif do_classifier_free_guidance and not guess_mode:
            image = torch.cat([image] * 2)
        elif do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 2)

@yiyixuxu (Collaborator Author)

@KKIEEK
Thanks! I added your change :)

@jorgemcgomes (Contributor) commented May 16, 2024

Just leaving a brief report of my findings with PAG and Diffusers (I already had it integrated in my pipelines before this PR):

  • It generally works very very well when properly tuned. Almost looks like a significant model upgrade.
  • I'm using it with models derived from SD2.1.
  • Implemented it successfully in text-to-image, image-to-image, controlnet, unclip, and inpainting pipelines.
  • I get the best results with values around guidance_scale=7 and pag_scale=3.
  • The layers to which it is applied make a huge difference in the output; it's the difference between garbage and excellent. Adding or removing a single layer can make or break it.
  • For example, for SD2.1, I found that with just [m0] the effect was too subtle, [d4, d5, m0] was overcooked, and [d5, m0] seemed to work best; adding any up layers (e.g. [d5, m0, u0]) typically screws up the results.
  • The applied layers will obviously change across model architectures, and I imagine the "optimal" layers might even change with fine-tunes. I couldn't replicate the optimal parameters described in the paper (for SD1.5) with SD2.1 (which has the same UNet architecture).

@yiyixuxu (Collaborator Author)

@jorgemcgomes thanks!

@sunovivid

Hello. I'm an author of PAG. Thank you for your insightful opinions and cool implementation. Is there anything currently in progress? We are excited to see that PAG is gaining popularity within the community and being utilized in various workflows. Especially in ComfyUI, PAG nodes are used in diverse workflows.

(Some workflows using PAG in ComfyUI:
https://www.reddit.com/r/StableDiffusion/comments/1c68qao/perturbedattention_guidance_really_helps_with/
https://civitai.com/models/141592/pixelwave
https://civitai.com/models/413564/cjs-super-simple-high-detail-cosxl-and-pag-workflow
https://www.reddit.com/r/StableDiffusion/comments/1c4cb3l/improve_stable_diffusion_prompt_following_image/
https://www.reddit.com/r/StableDiffusion/comments/1ck69az/make_it_good_options_in_stable_diffusion/
https://stable-diffusion-art.com/perturbed-attention-guidance/)

However, in Diffusers, it seems somewhat challenging to try creative combinations as the pipelines are separated.
( a collection of PAG pipelines with Diffusers: https://x.com/multimodalart/status/1788844183760847106 )

Therefore, the Mixin approach taken in this PR appears to be a very effective solution. However, it seems a bit awkward to call enable_pag every time just to adjust the PAG scale. It would be more natural to set pag_scale when calling the pipeline after enable_pag (similar to passing ip_adapter_image=image at call time after load_ip_adapter), as sketched below. So I'm exploring a better design for this.
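For illustration, a hypothetical sketch of the usage pattern being proposed (not the current API):

pipeline.enable_pag(pag_applied_layers="mid")         # configure PAG once
image = pipeline(prompt, pag_scale=3.0).images[0]     # adjust the scale per call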

Additionally, since many users want compatibility with IP-Adapter, and I now have time, I would like to work on making PAG compatible with IP-Adapter. I'm curious whether there's any related progress on component design or IP-Adapter compatibility.

Thank you!

@yiyixuxu (Collaborator Author)

@sunovivid thanks for the message!
this is not the finalized design, just something we can use to test out PAG compatibility; we will iterate on the final design

for IP-Adapter, it will be super cool if we can make it work! I'm not aware of any related progress, so I would really appreciate it if you are able to find time to work on this! Maybe we can just pick one of the pipelines from this PR (with the mixin) and make it work with the ip_adapter_image input?

@sunovivid commented Jun 2, 2024

@yiyixuxu Hi! I made a working version of PAG + IP-adapter. Can you check the PR?

@yiyixuxu (Collaborator Author) commented Jun 3, 2024

@sunovivid we will merge it in and work on a new design for PAG once you upload the new changes for IP-Adapter :)

for pag_applied_layers:

  1. I think we should use the LoRA format; let me know what you think @sunovivid. See @asomoza's comments and experiments in add PAG support #7944 (comment); you can also find more about the scale dict we support in IP-Adapter and LoRA here and here.
  2. Is pag_applied_layers something we would want to change a lot across generations? I.e., can we make it a pipeline config/attribute instead of a call argument? I think we will have to make pag_scale a call argument.

@sunovivid commented Jun 4, 2024

Hi @yiyixuxu,

Thank you for the feedback!

I might have misunderstood something. Should I upload the new changes for the ip-adapter in this PR? How can I upload the changes? Should I attach files or use another approach?

for pag_applied_layers:

  1. Completely agree! For user convenience, the overall code should consistently follow the conventions used in the Diffusers codebase.
  2. I believe once the best choice for pag_applied_layers is determined per model through experiments (like the great example you provided in @asomoza's comment), it likely won't need frequent changes. Users will likely follow the recommended approach for each model. I also agree that pag_scale should be a call argument.

@@ -508,6 +508,9 @@ def encode_image(self, image, device, num_images_per_prompt, output_hidden_state
def prepare_ip_adapter_image_embeds(
self, ip_adapter_image, ip_adapter_image_embeds, device, num_images_per_prompt, do_classifier_free_guidance
):
image_embeds = []
@yiyixuxu (Collaborator Author) commented on this diff:

I refactored this method a little bit. This test runs:

import torch

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image


pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
)

pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name=[
        "ip-adapter_sdxl_vit-h.safetensors",
        "ip-adapter-plus_sdxl_vit-h.safetensors",
        "ip-adapter-plus-face_sdxl_vit-h.safetensors",
    ],
    image_encoder_folder="models/image_encoder",
)
pipeline.set_ip_adapter_scale([0.1, 0.7, 0.3])
pipeline.to("cuda")

face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]

prompt = "wonderwoman"
num_images_per_prompt = 1
guidance_scale = 7.5
do_classifier_free_guidance = guidance_scale > 1

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipeline(
    prompt=prompt,
    ip_adapter_image=[face_image, style_images, face_image],
    negative_prompt="",
    guidance_scale=guidance_scale,
    num_images_per_prompt=num_images_per_prompt,
    generator=generator,
).images[0]

image.save("yiyi_test_12_out_imgs.png")

with torch.no_grad():
    image_embeds = pipeline.prepare_ip_adapter_image_embeds(
        [face_image, style_images, face_image],
        None,
        "cuda",
        num_images_per_prompt,
        do_classifier_free_guidance,
    )

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipeline(
    prompt=prompt,
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="",
    guidance_scale=guidance_scale,
    num_images_per_prompt=num_images_per_prompt,
    generator=generator,
).images[0]
image.save("yiyi_test_12_out_img_embeds.png")

[result image: yiyi_test_12_out_img_embeds.png]

@yiyixuxu (Collaborator Author)

@HyoungwonCho @sunovivid
this PR is ready for a final review now! I would appreciate it if you could also take a look!
I updated the PR description #7944 (comment)

@yiyixuxu (Collaborator Author) commented Jun 10, 2024

cc @apolinario and @vladmandic

we plan to support more popular features like PAG in diffusers, so design-wise, this PR sets the example for future PRs. Would appreciate your inputs too :)

@vladmandic (Contributor)

thanks @yiyixuxu

from a quick glance, new "magic" is mostly in src/diffusers/pipelines/auto_pipeline.py triggered on kwargs.

PAG itself is still a separate pipeline and can be used as a separate pipeline; it's just that autopipeline will do the automatic switching if enable_pag is in kwargs:

orig_class_name = orig_class_name.replace("Pipeline", "PAGPipeline")
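for reference, a quick illustration of what that replace produces (plain Python string ops):

"StableDiffusionXLControlNetPipeline".replace("Pipeline", "PAGPipeline")
# -> 'StableDiffusionXLControlNetPAGPipeline'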

i'm ok with that; one potential issue is propagation of future fixes - e.g. if a fix lands somewhere in StableDiffusionPipeline and autopipeline does a behind-the-scenes switch to StableDiffusionPAGPipeline, then we really need to ensure there are no regressions there, since the user is not even explicitly aware of that switch

just not sure about the mappings using string replace - ok for PAG, but would this pattern apply universally?

text_2_image_cls.__name__.replace("PAG", "").replace("Pipeline", "PAGPipeline"),
