Which image encoder to use with ip-adapter-plus_sdxl_vit-h.safetensors? #470

@EPinzuti

Hi,

I'm currently trying to fine-tune using the ip-adapter-plus_sdxl_vit-h.safetensors checkpoint and encountered a shape mismatch error related to the image encoder output.
When I load the encoder from:
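from transformers import CLIPVisionModelWithProjection

# Load the image encoder shipped in the h94/IP-Adapter repository
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder"
)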

This encoder gives image_embeds with shape [1, 1024]. However, the Resampler in ip-adapter-plus_sdxl_vit-h expects an input shape of [1, 1280], which causes this runtime error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1024 and 1280x1280)

From the naming, I assumed that ip-adapter-plus_sdxl_vit-h would expect embeddings from the ViT-H encoder (projection_dim=1024), but it seems the Resampler checkpoint expects a projection_dim=1280 (perhaps from ViT-bigG?).
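To make the dimension question concrete, here is a minimal sketch that tabulates the two candidate encoders and lists which output of which encoder is 1280 wide. The widths are the published CLIP ViT-H/14 and ViT-bigG/14 vision-tower sizes as I understand them. The names `EncoderDims`, `ENCODERS`, and `candidates` are hypothetical helpers for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderDims:
    hidden_size: int     # width of the per-token hidden states
    projection_dim: int  # width of the pooled, projected image_embeds


# Assumed vision-tower dimensions (verify against each model's config.json).
ENCODERS = {
    "ViT-H-14": EncoderDims(hidden_size=1280, projection_dim=1024),
    "ViT-bigG-14": EncoderDims(hidden_size=1664, projection_dim=1280),
}


def candidates(expected_dim):
    """Return (encoder, output) pairs whose last dimension equals expected_dim."""
    matches = []
    for name, dims in ENCODERS.items():
        if dims.hidden_size == expected_dim:
            matches.append((name, "hidden_states"))
        if dims.projection_dim == expected_dim:
            matches.append((name, "image_embeds"))
    return matches


# The Resampler weight in the error message is 1280 wide.
print(candidates(1280))
```

Under these assumptions, 1280 matches either the projected image_embeds of ViT-bigG-14 or the unprojected hidden states of ViT-H-14. If the checkpoint follows the reference IPAdapterPlus code, which as far as I can tell feeds the Resampler the penultimate hidden states (output_hidden_states=True, then hidden_states[-2]) rather than image_embeds, the ViT-H reading would be consistent with the file name. That is not confirmed in this thread.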

Which exact image encoder model (ViT-H-14 or ViT-bigG-14) should be used with ip-adapter-plus_sdxl_vit-h.safetensors?
