Which image encoder to use with ip-adapter-plus_sdxl_vit-h.safetensors?

Hi,

I'm currently trying to fine-tune using the ip-adapter-plus_sdxl_vit-h.safetensors checkpoint and encountered a shape mismatch error related to the image encoder output.
When I load the encoder from:
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder"
)

This encoder gives image_embeds with shape [1, 1024]. However, the Resampler in ip-adapter-plus_sdxl_vit-h expects an input shape of [1, 1280], which causes this runtime error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1024 and 1280x1280)
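
The error can be reproduced without the checkpoint. This is a minimal sketch, assuming the failing layer is a 1280-to-1280 linear projection; the name `proj_in` is a hypothetical stand-in inferred only from the `1280x1280` in the message, not taken from the checkpoint itself:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the first linear layer the embeddings reach.
proj_in = nn.Linear(1280, 1280)

# Pooled embedding with the shape reported above: [1, 1024].
image_embeds = torch.randn(1, 1024)

error = None
try:
    proj_in(image_embeds)
except RuntimeError as exc:
    # The 1024-wide input cannot be multiplied with the 1280-wide weight.
    error = exc

print(type(error).__name__, error)
```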

From the naming, I assumed that ip-adapter-plus_sdxl_vit-h would expect embeddings from the ViT-H encoder (projection_dim=1024), but it seems the Resampler checkpoint expects a projection_dim=1280 (perhaps from ViT-bigG?).

Which exact image encoder model (ViT-H-14 or ViT-bigG-14) should be used with ip-adapter-plus_sdxl_vit-h.safetensors?
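
One thing that may be worth checking: in `transformers`, `image_embeds` is the pooled output after the projection head (width `projection_dim`), while the per-token hidden states have width `hidden_size`. For ViT-H these differ (1024 vs. 1280), so a layer expecting 1280 could be matching ViT-H's hidden states rather than a different encoder. The sketch below uses a small, randomly initialized model with ViT-H-like widths, not the real weights. The layer count and `intermediate_size` are shrunk for speed, and reading `hidden_states[-2]` is an assumption about what the Plus Resampler consumes, not something confirmed in this thread:

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection

# ViT-H-like widths; depth and MLP size are reduced so this runs quickly.
config = CLIPVisionConfig(
    hidden_size=1280,        # width of per-token hidden states
    projection_dim=1024,     # width of the pooled, projected image_embeds
    num_attention_heads=16,
    num_hidden_layers=2,     # assumption: shrunk, the real model is deeper
    intermediate_size=256,   # assumption: shrunk for speed
    image_size=224,
    patch_size=14,
)
model = CLIPVisionModelWithProjection(config).eval()

# Dummy image batch: 1 image, 3 channels, 224x224.
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    out = model(pixel_values=pixel_values, output_hidden_states=True)

pooled = out.image_embeds            # pooled + projected embedding
penultimate = out.hidden_states[-2]  # per-token features before the last layer

print(pooled.shape)       # torch.Size([1, 1024])
print(penultimate.shape)  # torch.Size([1, 257, 1280])
```

If that is the cause, passing `hidden_states[-2]` from the same `models/image_encoder` checkpoint instead of `image_embeds` would give a last dimension of 1280.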
