Description
Hi,
I'm currently trying to fine-tune from the ip-adapter-plus_sdxl_vit-h.safetensors checkpoint and have run into a shape mismatch error related to the image encoder output.
When I load the encoder from:
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder"
)
This encoder returns image_embeds with shape [1, 1024]. However, the Resampler in ip-adapter-plus_sdxl_vit-h expects input whose last dimension is 1280, so passing image_embeds to it raises this runtime error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1024 and 1280x1280)
From the naming, I assumed that ip-adapter-plus_sdxl_vit-h would expect embeddings from the ViT-H encoder (projection_dim=1024), but the Resampler checkpoint appears to expect 1280-dimensional input, which would match an encoder with projection_dim=1280 (perhaps ViT-bigG?).
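To narrow down where a 1280-wide input could come from, here is a rough sketch that lists which encoder outputs have that width. The widths in the table are quoted from memory of the public CLIP configs, so treat them as assumptions and verify them against each model's config.json. `ENCODERS` and `matching_outputs` are hypothetical names used only for this illustration.

```python
# Assumed widths (please verify against each model's config.json):
#   hidden_size    = width of the per-token hidden states
#   projection_dim = width of the pooled, projected image_embeds
ENCODERS = {
    "ViT-H-14": {"hidden_size": 1280, "projection_dim": 1024},
    "ViT-bigG-14": {"hidden_size": 1664, "projection_dim": 1280},
}


def matching_outputs(expected_in_features):
    """Return (encoder, output) pairs whose last dimension equals expected_in_features."""
    matches = []
    for name, dims in ENCODERS.items():
        if dims["projection_dim"] == expected_in_features:
            matches.append((name, "image_embeds"))
        if dims["hidden_size"] == expected_in_features:
            matches.append((name, "hidden_states"))
    return matches


# The error message implies the first Resampler layer takes 1280 input features.
candidates = matching_outputs(1280)
for name, output in candidates:
    print(name, output)
```

Under these assumed widths there are two candidates: the hidden states of ViT-H-14, or the image_embeds of ViT-bigG-14. If the plus variant feeds the Resampler the penultimate hidden states (requested with output_hidden_states=True and read from hidden_states[-2]) rather than image_embeds, then the ViT-H encoder would be consistent with the checkpoint, but I have not confirmed this.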
Which exact image encoder model (ViT-H-14 or ViT-bigG-14) should be used with ip-adapter-plus_sdxl_vit-h.safetensors?