Hi thanks for your amazing work!
I am confused by the subtraction operation image_features minus text_features. The image features are encoded by
CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. Why, then, can we directly compute image_features minus text_features? It seems that the image features and text features are not in the same embedding space.
The SD pipeline's `encode_prompt` is [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302).
Let me clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and the pooled feature used as text_features comes only from this second encoder, so it is already projected into the shared embedding space. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code its text_feature is extracted manually as here.
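To illustrate why the projection matters, here is a minimal sketch (with stand-in random weights and assumed dimensions, not the actual CLIP checkpoint): `CLIPTextModel` returns a pooled output in the text encoder's hidden space, while `CLIPVisionModelWithProjection` returns features already mapped into the shared CLIP embedding space, so the pooled text output must pass through a `text_projection`-style linear map before the subtraction is meaningful.

```python
import torch

# Assumed dimensions for CLIP ViT-L/14; stand-in values for illustration only.
hidden_dim = 768   # text encoder hidden size
embed_dim = 768    # shared CLIP embedding size

# Stand-ins for the real model outputs:
pooled_text = torch.randn(1, hidden_dim)     # like CLIPTextModel's pooler_output
image_features = torch.randn(1, embed_dim)   # like CLIPVisionModelWithProjection's image_embeds

# CLIP's text_projection is a bias-free linear map; random weights stand in here.
text_projection = torch.nn.Linear(hidden_dim, embed_dim, bias=False)

# Project the pooled text feature into the shared space, then subtract.
text_features = text_projection(pooled_text)
delta = image_features - text_features
print(tuple(delta.shape))  # (1, 768)
```

With the real models, this projection step is exactly what the manual extraction in the SD1.5 inference code replicates, since `CLIPTextModel` does not apply it for you.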
Hope this helps. Please let me know if you have further questions.