Hi thanks for your amazing work!
I am confused by the subtraction operation image_features minus text_features. The image features are encoded by
CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. Why, then, can we directly compute image_features minus text_features? It seems that the image features and text features are not in the same embedding space.
The SD pipeline's `encode_prompt` is [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302).
Let me clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and the pooled feature used as text_features comes only from this second encoder, so it is already projected into the shared embedding space. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code its text_feature is extracted manually as here.
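To illustrate why the projection matters, here is a minimal sketch (with stand-in random weights and assumed dimensions, not the actual CLIP checkpoint): `CLIPTextModel` returns a pooled output in the text encoder's hidden space, while `CLIPVisionModelWithProjection` returns features already mapped into the shared CLIP embedding space, so the pooled text output must pass through a `text_projection`-style linear map before the subtraction is meaningful.

```python
import torch

# Assumed dimensions for CLIP ViT-L/14; stand-in values for illustration only.
hidden_dim = 768   # text encoder hidden size
embed_dim = 768    # shared CLIP embedding size

# Stand-ins for the real model outputs:
pooled_text = torch.randn(1, hidden_dim)     # like CLIPTextModel's pooler_output
image_features = torch.randn(1, embed_dim)   # like CLIPVisionModelWithProjection's image_embeds

# CLIP's text_projection is a bias-free linear map; random weights stand in here.
text_projection = torch.nn.Linear(hidden_dim, embed_dim, bias=False)

# Project the pooled text feature into the shared space, then subtract.
text_features = text_projection(pooled_text)
delta = image_features - text_features
print(tuple(delta.shape))  # (1, 768)
```

With the real models, this projection step is exactly what the manual extraction in the SD1.5 inference code replicates, since `CLIPTextModel` does not apply it for you.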
Hope this helps. Please let me know if you have further questions.