Thank you for your excellent work on this project! I’ve come across some discussions about unconditional embeddings, but I have a specific question regarding their implementation.
What is the difference between setting the CLIP image embeddings directly to zero tensors and passing a black image (or a noised one) through the CLIP vision model? More specifically, what semantic meaning (if any) is encoded in the zeroed image embeddings?
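To make the question concrete, here is a minimal sketch of the two approaches I'm comparing, using `transformers` and `openai/clip-vit-large-patch14` purely as an illustrative encoder (the model id and the 224×224 size are just assumptions for the example, not necessarily what this project uses):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Example encoder for illustration only; swap in the project's actual encoder.
model_id = "openai/clip-vit-large-patch14"
image_encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

# Approach 1: use an all-zero tensor directly as the unconditional embedding.
embed_dim = image_encoder.config.projection_dim
zero_embeds = torch.zeros(1, embed_dim)

# Approach 2: actually encode a black image through the CLIP vision model.
# Note that the processor's normalization maps black pixels to non-zero values,
# and the ViT still applies positional embeddings, attention, LayerNorm, etc.,
# so the result is a valid point in CLIP embedding space rather than the origin.
black_image = Image.new("RGB", (224, 224), (0, 0, 0))
inputs = processor(images=black_image, return_tensors="pt")
with torch.no_grad():
    black_embeds = image_encoder(**inputs).image_embeds

# The two unconditional embeddings generally differ.
print(torch.allclose(zero_embeds, black_embeds))  # typically False
```

My understanding is that the black-image embedding is something the model has (at least implicitly) seen during training, whereas the zero tensor is not the encoding of any real image, which is why I'm curious what semantics, if any, the zeroed embedding carries.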