Extrapolation of InstantID Core Idea to Arbitrary Visual Details #77
-
Hah, I swear I didn't see this when typing this discussion up, but this paper was apparently released just today: CreativeSynth (arxiv, repo). It does something pretty similar to what I described above, but is focused more on artistic reinterpretation of information from multiple images rather than on the fidelity of specific visual details.

Edit: BootPIG (arxiv), also released today, is in the same vein: injecting the specific visual details of a reference image into another image (kind of, relying on the text prompt to initialize the starting image). They don't use cross-attention; instead, they stack the projected key/value spaces of the reference image and the denoising image when computing the attention matrix (see the toy sketch below).

Edit 2: Actually, AnyDoor (arxiv, repo) is extremely close to what I'm describing above. They use DINOv2 as their visual detail encoder and then inject those details directly in place of the text embedding.
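Since the "stacked key/value" trick might not be obvious from that one-line description, here is a toy sketch of how I read it. This is my own illustrative code with made-up shapes (and I'm glossing over details such as whether the reference tokens get their own projection weights), not BootPIG's implementation:

```python
import torch

d = 64
to_q = torch.nn.Linear(d, d)
to_k = torch.nn.Linear(d, d)
to_v = torch.nn.Linear(d, d)

x_tokens = torch.randn(1, 256, d)    # tokens of the image being denoised
ref_tokens = torch.randn(1, 256, d)  # tokens of the reference image

q = to_q(x_tokens)
k = torch.cat([to_k(x_tokens), to_k(ref_tokens)], dim=1)  # stacked keys
v = torch.cat([to_v(x_tokens), to_v(ref_tokens)], dim=1)  # stacked values

attn = torch.softmax(q @ k.transpose(1, 2) * d**-0.5, dim=-1)  # (1, 256, 512)
out = attn @ v  # (1, 256, 64): denoising tokens now also mix in reference details
```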
-
Some context for why I'm posting here: Diffusers discussion
Hello everyone!
Just wanted to say that InstantID looks super impressive, amazing job.
I've been looking into how I might solve a similar-ish problem. From the post above, the main summary idea is the following: my overall goal is to produce a generative image model that, during inference, takes in

- a starting image, and
- some conditional information (for example, a reference image carrying specific visual details),

and then outputs a new image (which I'll call the target image) that looks very similar to the starting image but has been changed by the conditional information, injected in the "correct" way learned during training.
InstantID does some version of this, but it still uses text as the main driver of the surrounding image generation and style, and it applies only to facial details. I'm going to summarize some of the InstantID work in my own words, so that I can more easily make the comparison when explaining what I'm after.
In the paper, you motivate this work with the following:
However, a notable challenge remains: generating customized images that accurately preserve the intricate identity details of human subjects. This task is particularly demanding, as human facial identity (ID) involves more nuanced semantics and requires a higher standard of detail and fidelity compared to general styles or objects, which primarily focus on coarse-grained textures and colors. Existing vanilla text-to-image models, dependent on detailed textual descriptions, fall short in achieving strong semantic relevance in customized generation.

This can generally be summarized as "words alone are not sufficient to adequately capture the details necessary to reproduce a person's face." That makes sense: humans have evolved to pay particular attention to very subtle differences in human faces, for social and communication reasons. So the specific facial details that make one person's face that particular person's face (not just ANY face) generally cannot be captured in words, or rather it would be exceedingly difficult to do so.

Okay, so how do we fix this? One way is to embed the face into some space that DOES adequately describe the essential features of the face (unlike text space, or even a shared image-text space like CLIP embedding space). Lucky for the authors, this is generally a solved problem! There are off-the-shelf solutions for embedding facial-identity features in a space that we KNOW is generally sufficient to differentiate between faces. So the next step is: how do we inform the model of these very important identity features? This is arguably the key question and challenge, as it is bolded in the paper:
How do we effectively inject the identity features into the diffusion models?
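Before getting to injection, to make the "off-the-shelf" part concrete: by a face embedding I mean something like the following rough sketch using the insightface package. The model bundle name and file path are placeholders on my part; whichever face encoder InstantID actually ships with would slot in the same way.

```python
import cv2
from insightface.app import FaceAnalysis

# Off-the-shelf face detector + recognition model bundle.
app = FaceAnalysis(name="antelopev2")
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("reference_face.jpg")    # placeholder path
faces = app.get(img)                      # detect faces, compute ID embeddings
id_embedding = faces[0].normed_embedding  # 512-d identity vector for the first face
```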
One of the key components to the solution that I really like is the following:
We eliminate the text prompts and use ID embedding as conditions for cross-attention layers in the ControlNet.
That is, the facial details are directly being attended to (injected) in the IdentityNet. Indeed: "In IdentityNet, the generation process is fully guided by face embedding without any textual information."
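To make that concrete, here is roughly how I picture "ID embedding instead of text embedding" as the cross-attention condition (a toy sketch with dimensions I made up, not InstantID's actual code):

```python
import torch

cross_attention_dim = 768  # whatever width the cross-attention layers expect
num_tokens = 4             # expand the single ID vector into a few "tokens"

id_embedding = torch.randn(1, 512)  # e.g. the 512-d face embedding from above

# A lightweight projector that turns one identity vector into a short token
# sequence, standing in for the usual text-token sequence.
proj = torch.nn.Linear(512, num_tokens * cross_attention_dim)
id_tokens = proj(id_embedding).reshape(1, num_tokens, cross_attention_dim)

# id_tokens would then be passed wherever the text embeddings normally go
# (the encoder_hidden_states of the IdentityNet/ControlNet cross-attention),
# so the keys/values come from the face, not from a text encoder.
```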
This makes so much sense! Why do we need to attend to text space in the ControlNet when we've already determined that text space was inadequate for transferring the facial details?

So now that I've summarized InstantID in this way, I was wondering whether the same idea could apply to arbitrary visual details. You can imagine that there are details in some image where "words alone are not sufficient to adequately capture the details necessary to reproduce those details." For example, that image isn't just of any dog, that is your dog. That image isn't just of any worn-down chair, that's your worn-down chair. The fidelity, the accuracy to this particular representation, can't be captured in words, but only in some reference image.
Thus, instead of embedding the facial details with a face encoder (and then attending to those), embed the specific visual details of any image with a visual detail encoder, for example some intermediate layer of VGG19 or ResNet50, and then attend to those specific visual details. This would be a means of transferring the visual information into your diffused image, bypassing entirely the need to construct a text-based description of it, or to project it into a shared text-image space that is still not an adequate description.
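As a concrete (and entirely illustrative) sketch of what I mean, with the encoder choice, layer cut-off, and projection width all placeholders I picked for the example:

```python
import torch
from torchvision import models

# "Visual detail encoder": an intermediate slice of VGG19 (here, up to relu4_1).
# In practice you'd load pretrained weights, e.g. weights=models.VGG19_Weights.DEFAULT.
vgg = models.vgg19(weights=None)
detail_encoder = vgg.features[:21].eval()

# Stand-in for a preprocessed reference image (your dog, your worn-down chair),
# resized/normalized to (1, 3, 224, 224).
reference = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    fmap = detail_encoder(reference)          # (1, 512, 28, 28) feature map

# Flatten the feature map into a token sequence and project it to the
# cross-attention width, exactly where text tokens would normally sit.
tokens = fmap.flatten(2).transpose(1, 2)      # (1, 784, 512)
to_cross_attn = torch.nn.Linear(512, 768)
detail_tokens = to_cross_attn(tokens)         # (1, 784, 768)

# detail_tokens would then be what the diffusion UNet's cross-attention layers
# attend to (its encoder_hidden_states), instead of a text/CLIP embedding.
```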
This is, ultimately, what I am after: identifying, extracting, and embedding specific visual details, and being able to "transfer" those uniquely identifying details over to a starting image. In the way I've described it above, this goal is not too dissimilar from InstantID. The failure point, of course, is that a human face encoder is an extremely specific and specialized visual detail encoder, and it might be exceedingly difficult for a model to figure out what it should do with arbitrary visual details during training unless you have a large enough dataset to teach it.