
User experience improvement for VACE-Wan2.1-1.3B-Preview #27


Open
hanzhn opened this issue Apr 7, 2025 · 34 comments
Labels
help wanted Extra attention is needed

Comments

@hanzhn
Collaborator

hanzhn commented Apr 7, 2025

We are now working on the final 14B version, and we appreciate the voices of users and developers. Let us know if you have experienced issues with VACE-Wan2.1-1.3B-Preview, or if you have interesting new ideas that you believe VACE could achieve.

Describe the task you are doing with VACE in detail below, and you may see improvements on it in the final version of VACE!

Please give us more feedback!

@hanzhn hanzhn added the help wanted Extra attention is needed label Apr 7, 2025
@Inner-Reflections

Inner-Reflections commented Apr 7, 2025

Amazing work! I have been using mostly pose and depth control to good effect. I have found it does not work too well with some existing LoRAs, especially those trained on styles. Would love to have a tile/upscale method.

@Djanghost

Hi there,

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

This model is, for me, the best video model we currently have in the open-source world, so a huge thanks for your work.

@kijai

kijai commented Apr 7, 2025

Hey, thanks for the amazing model(s)!

One thing that has come up a couple of times in feedback: people would like some way to better choose which control type is actually used. This refers to the method used in union ControlNets; I don't know if it's possible here, though.

Another thing that may already be possible, but I'm unsure how to do, would be adjusting the strength of control and reference separately. Related to that: using multiple separate VACE inputs, for which I have used the workaround of simply applying VACE multiple times, but it's somewhat clumsy.

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

@nokai

nokai commented Apr 7, 2025

It might be important to be able to edit the strength, so I could blend the original with the reference.

It would also be nice to be able to add more than one reference for different purposes.

I also thought it would be nice if we could do it with styles, meaning, for example, if you add a comic book image, it would take the style of the image and not just what you see.

I don't know if I'm perhaps rambling. Thank you very much for your work.

@KytraScript

First, thank you! This is one of the coolest toolkits to happen to open source! The work you're doing is incredible.
Second, I'd love to see a task focused on lip sync / facial expressions.

@hooplyha

hooplyha commented Apr 7, 2025

For me a focus on maintaining identity consistency would be amazing! I find identity often changes through generations. Thank you for your excellent work 👍

@deepbeepmeep

Many thanks for your great model.
Here are my observations:

  • Identity preservation could be better; sometimes it is a complete miss.
  • I agree with Kijai: it is not obvious which method the model is going to use. It would be great if we could have greater control. This could fix some bad interpretations of the targeted task (see the video extension issue below).
  • For some reason the output of a Video Extension (just give the first frames and the model extrapolates the rest) has its colors completely washed out. Maybe my inputs are wrong: I gave the source video as a control video for the first frames (appended with 127 white frames to complete) and a video mask which is black for the first n frames and white beyond. Am I doing anything wrong?

@able2608

able2608 commented Apr 8, 2025

For some reason the output of a Video Extension (just give the first frames and the model extrapolates the rest) has its colors completely washed out. Maybe my inputs are wrong: I gave the source video as a control video for the first frames (appended with 127 white frames to complete) and a video mask which is black for the first n frames and white beyond. Am I doing anything wrong?

It seems that the model wants gray to fill the masked-out part (see the sketch at the end of this comment).

I agree with Kijai: it is not obvious which method the model is going to use. It would be great if we could have greater control. This could fix some bad interpretations of the targeted task (see the video extension issue below).

I used OmniGen before, and that system has tags you can use to refer to a particular control signal. It might not be possible to do the same without changing the structure of the current model (OmniGen is a single transformer, and both text and image prompts are processed with the same model), but perhaps supporting instruction-based prompts (like "Generate a video of the guy in the black T-shirt in the left reference image giving a speech on the stage in the right of the reference image, with the pose of the pose control video") could work. This would give the user more fine-grained control over what the result should look like, instead of letting the model do the guesswork. If instruction-based prompts are out of reach because the base model is not fine-tuned for them (I imagine it might not perform well with that kind of prompt), then giving the model the ability to comprehend references in the prompt (something like "A video of the guy in the black T-shirt in the left reference image giving a speech on the stage in the right of the reference image, in the pose of the pose control video") would be good enough.

Also, it would be great if the 1.3B version is updated as well. Though 1.3B is tiny compared to other video models, it can already do a lot and the quality is not bad at all, plus it is much more consumer-GPU friendly to run. Paired with a CFG- and timestep-distilled base model, it can even output a 3-second video in a minute on a mid-tier gaming laptop!

Still, great thanks to the team that made and open-sourced this amazing tool. This IS the most versatile image/video creation tool that I've ever encountered, and it has already achieved quite a lot even in the preview version.
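
Below is a minimal sketch of how such a video-extension input could be assembled, assuming the pipeline accepts the source video and mask as plain frame arrays; the gray value 127, the helper name, and the frame counts are illustrative assumptions, not settings required by VACE.

```python
import numpy as np

def build_extension_input(first_frames, total_frames=81, height=480, width=832):
    """Hypothetical helper: keep the known leading frames, let VACE extrapolate the rest.

    first_frames: list of uint8 RGB frames (height, width, 3) taken from the source video.
    Returns (src_video, src_mask): frames padded with neutral gray, plus the matching mask.
    """
    n = len(first_frames)
    # Gray (127) fills the region the model should generate, per the note above.
    src_video = np.full((total_frames, height, width, 3), 127, dtype=np.uint8)
    src_video[:n] = np.stack(first_frames)

    # Mask convention from this thread: black (0) = keep pixels, white (255) = generate.
    src_mask = np.full((total_frames, height, width), 255, dtype=np.uint8)
    src_mask[:n] = 0
    return src_video, src_mask
```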

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@kijai Thanks for your contribution in making VACE run perfectly in ComfyUI workflows!

One thing that has come up a couple of times in feedback: people would like some way to better choose which control type is actually used. This refers to the method used in union ControlNets; I don't know if it's possible here, though.

VACE supports multiple control types in a single run, but only supports spatially separated control signals. For example, you can control a person with pose and the background with depth, but there should be no overlap. You can try overlapping control signals and it may work, but we did not explicitly train the model that way.
Another possibility to achieve this is to simply apply VACE multiple times with different controls.

Another thing that may already be possible, but I'm unsure how to do, would be adjusting the strength of control and reference separately. Related to that: using multiple separate VACE inputs, for which I have used the workaround of simply applying VACE multiple times, but it's somewhat clumsy.

We didn't fully realize its importance and have only tested context_scale at its default of 1. Unfortunately, we don't have any solid suggestion on that. However, since references are seen as additional frames inserted at the beginning of the context frame sequence, maybe there is a way to split the reference-related tensor along the frame dimension of the VACE block output tensor.
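
As an illustration only, here is a minimal sketch of what that split might look like, assuming a hypothetical hint tensor laid out as [batch, frames, ...] with the reference frames prepended; the real VACE block output layout may differ.

```python
import torch

def scale_vace_hint(hint: torch.Tensor, num_ref_frames: int,
                    ref_scale: float = 1.0, control_scale: float = 1.0) -> torch.Tensor:
    # hint: hypothetical VACE block output, shape [batch, frames, ...],
    # with the reference frames sitting at the front of the frame dimension.
    ref_part = hint[:, :num_ref_frames] * ref_scale
    ctx_part = hint[:, num_ref_frames:] * control_scale
    return torch.cat([ref_part, ctx_part], dim=1)
```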

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

You did it right. The original implementation was inherited from the training code; for inference, it can be much simpler.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@KytraScript

Second, I'd love to see a task focused on lip sync / facial expressions.

Maybe you should try DWPose with facial points. Please refer to the bottom-right case of Figure 1 of our arXiv paper.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@Inner-Reflections

I have found it does not work too well with some existing LoRAs, especially those trained on styles. Would love to have a tile/upscale method.

We just add VACE blocks to the Wan2.1 1.3B T2V model and didn't tune the main Wan T2V backbone, so any LoRA that worked before should work here. Maybe add more details to your prompt.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@Djanghost

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

If you want something to remain unchanged as part of the output video, just put it in the input context frames (not in the reference images, which will not be part of the output video). And don't forget to set the corresponding mask region to black, which means the pixels under the black mask will not be changed.
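
A minimal sketch of that layout, assuming plain frame/mask arrays rather than any specific ComfyUI node; the helper name and shapes are illustrative only.

```python
import numpy as np

def lock_first_frame(first_image, control_frames):
    """Keep `first_image` untouched and drive the remaining frames with control frames.

    first_image: uint8 RGB frame (H, W, 3) to preserve verbatim in the output.
    control_frames: uint8 control frames (pose/depth/...), shape (T, H, W, 3).
    """
    src_video = np.concatenate([first_image[None], control_frames], axis=0)

    # Black mask on the first frame -> keep it; white on the rest -> generate.
    t, h, w = src_video.shape[:3]
    src_mask = np.full((t, h, w), 255, dtype=np.uint8)
    src_mask[0] = 0
    return src_video, src_mask
```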

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@nokai

It might be important to be able to edit the strength, so I could blend the original with the reference.

Yes, I think context_scale is the strength.

It would also be nice to be able to add more than one reference for different purposes.

I believe our src_ref_images input is a list.

I also thought it would be nice if we could do it with styles, meaning, for example, if you add a comic book image, it would take the style of the image and not just what you see.

You can just tell the model the style you want in the prompt in that case.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@hooplyha

For me a focus on maintaining identity consistency would be amazing! I find identity often changes through generations.

I believe VACE provides a lot of possibilities for controlling identity. Are you doing long video generation through multiple runs?

@deepbeepmeep

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

I have implemented a similar fix in WanGP. The old code created a huge temporary tensor that fragmented the VRAM.

@Ayu889900

Ayu889900 commented Apr 8, 2025

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes.
If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames.
Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.

Edit:
From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

@ppc2017

ppc2017 commented Apr 8, 2025

VACE supports multiple control types in a single run, but only supports spatially separated control signals. For example, you can control a person with pose and the background with depth, but there should be no overlap. You can try overlapping control signals and it may work, but we did not explicitly train the model that way. Another possibility to achieve this is to simply apply VACE multiple times with different controls.

Can a scribble be mixed with pose, like in the following image? Is this considered overlapping?
Scribble gives better lip movement but weird/distorted eyes, so I want to keep only the mouth from the scribble and have the rest controlled by the pose. With the following example, the output video did not follow the lip movements; the mouth stayed open for the whole video.
[image: scribble for the mouth combined with a pose control frame]

Another question: is there any way to get only the motion of the camera (without the structure of the video) from a source video?

@Djanghost

Djanghost commented Apr 8, 2025

@Djanghost

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

If you want something to remain unchanged as part of the output video, just put it in the input context frames (not in the reference images, which will not be part of the output video). And don't forget to set the corresponding mask region to black, which means the pixels under the black mask will not be changed.

Thank you, it works well!
If you don't mind, do you have an approximate date scheduled for the release of the 14B model?

@kijai

kijai commented Apr 8, 2025

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes. If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames. Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.

Edit: From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

You can indeed have a reference image, and additionally any number of input images in the input_frames batch, at any index. You mark the frames you want to keep as black with the input_mask, and the frames you want to change as white.
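
A minimal sketch of that masking scheme, assuming plain frame/mask arrays; the helper name and indices are made up for illustration.

```python
import numpy as np

def place_keyframes(control_frames, keyframes):
    """Drop known target frames into a control-frame batch at arbitrary indices.

    control_frames: uint8 array (T, H, W, 3) of pose/depth/... frames.
    keyframes: dict mapping frame index -> uint8 RGB frame (H, W, 3) to keep unchanged.
    """
    src_video = control_frames.copy()
    t, h, w = control_frames.shape[:3]
    src_mask = np.full((t, h, w), 255, dtype=np.uint8)  # white = generate

    for idx, frame in keyframes.items():
        src_video[idx] = frame
        src_mask[idx] = 0  # black = keep this frame as-is

    return src_video, src_mask

# e.g. keep a generated start and end frame and let VACE fill the middle:
# src_video, src_mask = place_keyframes(depth_frames, {0: start_img, 80: end_img})
```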

@Ayu889900

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes. If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames. Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.
Edit: From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

You can indeed have a reference image, and additionally any number of input images in the input_frames batch, at any index. You mark the frames you want to keep as black with the input_mask, and the frames you want to change as white.

Thank you very much for the clarification! I didn’t expect that the feature I wanted most had already been implemented. Now, I’m just looking forward to the final 14B model.

@hanzhn
Collaborator Author

hanzhn commented Apr 9, 2025

Can a scribble be mixed with pose, like in the following image? Is this considered overlapping? Scribble gives better lip movement but weird/distorted eyes, so I want to keep only the mouth from the scribble and have the rest controlled by the pose. With the following example, the output video did not follow the lip movements; the mouth stayed open for the whole video.

Sorry, we only tested multi-control with separated foreground/background and multiple objects. In your case, the scribble region is a bit too detailed to be an independent object; I believe expanding it to the whole head would work better.

Another question: is there any way to get only the motion of the camera (without the structure of the video) from a source video?

No, we didn't include camera motion data in VACE training.

@yangqy1110

Hello, I would like to ask whether the layout_track task can automatically obtain the bbox and label from the original video.

@ppc2017

ppc2017 commented Apr 10, 2025

Another use case I tried was creating a seamless transition between two different videos. I added a few frames of the first video at the beginning of the source video, followed by a solid gray frame, and then a few frames from the second video at the end. Then I created the corresponding mask. I tried about 15 times with both simple and complex prompts, but most of the time it just did a quick fade between the two videos. Only once did I get a result that was somewhat okay.

Would training a LoRA with transition videos help, or would that just be a waste of time?

@AILova

AILova commented Apr 11, 2025

We are now working on the final 14B version, and we appreciate the voices of users and developers. Let us know if you have experienced issues with VACE-Wan2.1-1.3B-Preview, or if you have interesting new ideas that you believe VACE could achieve.

Describe the task you are doing with VACE in detail below, and you may see improvements on it in the final version of VACE!

Please give us more feedback!

Can't wait for the 14B version... the 1.3B preview is really fantastic! Thanks for this!

@hanzhn
Collaborator Author

hanzhn commented Apr 11, 2025

Hello, I would like to ask whether the layout_track task can automatically obtain the bbox and label from the original video.

Yes, check this out:

python vace/vace_preproccess.py --task layout_track --mode masktrack --mask assets/masks/test.png --label 'cat' --video assets/videos/test.mp4

There are different ways to do this.

@hanzhn
Collaborator Author

hanzhn commented Apr 11, 2025

Another use case I tried was creating a seamless transition between two different videos. I added a few frames of the first video at the beginning of the source video, followed by a solid gray frame, and then a few frames from the second video at the end. Then I created the corresponding mask. I tried about 15 times with both simple and complex prompts, but most of the time it just did a quick fade between the two videos. Only once did I get a result that was somewhat okay.

Would training a LoRA with transition videos help, or would that just be a waste of time?

In my opinion, the prompt is underestimated. Wan2.1 is very sensitive to the prompt, so I suggest adjusting the prompt and seeing how it goes.

@hanzhn hanzhn pinned this issue Apr 12, 2025
@hanzhn hanzhn unpinned this issue Apr 12, 2025
@hanzhn hanzhn pinned this issue Apr 12, 2025
@pmimaros

Why is the 14B version listed as 720x1080 and not as 720x1280?
Is this a typo?

@KytraScript

@KytraScript

Second, I'd love to see a task focused on lip sync / facial expressions.

Maybe you should try DWPose with facial points. Please refer to the bottom-right case of Figure 1 of our arXiv paper.

I've tested DWPose with facial points, and I personally do not think it is as good as training with MediaPipe Face instead.

[image: DWPose annotation with facial landmarks]
Take this image as an example. If you remove the DWPose facial landmarks and instead use DWPose body + MediaPipe face and hand annotations, the model's understanding of complex facial structure improves dramatically.

@Njasa2k

Njasa2k commented Apr 13, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

Does VACE have start-frame and end-frame control?

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?

Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

@KytraScript

I've tested DWPose with facial points, and I personally do not think it is as good as training with MediaPipe Face instead.

Thank you for the information. Mediapipe Face is great for lip sync.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Why is the 14B version listed as 720x1080 and not as 720x1280? Is this a typo?

Oh, sorry for the mistake. To set this straight, the resolution of the 14B version is 720x1280.

@Dibbidy

Dibbidy commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?

Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

Does it have start frame and end frame control? I don't think I've seen that in the project.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?
Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

Does it have start frame and end frame control? I don't think I've seen that in the project.

We didn't exhaustively implement every preprocessing step for all tasks, because there are so many ways to use VACE.

Basically, you can compose and input three types of frames in any order: natural video frames (with black masks to keep them in the output, or white masks to apply colorization), control frames (with white masks), and masked in/outpainting frames (with gray pixels and a white mask covering those pixels).
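
For illustration, here is a minimal sketch of composing those three frame types into a single input, assuming simple frame/mask arrays; the segment lengths and the gray value 127 are assumptions, not required settings.

```python
import numpy as np

def compose_vace_input(segments):
    """Concatenate (frames, keep) segments into a src_video / src_mask pair.

    segments: list of (frames, keep) tuples, where frames is uint8 (T, H, W, 3) and
    keep=True means a black mask (preserve these pixels) while keep=False means a
    white mask (generate: control frames, colorization, or gray in/outpainting).
    """
    videos, masks = [], []
    for frames, keep in segments:
        t, h, w = frames.shape[:3]
        videos.append(frames)
        masks.append(np.full((t, h, w), 0 if keep else 255, dtype=np.uint8))
    return np.concatenate(videos), np.concatenate(masks)

# e.g. natural start frames (kept) + depth control frames + gray outpainting tail:
# gray_tail = np.full((16, 480, 832, 3), 127, dtype=np.uint8)
# src_video, src_mask = compose_vace_input([(start_frames, True),
#                                           (depth_frames, False),
#                                           (gray_tail, False)])
```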
