
User experience improvement for VACE-Wan2.1-1.3B-Preview #27


Open
hanzhn opened this issue Apr 7, 2025 · 34 comments
Labels
help wanted Extra attention is needed

Comments

@hanzhn
Collaborator

hanzhn commented Apr 7, 2025

We are now working on the final 14B version, and we appreciate the voices of users and developers. Let us know if you have experienced issues with VACE-Wan2.1-1.3B-Preview, or if you have interesting new ideas that you believe VACE could achieve.

Describe the task you are doing with VACE in detail below, and you may see improvements on it in the final version of VACE!

Please give us more feedback!

@hanzhn hanzhn added the help wanted Extra attention is needed label Apr 7, 2025
@Inner-Reflections

Inner-Reflections commented Apr 7, 2025

Amazing work! I have been using mostly pose and depth control to good effect. I have found it does not work too well with some existing LoRAs, especially those trained on styles. Would love to have a tile/upscale method.

@Djanghost

Hi there,

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

This model is, for me, the best video model we currently have in the open-source world, so a huge thanks for your work.

@kijai

kijai commented Apr 7, 2025

Hey, thanks for the amazing model(s)!

One thing that has come up a couple of times in feedback: people would like some way to better choose which control type is actually used. This refers to the method used in union ControlNets; I don't know if it's possible here, though.

Another thing that may already be possible, but I'm unsure how to do, would be adjusting the strength of control and reference separately. Related to that: using multiple separate VACE inputs, for which I have used the workaround of simply applying VACE multiple times, but it's somewhat clumsy.

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

@nokai

nokai commented Apr 7, 2025

It might be important to be able to edit the strength, so I could blend the original with the reference.

It would also be nice to be able to add more than one reference for different purposes.

I also thought it would be nice if we could do it with styles, meaning, for example, if you add a comic book image, it would take the style of the image and not just what you see.

I don't know if I'm perhaps rambling. Thank you very much for your work.

@KytraScript

First, thank you! This is one of the coolest toolkits to happen to open source! The work you're doing is incredible.
Second, I'd love to see a task focused on lip sync / facial expressions.

@hooplyha

hooplyha commented Apr 7, 2025

For me a focus on maintaining identity consistency would be amazing! I find identity often changes through generations. Thank you for your excellent work 👍

@deepbeepmeep

Many thanks for your great model.
Here are my observations:

  • Identity preservation could be better; sometimes it is a complete miss.
  • I agree with Kijai: it is not obvious which method the model is going to use. It would be great if we could have greater control. This could fix some bad interpretations of the targeted task (see the video extension issue below).
  • For some reason the output of a Video Extension (just give the first frames and the model extrapolates the rest) has its colors completely washed out. Maybe my inputs are wrong: I gave the source video as a control video for the first frames (appended with 127 white frames to complete) and a video mask which is black for the first n frames and white beyond. Am I doing anything wrong?

@able2608

able2608 commented Apr 8, 2025

For some reason the output of a Video Extension (just give the first frames and the model extrapolates the rest) has its colors completely washed out. Maybe my inputs are wrong: I gave the source video as a control video for the first frames (appended with 127 white frames to complete) and a video mask which is black for the first n frames and white beyond. Am I doing anything wrong?

It seems that the model wants gray to fill the masked-out part (see the sketch at the end of this comment).

I agree with Kijai: it is not obvious which method the model is going to use. It would be great if we could have greater control. This could fix some bad interpretations of the targeted task (see the video extension issue below).

I used OmniGen before, and that system has tags you can use to refer to a particular control signal. It might not be possible to do the same without changing the structure of the current model (OmniGen is a single transformer, and both text and image prompts are processed with the same model), but perhaps supporting instruction-based prompts (like "Generate a video of the guy in the black T-shirt in the left reference image giving a speech on the stage in the right of the reference image, with the pose of the pose control video") could work. This would give the user more fine-grained control over what the result should look like, instead of letting the model do the guesswork. If instruction-based prompts are out of reach because the base model is not fine-tuned for them (I imagine it might not perform well with that kind of prompt), then giving the model the ability to comprehend references in the prompt (something like "A video of the guy in the black T-shirt in the left reference image giving a speech on the stage in the right of the reference image, in the pose of the pose control video") would be good enough.

Also, it would be great if the 1.3B version is updated as well. Though 1.3B is tiny compared to other video models, it can already do a lot and the quality is not bad at all, plus it is much more consumer-GPU friendly to run. Paired with a CFG- and timestep-distilled base model, it can even output a 3-second video in a minute on a mid-tier gaming laptop!

Still, great thanks to the team that made and open-sourced this amazing tool. This IS the most versatile image/video creation tool that I've ever encountered, and it has already achieved quite a lot even in the preview version.
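
Below is a minimal sketch of how such a video-extension input could be assembled, assuming the pipeline accepts the source video and mask as plain frame arrays; the gray value 127, the helper name, and the frame counts are illustrative assumptions, not settings required by VACE.

```python
import numpy as np

def build_extension_input(first_frames, total_frames=81, height=480, width=832):
    """Hypothetical helper: keep the known leading frames, let VACE extrapolate the rest.

    first_frames: list of uint8 RGB frames (height, width, 3) taken from the source video.
    Returns (src_video, src_mask): frames padded with neutral gray, plus the matching mask.
    """
    n = len(first_frames)
    # Gray (127) fills the region the model should generate, per the note above.
    src_video = np.full((total_frames, height, width, 3), 127, dtype=np.uint8)
    src_video[:n] = np.stack(first_frames)

    # Mask convention from this thread: black (0) = keep pixels, white (255) = generate.
    src_mask = np.full((total_frames, height, width), 255, dtype=np.uint8)
    src_mask[:n] = 0
    return src_video, src_mask
```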

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@kijai Thanks for your contribution in making VACE run perfectly in ComfyUI workflows!

One thing that has come up a couple of times in feedback: people would like some way to better choose which control type is actually used. This refers to the method used in union ControlNets; I don't know if it's possible here, though.

VACE supports multiple control types in a single run, but only supports spatially separated control signals. For example, you can control a person with pose and the background with depth, but there should be no overlap. You can try overlapping control signals and it may work, but we did not explicitly train the model that way.
Another possibility to achieve this is to simply apply VACE multiple times with different controls.

Another thing that may already be possible, but I'm unsure how to do, would be adjusting the strength of control and reference separately. Related to that: using multiple separate VACE inputs, for which I have used the workaround of simply applying VACE multiple times, but it's somewhat clumsy.

We didn't fully realize its importance and have only tested context_scale at its default of 1. Unfortunately, we don't have any solid suggestion on that. However, since references are seen as additional frames inserted at the beginning of the context frame sequence, maybe there is a way to split the reference-related tensor along the frame dimension of the VACE block output tensor.
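
As an illustration only, here is a minimal sketch of what that split might look like, assuming a hypothetical hint tensor laid out as [batch, frames, ...] with the reference frames prepended; the real VACE block output layout may differ.

```python
import torch

def scale_vace_hint(hint: torch.Tensor, num_ref_frames: int,
                    ref_scale: float = 1.0, control_scale: float = 1.0) -> torch.Tensor:
    # hint: hypothetical VACE block output, shape [batch, frames, ...],
    # with the reference frames sitting at the front of the frame dimension.
    ref_part = hint[:, :num_ref_frames] * ref_scale
    ctx_part = hint[:, num_ref_frames:] * control_scale
    return torch.cat([ref_part, ctx_part], dim=1)
```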

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

You did it right. The original implementation was inherited from the training code; for inference, it can be much simpler.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@KytraScript

Second, I'd love to see a task focused on lip sync / facial expressions.

Maybe you should try DWPose with facial points. Please refer to the bottom-right case of Figure 1 of our arXiv paper.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@Inner-Reflections

I have found it does not work too well with some existing LoRAs, especially those trained on styles. Would love to have a tile/upscale method.

We just add VACE blocks to the Wan2.1 1.3B T2V model and didn't tune the main Wan T2V backbone, so any LoRA that worked before should work here. Maybe add more details to your prompt.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@Djanghost

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

If you want something to remain unchanged as part of the output video, just put it in the input context frames (not in the reference images, which will not be part of the output video). And don't forget to set the corresponding mask region to black, which means the pixels under the black mask will not be changed.
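
A minimal sketch of that layout, assuming plain frame/mask arrays rather than any specific ComfyUI node; the helper name and shapes are illustrative only.

```python
import numpy as np

def lock_first_frame(first_image, control_frames):
    """Keep `first_image` untouched and drive the remaining frames with control frames.

    first_image: uint8 RGB frame (H, W, 3) to preserve verbatim in the output.
    control_frames: uint8 control frames (pose/depth/...), shape (T, H, W, 3).
    """
    src_video = np.concatenate([first_image[None], control_frames], axis=0)

    # Black mask on the first frame -> keep it; white on the rest -> generate.
    t, h, w = src_video.shape[:3]
    src_mask = np.full((t, h, w), 255, dtype=np.uint8)
    src_mask[0] = 0
    return src_video, src_mask
```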

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@nokai

It might be important to be able to edit the strength, so I could blend the original with the reference.

Yes, I think context_scale is the strength.

It would also be nice to be able to add more than one reference for different purposes.

I believe our src_ref_images input is a list.

I also thought it would be nice if we could do it with styles, meaning, for example, if you add a comic book image, it would take the style of the image and not just what you see.

You can just tell the model the style you want in the prompt in that case.

@hanzhn
Collaborator Author

hanzhn commented Apr 8, 2025

@hooplyha

For me a focus on maintaining identity consistency would be amazing! I find identity often changes through generations.

I believe VACE provides a lot of possibilities for controlling identity. Are you doing long video generation through multiple runs?

@deepbeepmeep

As for the memory use of VACE, there are some things I don't fully understand. The initial memory use was very high; the problem may have been caused by me or by something other than your implementation. However, when I made these changes the memory usage was reduced considerably without any drawback that I can see:

kijai/ComfyUI-WanVideoWrapper@a05d3ac

I have implemented a similar fix in WanGP. The old code created a huge temporary tensor that fragmented the VRAM.

@Ayu889900

Ayu889900 commented Apr 8, 2025

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes.
If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames.
Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.

Edit:
From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

@ppc2017

ppc2017 commented Apr 8, 2025

VACE supports multiple control types in a single run, but only supports spatially separated control signals. For example, you can control a person with pose and the background with depth, but there should be no overlap. You can try overlapping control signals and it may work, but we did not explicitly train the model that way. Another possibility to achieve this is to simply apply VACE multiple times with different controls.

Can a scribble be mixed with pose, like in the following image? Is this considered overlapping?
Scribble gives better lip movement but weird/distorted eyes, so I want to keep only the mouth from the scribble and have the rest controlled by the pose. With the following example, the output video did not follow the lip movements; the mouth stayed open for the whole video.
[image: scribble for the mouth combined with a pose control frame]

Another question: is there any way to get only the motion of the camera (without the structure of the video) from a source video?

@Djanghost

Djanghost commented Apr 8, 2025

@Djanghost

It would be great to be able to keep the first image passed as input unchanged (as with the Fun Control model), in case we wish to start from an existing scene without altering it.

If you want something to remain unchanged as part of the output video, just put it in the input context frames (not in the reference images, which will not be part of the output video). And don't forget to set the corresponding mask region to black, which means the pixels under the black mask will not be changed.

Thank you, it works well!
If you don't mind, do you have an approximate date scheduled for the release of the 14B model?

@kijai

kijai commented Apr 8, 2025

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes. If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames. Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.

Edit: From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

You can indeed have a reference image, and additionally any number of input images in the input_frames batch, at any index. You mark the frames you want to keep as black with the input_mask, and the frames you want to change as white.
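
A minimal sketch of that masking scheme, assuming plain frame/mask arrays; the helper name and indices are made up for illustration.

```python
import numpy as np

def place_keyframes(control_frames, keyframes):
    """Drop known target frames into a control-frame batch at arbitrary indices.

    control_frames: uint8 array (T, H, W, 3) of pose/depth/... frames.
    keyframes: dict mapping frame index -> uint8 RGB frame (H, W, 3) to keep unchanged.
    """
    src_video = control_frames.copy()
    t, h, w = control_frames.shape[:3]
    src_mask = np.full((t, h, w), 255, dtype=np.uint8)  # white = generate

    for idx, frame in keyframes.items():
        src_video[idx] = frame
        src_mask[idx] = 0  # black = keep this frame as-is

    return src_video, src_mask

# e.g. keep a generated start and end frame and let VACE fill the middle:
# src_video, src_mask = place_keyframes(depth_frames, {0: start_img, 80: end_img})
```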

@Ayu889900

Thank you for your great work. During my usage, the biggest need I’ve felt is for both start and end frames + process control. Currently, in Kijai’s workflow, we can only use a single reference frame and control frames, which makes character features uncontrollable when the person turns around or undergoes significant changes. If it were possible to use both the start and end frames, along with control methods like depth during the process, everything would work much better. This is because I can easily generate the desired start and end frames using various existing image diffusion models, and then just use VACE combined with depth or similar controls to generate the intermediate frames. Additionally, I believe that if the ultimate goal is to allow inserting target frames at any point within a video, while also supporting control frames for the entire sequence, then this tool would become truly production-ready. It would essentially enable fully controllable generation of any video I want, with near-perfect precision.
Edit: From what I read in the reply above, it seems that I might be able to achieve inserting a target frame at any point in the video combined with depth or similar controls, by changing the frame I want to keep unchanged from a control frame to a target frame, and then changing the corresponding mask at that position from white to black. Is that correct?

You can indeed have a reference image, and additionally any number of input images in the input_frames batch, at any index. You mark the frames you want to keep as black with the input_mask, and the frames you want to change as white.

Thank you very much for the clarification! I didn’t expect that the feature I wanted most had already been implemented. Now, I’m just looking forward to the final 14B model.

@hanzhn
Collaborator Author

hanzhn commented Apr 9, 2025

Can a scribble be mixed with pose, like in the following image? Is this considered overlapping? Scribble gives better lip movement but weird/distorted eyes, so I want to keep only the mouth from the scribble and have the rest controlled by the pose. With the following example, the output video did not follow the lip movements; the mouth stayed open for the whole video.

Sorry, we only tested multi-control with separated foreground/background and multiple objects. In your case, the scribble region is a bit too detailed to be an independent object; I believe expanding it to the whole head would work better.

Another question: is there any way to get only the motion of the camera (without the structure of the video) from a source video?

No, we didn't include camera motion data in VACE training.

@yangqy1110

Hello, I would like to ask whether the layout_track task can automatically obtain the bbox and label from the original video.

@ppc2017

ppc2017 commented Apr 10, 2025

Another use case I tried was creating a seamless transition between two different videos. I added a few frames of the first video at the beginning of the source video, followed by a solid gray frame, and then a few frames from the second video at the end. Then I created the corresponding mask. I tried about 15 times with both simple and complex prompts, but most of the time it just did a quick fade between the two videos. Only once did I get a result that was somewhat okay.

Would training a LoRA with transition videos help, or would that just be a waste of time?

@AILova

AILova commented Apr 11, 2025

We are now working on the final 14B version, and we appreciate the voices of users and developers. Let us know if you have experienced issues with VACE-Wan2.1-1.3B-Preview, or if you have interesting new ideas that you believe VACE could achieve.

Describe the task you are doing with VACE in detail below, and you may see improvements on it in the final version of VACE!

Please give us more feedback!

Can't wait for the 14B version... the 1.3B preview is really fantastic! Thanks for this!

@hanzhn
Collaborator Author

hanzhn commented Apr 11, 2025

Hello, I would like to ask whether the layout_track task can automatically obtain the bbox and label from the original video.

Yes, check this out:

python vace/vace_preproccess.py --task layout_track --mode masktrack --mask assets/masks/test.png --label 'cat' --video assets/videos/test.mp4

There are different ways to do this.

@hanzhn
Collaborator Author

hanzhn commented Apr 11, 2025

Another use case I tried was creating a seamless transition between two different videos. I added a few frames of the first video at the beginning of the source video, followed by a solid gray frame, and then a few frames from the second video at the end. Then I created the corresponding mask. I tried about 15 times with both simple and complex prompts, but most of the time it just did a quick fade between the two videos. Only once did I get a result that was somewhat okay.

Would training a LoRA with transition videos help, or would that just be a waste of time?

In my opinion, the prompt is underestimated. Wan2.1 is very sensitive to the prompt, so I suggest adjusting the prompt and seeing how it goes.

@hanzhn hanzhn pinned this issue Apr 12, 2025
@hanzhn hanzhn unpinned this issue Apr 12, 2025
@hanzhn hanzhn pinned this issue Apr 12, 2025
@pmimaros

Why is the 14B version listed as 720x1080 and not as 720x1280?
Is this a typo?

@KytraScript

@KytraScript

Second, I'd love to see a task focused on lip sync / facial expressions.

Maybe you should try DWPose with facial points. Please refer to the bottom-right case of Figure 1 of our arXiv paper.

I've tested DWPose with facial points, and I personally do not think it is as good as training with MediaPipe Face instead.

[image: DWPose annotation with facial landmarks]
Take this image as an example. If you remove the DWPose facial landmarks and instead use DWPose body + MediaPipe face and hand annotations, the model's understanding of complex facial structure improves dramatically.

@Njasa2k

Njasa2k commented Apr 13, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

Does VACE have start-frame and end-frame control?

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?

Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

@KytraScript

I've tested DWPose with facial points, and I personally do not think it is as good as training with MediaPipe Face instead.

Thank you for the information. Mediapipe Face is great for lip sync.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Why is the 14B version listed as 720x1080 and not as 720x1280? Is this a typo?

Oh, sorry for the mistake. To set this straight, the resolution of the 14B version is 720x1280.

@Dibbidy

Dibbidy commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?

Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

Does it have start frame and end frame control? I don't think I've seen that in the project.

@hanzhn
Collaborator Author

hanzhn commented Apr 14, 2025

Is it possible to use controlnets as the initial frame and end frame instead of images?

You mean using control frames such as pose/depth instead of natural video frames?
Yes, VACE is very versatile and flexible to use. Feel free to explore boldly and see what you find.

Does it have start frame and end frame control? I don't think I've seen that in the project.

We didn't exhaustively implement every preprocessing step for all tasks, because there are so many ways to use VACE.

Basically, you can compose and input three types of frames in any order: natural video frames (with black masks to keep them in the output, or white masks to apply colorization), control frames (with white masks), and masked in/outpainting frames (with gray pixels and a white mask covering those pixels).
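
For illustration, here is a minimal sketch of composing those three frame types into a single input, assuming simple frame/mask arrays; the segment lengths and the gray value 127 are assumptions, not required settings.

```python
import numpy as np

def compose_vace_input(segments):
    """Concatenate (frames, keep) segments into a src_video / src_mask pair.

    segments: list of (frames, keep) tuples, where frames is uint8 (T, H, W, 3) and
    keep=True means a black mask (preserve these pixels) while keep=False means a
    white mask (generate: control frames, colorization, or gray in/outpainting).
    """
    videos, masks = [], []
    for frames, keep in segments:
        t, h, w = frames.shape[:3]
        videos.append(frames)
        masks.append(np.full((t, h, w), 0 if keep else 255, dtype=np.uint8))
    return np.concatenate(videos), np.concatenate(masks)

# e.g. natural start frames (kept) + depth control frames + gray outpainting tail:
# gray_tail = np.full((16, 480, 832, 3), 127, dtype=np.uint8)
# src_video, src_mask = compose_vace_input([(start_frames, True),
#                                           (depth_frames, False),
#                                           (gray_tail, False)])
```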
