Hi, I was reading your code and found a part that disagrees with my understanding.
In compute_loss in CogVideo/finetune/models/cogvideox_i2v/lora_trainer.py, there is a denoising step:
```python
# Denoise
latent_pred = self.components.scheduler.get_velocity(predicted_noise, latent_noisy, timesteps)

alphas_cumprod = self.components.scheduler.alphas_cumprod[timesteps]
weights = 1 / (1 - alphas_cumprod)
while len(weights.shape) < len(latent_pred.shape):
    weights = weights.unsqueeze(-1)

loss = torch.mean((weights * (latent_pred - latent) ** 2).reshape(batch_size, -1), dim=1)
loss = loss.mean()
return loss
```
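For reference, my understanding is that get_velocity follows the standard diffusers v-prediction convention, v_t = sqrt(alpha_bar_t) * noise - sqrt(1 - alpha_bar_t) * sample. That is an assumption on my part (not taken from this repository), sketched roughly as:

```python
import torch

def assumed_get_velocity(scheduler, sample: torch.Tensor, noise: torch.Tensor,
                         timesteps: torch.Tensor) -> torch.Tensor:
    # Sketch of the diffusers-style definition (my assumption, not copied from this repo):
    #   v_t = sqrt(alpha_bar_t) * noise - sqrt(1 - alpha_bar_t) * sample
    a_bar = scheduler.alphas_cumprod.to(device=sample.device, dtype=sample.dtype)[timesteps]
    while len(a_bar.shape) < len(sample.shape):
        a_bar = a_bar.unsqueeze(-1)  # broadcast per-timestep scalars over latent dims
    return a_bar ** 0.5 * noise - (1 - a_bar) ** 0.5 * sample
```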
In the last part, you define a loss between latent_pred and latent. latent_pred corresponds to the velocity at timestep t, whereas latent corresponds to the ground-truth latent image representation.
As far as I know, the velocity does not directly correspond to the latent image representation, so I think there should be one more line reconstructing the latent from the velocity, as sketched below.
If this reconstruction is not necessary, could you explain how this loss works during training?
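To make the suggestion concrete, the kind of reconstruction I have in mind is the standard v-prediction inversion x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v_t. The helper below is hypothetical (my own names, not from the repository):

```python
import torch

def velocity_to_latent(scheduler, velocity_pred: torch.Tensor, latent_noisy: torch.Tensor,
                       timesteps: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: recover the predicted clean latent x0 from a velocity prediction
    #   x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v_t
    a_bar = scheduler.alphas_cumprod.to(device=latent_noisy.device, dtype=latent_noisy.dtype)[timesteps]
    while len(a_bar.shape) < len(latent_noisy.shape):
        a_bar = a_bar.unsqueeze(-1)  # broadcast per-timestep scalars over latent dims
    return a_bar ** 0.5 * latent_noisy - (1 - a_bar) ** 0.5 * velocity_pred

# e.g. the loss would then compare the reconstruction against the ground-truth latent:
# latent_x0_pred = velocity_to_latent(self.components.scheduler, latent_pred, latent_noisy, timesteps)
# loss = torch.mean((weights * (latent_x0_pred - latent) ** 2).reshape(batch_size, -1), dim=1)
```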
Thank you!!