LTX Image2Video LoRA #150
base: main
Conversation
Thanks!
@@ -11,6 +11,7 @@ class Args:
    The arguments for the finetrainers training script.

    Args:
        TODO: write informational docstring
Yeah, this should be clubbed into separate PRs. Just a TODO note is fine!
@@ -391,6 +392,41 @@ def _add_diffusion_arguments(parser: argparse.ArgumentParser) -> None:
    )


def _add_regularization_arguments(parser: argparse.ArgumentParser) -> None:
Beautiful!
@@ -391,6 +392,41 @@ def _add_diffusion_arguments(parser: argparse.ArgumentParser) -> None:
    )


def _add_regularization_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--caption_dropout_p",
Let's validate that this is always in [0, 1], if not done already?
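One way to do that (a sketch, not the PR's code; the bounded_probability helper is a hypothetical name) is an argparse type that rejects values outside [0, 1]:

```python
import argparse


def bounded_probability(value: str) -> float:
    # Hypothetical helper: parse the flag and reject anything outside [0, 1].
    p = float(value)
    if not 0.0 <= p <= 1.0:
        raise argparse.ArgumentTypeError(f"expected a probability in [0, 1], got {p}")
    return p


parser = argparse.ArgumentParser()
parser.add_argument("--caption_dropout_p", type=bounded_probability, default=0.0)
parser.add_argument("--image_condition_dropout_p", type=bounded_probability, default=0.0)
```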
help="Technique to use for caption dropout.", | ||
) | ||
parser.add_argument( | ||
"--image_condition_dropout_p", |
Same as above.
else:
    # Map from [0, 1] to [0, image_condition_noise_scale]
    scale_factor = random.random() * image_condition_noise_scale
    # :/ Because we don't have torch.randn_like
Are we referring to this? Also, do we always have to add noise to the conditional latent? For Flux Control, we keep the conditional latent clean.
Yes, torch.randn_like does not support using a generator, which is why I first create an empty tensor and then call normal_ to create the Gaussian noise. I would like to maintain 100% reproducible runs, so it is vital we always do this where needed, and try to reach that stage if we're not there already.
Here, I tried making an educated guess that adding noise to the image would serve as a good regularizer, because we don't have information on how the model was trained. I haven't done enough experiments to see that it yields any benefits, so I will be removing this.
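For reference, the empty-tensor workaround being described looks roughly like this (a sketch; the shape and seed are placeholders):

```python
import torch

generator = torch.Generator(device="cpu").manual_seed(42)
latents = torch.zeros(1, 49, 128)  # placeholder shape

# torch.randn_like cannot take a generator, so draw seeded Gaussian noise into an
# empty tensor of the same shape/dtype/device instead.
noise = torch.empty_like(latents).normal_(generator=generator)
```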
For Flux as well, we chose not to add any noise, but typically this should be experimented with so that the model can learn to pay better attention to control signals even if they are noisy. I have seen this being done in at least a few places now, so I assumed it would make sense, but unless I run a large 10k-50k step run, it would be hard to evaluate its effect.
Makes sense.
So, IMO, it's okay to keep the argument for experimentation purposes. I will try to do this for Flux Control too and see if we get any effects there.
video_noisy_latents = (1.0 - sigmas) * latents[:, image_frame_end_offset:] + sigmas * noise[
    :, image_frame_end_offset:
]
noisy_latents = torch.cat([image_latents, video_noisy_latents], dim=1)
Shouldn't this be noisy_latents = torch.cat([video_noisy_latents, image_latents], dim=1)?
This is what we do in Cog, too:
noisy_model_input = torch.cat([noisy_video_latents, image_latents], dim=2)
Cog is a different architecture that uses channel-wise concatenated latents. LTXV does not use channel-wise concatenation. Instead, for LTX we keep the first frame clean and noise all other frames. We also don't perform any denoising on the first clean frame (it is not part of the loss either), and only denoise the remaining frames.
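A minimal sketch of that setup (illustrative shapes and names, following the diff above; the loss masking reflects the description here rather than the exact PR code):

```python
import torch

batch_size, num_frames, channels = 2, 13, 128
latents = torch.randn(batch_size, num_frames, channels)  # [B, F, C], illustrative
noise = torch.randn_like(latents)
sigmas = torch.rand(batch_size, 1, 1)                    # one noise level per sample
image_frame_end_offset = 1                               # number of clean conditioning frames

# First frame stays clean; only the remaining frames are noised.
image_latents = latents[:, :image_frame_end_offset]
video_noisy_latents = (1.0 - sigmas) * latents[:, image_frame_end_offset:] + sigmas * noise[
    :, image_frame_end_offset:
]
noisy_latents = torch.cat([image_latents, video_noisy_latents], dim=1)

# The model sees all frames, but the loss only covers the frames that were noised.
model_pred = noisy_latents                # stand-in for transformer(noisy_latents, ...)
target = noise - latents                  # flow-matching style target, illustrative
loss = torch.nn.functional.mse_loss(
    model_pred[:, image_frame_end_offset:], target[:, image_frame_end_offset:]
)
```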
Wow, that is cool. Thanks for explaining. Then let’s make this a note?
This and this -- suggest we should definitely experiment with whether noising the conditional latent is a good regularizer in this case.
@yoavhacohen I made an assumption that the image latent is always kept clean and no denoising is applied to it. Is that the case when training LTXV? Or do you add a varying amount of noise based on some randomly sampled timestep (different from that of other frame latents) and perform denoising on it too? This is the first time I've seen this technique of per-token/per-frame denoising level so I'm not sure what to do without making guesses :/
I recommend training text-to-video and image-to-video models simultaneously, adding a varying amount of noise based on a randomly sampled small timestep.
For training with image conditioning, you just need to determine the noise scheduler to apply to the tokens that correspond to the conditioning frame (versus the other tokens).
I recommend making the implementation generic rather than specific to the first frame - it should support conditioning on any subset of tokens, not just those corresponding to the first frame.
Thank you! And generally speaking, the noise to be added to the conditioning frame — should it have a smaller magnitude than the one being added to the rest of the tokens?
@yoavhacohen That sounds great, thank you! Since we don't support the idea of per-token timesteps in diffusers, I think we might have to write a custom scheduler step implementation - will give this a stab soon
The noise applied to conditioning tokens should be reduced compared to other tokens in the sequence - that's actually what defines them as conditioning tokens.
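Putting these suggestions together, a rough sketch of per-token noise levels driven by a generic conditioning mask (purely illustrative, not finetrainers or diffusers code; the shapes and the 0.05 scale are placeholders):

```python
import torch

batch_size, num_tokens, channels = 2, 256, 128
latents = torch.randn(batch_size, num_tokens, channels)
noise = torch.randn_like(latents)

# Generic conditioning mask: True for tokens we condition on. Here it is the first
# 32 tokens, but it could be any subset, not just those of the first frame.
conditioning_mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool)
conditioning_mask[:, :32] = True

# Per-token noise levels: ordinary tokens get the regular sampled sigma, while
# conditioning tokens get a much smaller one, i.e. reduced (not zero) noise.
sigmas = torch.rand(batch_size, 1).expand(batch_size, num_tokens).clone()
conditioning_sigmas = 0.05 * torch.rand(batch_size, 1).expand(batch_size, num_tokens)
sigmas = torch.where(conditioning_mask, conditioning_sigmas, sigmas).unsqueeze(-1)

noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
```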
# Map from [0, 1] to [0, image_condition_noise_scale]
scale_factor = random.random() * image_condition_noise_scale
# :/ Because we don't have torch.randn_like
latents[:, :, 0] = (
@yoavhacohen Would love to know your thoughts on this as well. It was an educated guess based on some other training code I've come across for image latent regularisation. Since we're not entirely sure how LTXV was trained, this may well not be helpful and could cause worse results. I'm still experimenting, but any training details for things like this would be super awesome 🤗
You should add noise to the conditioning tokens in the same way as the other tokens - just use a different noise level.
image_or_video = image_or_video.to(device=device, dtype=vae.dtype)
image_or_video = image_or_video.permute(0, 2, 1, 3, 4).contiguous()  # [B, C, F, H, W] -> [B, F, C, H, W]

# Note: we separately encode the image and video because there is a 4x compression applied. We only want to condition
Thanks @yoavhacohen, I'll make the update soon! Makes sense looking at it in hindsight, since this was mostly just guesswork
What are some settings I can use if I want to train on an RTX 3060 12 GB VRAM? I read that if you turn off validation, you only need 11 GB VRAM. What other optimizations can I use?
You could try
Can we test this PR, or are the comments about the TODO changes breaking training and not worth testing until they're done? :)
Hi @scarbain. Sorry, I haven't had the time to move this to completion yet. My last training run did not yield particularly interesting results, and I've yet to address some comments from the original model author, so I would recommend waiting until this is merged unless you have access to ample GPU resources for testing/debugging. Will try to complete it soon, after a new suite of memory optimizations, in the coming days.
Hi @a-r-r-o-w! I'm sorry for asking again; I certainly don't want to pressure you on this, because all your work on this repository is a great gift to the community and you should prioritise however you want. Do you have an approximate ETA for completing this PR? :)
In order to run the following example, one needs at least 24 GB VRAM (because of using 161 frames; if you set the resolution buckets to use a lower number of frames, the VRAM requirements will be lower).
In order to run, one needs the same dataset format that we've been using so far:
prompts.txt
videos.txt
videos/
Since this is image-to-video training, one also needs validation images in addition to prompts. As an example, this can be done by simply taking the first frame of your training videos (one way to do that is sketched below). Ideally, you should test with starting images other than your training videos to verify that the LoRA works.
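A sketch of grabbing first frames with OpenCV (the validation_images output folder and the assumption that videos.txt lists one video path per line are mine, not part of this PR):

```python
import os

import cv2  # opencv-python

os.makedirs("validation_images", exist_ok=True)  # output folder name is arbitrary
with open("videos.txt") as f:
    video_paths = [line.strip() for line in f if line.strip()]

for path in video_paths:
    capture = cv2.VideoCapture(path)
    success, frame = capture.read()  # first frame of the video
    capture.release()
    if success:
        name = os.path.splitext(os.path.basename(path))[0]
        cv2.imwrite(os.path.join("validation_images", f"{name}.png"), frame)
```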
script
Note that I don't know if the training works yet. I've queued some runs that should finish overnight if there were no bugs that would cause a crash.