Trained for 300 epochs
Trained for 500 epochs (needs more training)
This repository implements the paper Latte: Latent Diffusion Transformer for Video Generation. It provides code for the following:
- Training and inference of VAE on Moving MNIST and UCF101 frames
- Training and inference of the Latte video model, using the trained VAE, on 16-frame video clips of both datasets
- Configurable code for training all models from Latte-S to Latte-XL
This repo closely follows the official Latte implementation, except for the following changes:
- Current code is for unconditional generation
- Variance is fixed during training and not learned (like the original DDPM)
- No EMA
- Ability to train VAE
- Ability to save latents of video frames for faster training
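For context on the fixed-variance choice above: the reverse-process covariance is taken from the noise schedule rather than predicted, i.e. $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t), \sigma_t^2 I\big)$ with $\sigma_t^2$ set to a schedule constant such as $\beta_t$ (or $\tilde{\beta}_t$), as in the original DDPM formulation.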
- Create a new conda environment with Python 3.10, then run the commands below
conda activate <environment_name>
git clone https://github.com/explainingai-code/VideoGeneration-PyTorch.git
cd VideoGeneration-PyTorch
pip install -r requirements.txt
- Download the LPIPS weights by opening this link in a browser (don't use wget) https://github.com/richzhang/PerceptualSimilarity/blob/master/lpips/weights/v0.1/vgg.pth and downloading the raw file. Place the downloaded weights file at `model/weights/v0.1/vgg.pth`
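An optional quick check that the weights are in the expected location (path as above):

```python
# Optional sanity check that the LPIPS VGG weights are in the expected location.
import os
import torch

weights_path = 'model/weights/v0.1/vgg.pth'
assert os.path.exists(weights_path), 'Place the downloaded vgg.pth at model/weights/v0.1/'
state_dict = torch.load(weights_path, map_location='cpu')
print('Loaded LPIPS weights with', len(state_dict), 'tensors')
```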
For Moving MNIST, I have used the easy category of videos, which have one digit moving across frames.
Download the videos from this page. The download includes captions as well, so one can also try text-to-video models using it.
Create a data directory in the repo root and add the downloaded mmnist-easy folder there.
Ensure the directory structure is the following:
VideoGeneration-PyTorch
-> data
-> mmnist-easy
-> *.mp4
For setting up UCF101, simply download the videos from the official page here and add them to the data directory.
Ensure the directory structure is the following:
VideoGeneration-PyTorch
-> data
-> UCF101
-> *.avi
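As an optional sanity check (assuming the layouts above), the video files can be counted from the repo root:

```python
# Optional check: count the video files under the expected data/ sub-directories.
from pathlib import Path

for folder, ext in [('data/mmnist-easy', 'mp4'), ('data/UCF101', 'avi')]:
    videos = list(Path(folder).rglob(f'*.{ext}'))  # rglob also covers nested sub-folders
    print(f'{folder}: found {len(videos)} .{ext} files')
```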
Allows you to play with different components of Latte and the autoencoder.
- `config/mnist.yaml` - Configuration used for the Moving MNIST dataset
- `config/ucf.yaml` - Configuration used for the UCF101 dataset

Important configuration parameters:
- `autoencoder_acc_steps` : For accumulating gradients if the video size is too large and a large batch size can't be used.
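For reference, the sketch below illustrates the gradient-accumulation idea that `autoencoder_acc_steps` controls; the tiny model and random data are stand-ins purely for illustration, not the repo's actual training loop.

```python
# Illustration of gradient accumulation: gradients from several small batches
# are accumulated before one optimizer step, emulating a larger batch size.
import torch
import torch.nn as nn

acc_steps = 4                                    # e.g. the value of autoencoder_acc_steps
model = nn.Linear(8, 8)                          # stand-in for the autoencoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

optimizer.zero_grad()
for step in range(16):                           # stand-in for iterating the frame dataloader
    batch = torch.randn(2, 8)                    # small batch that fits in memory
    loss = ((model(batch) - batch) ** 2).mean()  # stand-in reconstruction loss
    (loss / acc_steps).backward()                # scale so accumulation matches one big batch
    if (step + 1) % acc_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```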
The repo provides training and inference for Moving MNIST (unconditional Latte model).
For working on your own dataset:
- Create your own config and ensure the following config parameters are correctly set (see the config-reading sketch after this list):
  - `im_path` - Folder name where latents will be saved (when the `save_latents` script is run later)
  - `video_path` - Path to the videos
  - `video_ext` - Extension for the videos. The assumption is that all videos share the same extension
  - `frame_height`, `frame_width`, `frame_channels` - Dimensions to which frames will be resized
  - `centre_square_crop` - Whether center cropping is needed on the frames or not
  - `video_filter_path` - null / location of a txt file containing the names of the videos to be used for training. If `null`, then all videos will be used (as for Moving MNIST). To see how to construct this filter file, look at the `dataset/ucf_filter.txt` file
- The existing `video_dataset.py` should require minimal modifications to adapt to your dataset requirements
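Below is a minimal sketch of reading these parameters with PyYAML. The exact nesting of keys inside the repo's configs may differ, so the `dataset_params` access pattern here is an assumption based on the descriptions above.

```python
# Minimal sketch: load a config and inspect the dataset parameters described above.
import yaml

with open('config/mnist.yaml', 'r') as f:
    config = yaml.safe_load(f)

dataset_params = config['dataset_params']           # assumed section holding the keys below
print(dataset_params['video_path'])                  # path to the videos
print(dataset_params['video_ext'])                   # extension shared by all videos
print(dataset_params['frame_height'], dataset_params['frame_width'])
print(dataset_params['video_filter_path'])           # None => use all videos
```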
Once the config and dataset are set up:
- First, train the autoencoder on your dataset using this section
- For training and inference of the unconditional Latte model, follow this section
- We need to first extract frames for training the autoencoder
  - By default, we extract only 10% of the frames from our videos. Change `ae_frame_sample_prob` in `dataset_params` of the config if you want to train on a larger number of frames. For both Moving MNIST and UCF101, 10% works fine.
  - If you need to train the Latte model on a subset of videos, then ensure `video_filter_path` is correctly set. Look at the UCF config and `dataset/ucf_filter.txt` for guidance.
  - Extract the frames by running `python -m capture_frames --config config/mnist.yaml` with the right config value
- For training the autoencoder
  - Minimal modifications might be needed to `dataset/image_dataset.py` to adapt it to your own dataset
  - Make sure the `frame_height` and `frame_width` parameters are correctly set for resizing frames (if needed)
  - Run `python -m tools.train_vae --config config/mnist.yaml` for training the autoencoder with the right config file
- In case your GPU memory is limited, I would suggest running `python -m tools.save_latents --config config/mnist.yaml` with the correct config. This script saves the latent frames for all your dataset videos; otherwise, the VAE would also have to be loaded during diffusion model training. A rough sketch of the idea is shown below.
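The sketch below captures the idea behind caching latents; the `vae` object and its `encode()` call are placeholders for illustration rather than the repo's actual classes.

```python
# Sketch: encode a clip's frames once with the trained VAE and cache the latents,
# so Latte training can read latents from disk instead of running the VAE each time.
import torch

@torch.no_grad()
def save_video_latents(vae, frames, save_path, device='cuda'):
    # frames: (num_frames, channels, height, width) tensor of preprocessed frames
    vae = vae.to(device).eval()
    latents = vae.encode(frames.to(device))   # assumed encode() returning a latent tensor
    torch.save(latents.cpu(), save_path)      # loaded later by the video dataset
```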
Train the autoencoder first and make any needed changes to the config and `video_dataset.py` to adapt them to your requirements.
- Run `python -m tools.train_vae_ditv --config config/mnist.yaml` for training the unconditional Latte model using the right config
- Run `python -m tools.sample_vae_ditv --config config/mnist.yaml` for generating videos using the trained Latte model
Outputs will be saved according to the configuration present in the yaml files.
For every run, a folder named after the `task_name` key in the config will be created.
During frame extraction, a folder named after the `im_path` key will be created inside the `task_name` directory and the frames will be saved there.
During training of the autoencoder, the following outputs will be saved:
- Latest autoencoder and discriminator checkpoints in the `task_name` directory
- Sample reconstructions in `task_name/vae_autoencoder_samples`

If the `save_latents` script is run, then latents will be saved in `task_name/save_video_latent_dir`, if mentioned in the config.
During training and inference of the unconditional Latte model, the following outputs will be saved:
- During training, the latest checkpoint is saved in the `task_name` directory
- During sampling, the unconditionally sampled video generated across all timesteps is saved in `task_name/samples/*.mp4`. The final decoded generated video will be `sample_video_0.mp4`. Videos from `sample_output_999.mp4` to `sample_output_1.mp4` are the latent video predictions of the denoising process from T=999 to T=1. The final generated video is at T=0.
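Conceptually (this is not the repo's sampler code, just an illustration of the naming), the files map to the reverse-diffusion loop roughly like this:

```python
# Conceptual mapping from denoising timestep to saved file name.
num_timesteps = 1000
for t in reversed(range(num_timesteps)):                       # T=999 ... T=0
    name = f'sample_output_{t}.mp4' if t > 0 else 'sample_video_0.mp4'
    # ...at each step the current latent video prediction would be decoded
    # with the VAE and written to task_name/samples/<name>
```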
@misc{ma2024lattelatentdiffusiontransformer,
title={Latte: Latent Diffusion Transformer for Video Generation},
author={Xin Ma and Yaohui Wang and Gengyun Jia and Xinyuan Chen and Ziwei Liu and Yuan-Fang Li and Cunjian Chen and Yu Qiao},
year={2024},
eprint={2401.03048},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2401.03048},
}