[SIGGRAPH Asia 2023] Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

williamyang1991/Rerender_A_Video

Rerender A Video - Official PyTorch Implementation

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
in SIGGRAPH Asia 2023 Conference Proceedings
Project Page | Paper | Supplementary Video | Input Data and Video Results

Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

Features:

  • Temporal consistency: cross-frame constraints for low-level temporal consistency.
  • Zero-shot: no training or fine-tuning required.
  • Flexibility: compatible with off-the-shelf models (e.g., ControlNet, LoRA) for customized translation.

Updates

TODO

  • Integrate into Diffusers.
  • Integrate FreeU into Rerender
  • Add Inference instructions in README.md.
  • Add Examples to webUI.
  • Add optional poisson fusion to the pipeline.
  • Add Installation instructions for Windows

Installation

Please make sure your installation path contains only English letters or _

  1. Clone the repository. (Don't forget --recursive. If you cloned without it, run git submodule update --init --recursive afterwards.)
git clone git@github.com:williamyang1991/Rerender_A_Video.git --recursive
cd Rerender_A_Video
  2. If you have already installed PyTorch with CUDA support, you can simply set up the environment with pip.
pip install -r requirements.txt

You can also create a new conda environment from scratch.

conda env create -f environment.yml
conda activate rerender

24GB VRAM is required. Please refer to #23 (comment) to reduce memory consumption.

  3. Run the installation script. The required models will be downloaded to ./models.
python install.py
  4. You can run the demo with rerender.py
python rerender.py --cfg config/real2sculpture.json

Installation on Windows

Before running steps 1-4 above, you need to prepare the following:

  1. Install CUDA
  2. Install git
  3. Install VS with Windows 10/11 SDK (for building deps/ebsynth/bin/ebsynth.exe)
  4. More information is available here. If building ebsynth fails, we provide our compiled ebsynth.

🔥🔥🔥 Installation or Running Fails? 🔥🔥🔥
  1. If building ebsynth fails, we provide our compiled ebsynth.
  2. FileNotFoundError: [Errno 2] No such file or directory: 'xxxx.bin' or 'xxxx.jpg':
    • make sure your path only contains English letters or _ (#18 (comment))
    • find the command python video_blend.py ... in the error log and run it manually to execute the ebsynth part, which is more stable than the WebUI.
    • if some non-keyframes are generated but others are not (rather than all non-keyframes missing from '/out_xx/'), you may refer to #38 (comment)
    • Enable execute permission on deps/ebsynth/bin/ebsynth
    • Enable the debug log to find more information by changing the following flag to True:
      OPEN_EBSYNTH_LOG = False
  3. KeyError: 'dataset': upgrade Gradio to the latest version (#14 (comment), AUTOMATIC1111/stable-diffusion-webui#11855)
  4. Error when processing videos: manually install ffmpeg (#19 (comment), #29 (comment))
  5. ERR_ADDRESS_INVALID Cannot open the webUI in browser: replace 0.0.0.0 with 127.0.0.1 in webUI.py (#19 (comment))
  6. CUDA out of memory:
    • Use xformers (#23 (comment))
    • Set "use_limit_device_resolution" to true in the config to resize the video according to your VRAM (#79). An example config config/van_gogh_man_dynamic_resolution.json is provided.
  7. AttributeError: module 'keras.backend' has no attribute 'is_tensor': update einops (#26 (comment))
  8. IndexError: list index out of range: use the original DDIM steps of 20 (#30 (comment))
  9. One-click installation #99
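
Several of the failures above can be checked before launching a run. The following is a minimal sketch of such a pre-flight check, not part of the repository: the function name preflight is hypothetical, the non-ASCII test is only a loose stand-in for the "English letters or _" advice, and the binary path deps/ebsynth/bin/ebsynth is the Linux path from the list above (on Windows the binary is deps/ebsynth/bin/ebsynth.exe).

```python
import os
import shutil
from pathlib import Path


def preflight(repo_root):
    """Return a list of human-readable problems found under repo_root.

    Hypothetical helper: it only covers three items from the
    troubleshooting list (path characters, ffmpeg, ebsynth permission).
    """
    root = Path(repo_root).resolve()
    problems = []

    # Non-ASCII characters are one way to violate the "English letters
    # or _" advice; this check is deliberately loose (an assumption).
    if not str(root).isascii():
        problems.append(f"non-ASCII characters in path: {root}")

    # Video processing needs ffmpeg to be installed and on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # The ebsynth binary must exist and have execute permission.
    ebsynth = root / "deps" / "ebsynth" / "bin" / "ebsynth"
    if not ebsynth.is_file():
        problems.append(f"ebsynth binary missing: {ebsynth}")
    elif not os.access(ebsynth, os.X_OK):
        problems.append(f"ebsynth is not executable: {ebsynth}")

    return problems


if __name__ == "__main__":
    for problem in preflight("."):
        print("WARNING:", problem)
```

Run it from the repository root; an empty result only means that these three checks passed, not that the installation is complete.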

(1) Inference

WebUI (recommended)

python webUI.py

The Gradio app also allows you to flexibly change the inference options. Just try it for more details. (For WebUI, you need to download revAnimated_v11 and realisticVisionV20_v20 to ./models/ after Installation)

Upload your video, input the prompt, select the seed, and hit:

  • Run 1st Key Frame: only translate the first frame, so you can adjust the prompts/models/parameters to find your ideal output appearance before running the whole video.
  • Run Key Frames: translate all the key frames based on the settings of the first frame, so you can adjust the temporal-related parameters for better temporal consistency before running the whole video.
  • Run Propagation: propagate the key frames to other frames for full video translation
  • Run All: Run 1st Key Frame, Run Key Frames and Run Propagation

We provide many advanced options to play with.

Using customized models
  • Using LoRA/Dreambooth/Finetuned/Mixed SD models
    • Modify sd_model_cfg.py to add paths to the saved SD models
    • How to use LoRA: #39 (comment)
  • Using other controls from ControlNet (e.g., Depth, Pose)
Advanced options for the 1st frame translation
  1. Resolution related (Frame resolution, left/top/right/bottom crop length): crop the frame and resize its short side to 512.
  2. ControlNet related:
    • ControlNet strength: how well the output matches the input control edges
    • Control type: HED edge or Canny edge
    • Canny low/high threshold: low values for more edge details
  3. SDEdit related:
    • Denoising strength: repaint degree (low value to make the output look more like the original video)
    • Preserve color: preserve the color of the original video
  4. SD related:
    • Steps: number of denoising steps
    • CFG scale: how well the output matches the prompt
    • Base model: base Stable Diffusion model (SD 1.5)
    • Added prompt/Negative prompt: supplementary prompts
  5. FreeU related:
    • FreeU first/second-stage backbone factor: =1 does nothing; >1 enhances output color and details
    • FreeU first/second-stage skip factor: =1 does nothing; <1 enhances output color and details
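
To make the resolution option in item 1 concrete, here is a small sketch of the arithmetic "crop the frame and resize its short side to 512". The function name and its parameters are hypothetical, and the actual code may additionally snap the size to a multiple required by the model, which is not modeled here.

```python
def cropped_resized_size(width, height, left=0, top=0, right=0, bottom=0,
                         short_side=512):
    """Return (width, height) after cropping and short-side resizing.

    Illustrative only: plain rounding is an assumption, not the
    repository's exact resizing rule.
    """
    crop_w = width - left - right
    crop_h = height - top - bottom
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("crop lengths remove the whole frame")
    # Scale so that the shorter cropped side becomes short_side.
    scale = short_side / min(crop_w, crop_h)
    return round(crop_w * scale), round(crop_h * scale)


# A 540x960 portrait video with no cropping.
print(cropped_resized_size(540, 960))  # (512, 910)
```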
Advanced options for the key frame translation
  1. Key frame related
    • Key frame frequency (K): Uniformly sample the key frame every K frames. Small value for large or fast motions.
    • Number of key frames (M): The final output video will have K*M+1 frames with M+1 key frames.
  2. Temporal consistency related
    • Cross-frame attention:
      • Cross-frame attention start/end: When to apply cross-frame attention for global style consistency
      • Cross-frame attention update frequency (N): Update the reference style frame every N key frames. Should be large for long videos to avoid error accumulation.
      • Loose Cross-frame attention: Using cross-frame attention in fewer layers to better match the input video (for video with large motions)
    • Shape-aware fusion: Check to use this feature
      • Shape-aware fusion start/end: When to apply shape-aware fusion for local shape consistency
    • Pixel-aware fusion: Check to use this feature
      • Pixel-aware fusion start/end: When to apply pixel-aware fusion for pixel-level temporal consistency
      • Pixel-aware fusion strength: The strength to preserve the non-inpainting region. Small to avoid error accumulation. Large to avoid blurry textures.
      • Pixel-aware fusion detail level: The strength to sharpen the inpainting region. Small to avoid error accumulation. Large to avoid blurry textures.
      • Smooth fusion boundary: Check to smooth the inpainting boundary (avoid error accumulation).
    • Color-aware AdaIN: Check to use this feature
      • Color-aware AdaIN start/end: When to apply AdaIN to make the video color consistent with the first frame
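
The relation between K, M and the output length stated above (K*M+1 frames with M+1 key frames) can be sketched as follows. The function name is hypothetical, and the zero-based frame indexing is an assumption made for illustration; the scripts may number frames differently.

```python
def key_frame_layout(k, m, first_frame=0):
    """Return (key frame indices, total frame count) for interval k and m intervals.

    Key frames are sampled uniformly every k frames, giving m + 1 key
    frames and k * m + 1 frames in total.
    """
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    key_frames = [first_frame + i * k for i in range(m + 1)]
    total_frames = k * m + 1
    return key_frames, total_frames


keys, total = key_frame_layout(k=10, m=10)
print(len(keys), total)  # 11 101
```

With K=10 and M=10, the last key frame coincides with the last output frame, so every non-key frame lies between two key frames.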
Advanced options for the full video translation
  1. Gradient blending: apply Poisson Blending to reduce ghosting artifacts. May slow the process and increase flickering.
  2. Number of parallel processes: multiprocessing to speed up the process. Large value (8) is recommended.

Command Line

We also provide a flexible script rerender.py to run our method.

Simple mode

Set the options via command line. For example,

python rerender.py --input videos/pexels-antoni-shkraba-8048492-540x960-25fps.mp4 --output result/man/man.mp4 --prompt "a handsome man in van gogh painting"

The script will run the full pipeline. A work directory will be created at result/man, and the result video will be saved as result/man/man.mp4.

Advanced mode

Set the options via a config file. For example,

python rerender.py --cfg config/van_gogh_man.json

The script will run the full pipeline. We provide some example configs in the config directory. Most options in the config are the same as those in the WebUI. Please check the explanations in the WebUI section.

Specify a customized model by setting sd_model in the config. For example:

{
  "sd_model": "models/realisticVisionV20_v20.safetensors"
}
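
If you want to try several base models with one set of options, a derived config can be generated instead of edited by hand. This is a minimal sketch that assumes the config is a plain JSON object with a top-level sd_model key, as in the fragment above; the helper name and the output path config/my_run.json are hypothetical.

```python
import json
from pathlib import Path


def write_config_with_model(base_cfg, out_cfg, sd_model):
    """Copy base_cfg to out_cfg with the "sd_model" entry replaced.

    Hypothetical helper: every other option is carried over unchanged.
    """
    cfg = json.loads(Path(base_cfg).read_text(encoding="utf-8"))
    cfg["sd_model"] = sd_model
    Path(out_cfg).write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg


# Example call (not executed here because it needs the repository files):
# write_config_with_model(
#     "config/van_gogh_man.json",
#     "config/my_run.json",  # hypothetical output path
#     "models/realisticVisionV20_v20.safetensors",
# )
```

The generated file can then be passed to rerender.py with --cfg.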

Customize the pipeline

Similar to the WebUI, we provide a three-step workflow: rerender the first key frame, then rerender all key frames, and finally rerender the full video with propagation. To run only a single step, specify the options -one, -nb and -nr:

  1. Rerender the first key frame
python rerender.py --cfg config/van_gogh_man.json -one -nb
  2. Rerender the full key frames
python rerender.py --cfg config/van_gogh_man.json -nb
  3. Rerender the full video with propagation
python rerender.py --cfg config/van_gogh_man.json -nr
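
The three commands above can also be driven from Python, for example to inspect the first key frame before committing to the rest. This is a sketch under the assumption that it is run from the repository root; the stage names and helper functions are hypothetical, while the flags are the ones listed above.

```python
import subprocess
import sys

# Flags per stage, taken from the three commands above.
STAGE_FLAGS = {
    "first_key_frame": ["-one", "-nb"],
    "key_frames": ["-nb"],
    "propagation": ["-nr"],
}


def stage_command(cfg_path, stage):
    """Build the rerender.py command line for one stage (hypothetical helper)."""
    return [sys.executable, "rerender.py", "--cfg", cfg_path, *STAGE_FLAGS[stage]]


def run_stages(cfg_path, stages=("first_key_frame", "key_frames", "propagation")):
    """Run the selected stages in order and stop at the first failure."""
    for stage in stages:
        subprocess.run(stage_command(cfg_path, stage), check=True)


print(stage_command("config/van_gogh_man.json", "key_frames")[1:])
```

Calling run_stages("config/van_gogh_man.json", stages=("first_key_frame",)) would run only the first step.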

Our Ebsynth implementation

We provide a separate Ebsynth python script video_blend.py with the temporal blending algorithm introduced in Stylizing Video by Example for interpolating style between key frames. It can work on your own stylized key frames independently of our Rerender algorithm.

Usage:

video_blend.py [-h] [--output OUTPUT] [--fps FPS] [--beg BEG] [--end END] [--itv ITV] [--key KEY]
                      [--n_proc N_PROC] [-ps] [-ne] [-tmp]
                      name

positional arguments:
  name             Path to input video

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  Path to output video
  --fps FPS        The FPS of output video
  --beg BEG        The index of the first frame to be stylized
  --end END        The index of the last frame to be stylized
  --itv ITV        The interval of key frame
  --key KEY        The subfolder name of stylized key frames
  --n_proc N_PROC  The max process count
  -ps              Use poisson gradient blending
  -ne              Do not run ebsynth (use previous ebsynth output)
  -tmp             Keep temporary output

For example, to run Ebsynth on video man.mp4,

  1. Put the stylized key frames, one for every 10 frames, in videos/man/keys (named as 0001.png, 0011.png, ...)
  2. Put the original video frames in videos/man/video (named as 0001.png, 0002.png, ...).
  3. Run Ebsynth on the first 101 frames of the video with poisson gradient blending and save the result to videos/man/blend.mp4 under FPS 25 with the following command:
python video_blend.py videos/man \
  --beg 1 \
  --end 101 \
  --itv 10 \
  --key keys \
  --output videos/man/blend.mp4 \
  --fps 25.0 \
  -ps
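
Before running the command above, it can help to verify that the key frame folder contains the expected files. The sketch below derives the expected names from --beg, --end and --itv using the four-digit naming shown in the steps above; both helper names are hypothetical, and the assumption that a key frame is expected at every multiple of the interval from --beg up to --end is ours.

```python
from pathlib import Path


def expected_key_names(beg, end, itv, digits=4, ext=".png"):
    """List key frame file names such as 0001.png, 0011.png, ..."""
    return [f"{i:0{digits}d}{ext}" for i in range(beg, end + 1, itv)]


def missing_key_frames(video_dir, beg, end, itv, key="keys"):
    """Return the expected key frame names not found in video_dir/key."""
    key_dir = Path(video_dir) / key
    return [name for name in expected_key_names(beg, end, itv)
            if not (key_dir / name).is_file()]


names = expected_key_names(beg=1, end=101, itv=10)
print(len(names), names[0], names[-1])  # 11 0001.png 0101.png
```

For the example above, missing_key_frames("videos/man", 1, 101, 10) should return an empty list once all eleven stylized key frames are in place.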

(2) Results

Key frame translation

  • white ancient Greek sculpture, Venus de Milo, light pink and blue background
  • a handsome Greek man
  • a traditional mountain in chinese ink wash painting
  • a cartoon tiger
  • a swan in chinese ink wash painting, monochrome
  • a beautiful woman in CG style
  • a clean simple white jade sculpture
  • a fluorescent jellyfish in the deep dark blue sea

Full video translation

Text-guided virtual character generation.

more_result_1.mp4
more_result_2.mp4

Video stylization and video editing.

more_result_3.mp4

New Features

We keep adding new features beyond the conference version.

Loose cross-frame attention

By using cross-frame attention in fewer layers, our results better match the input video, which reduces ghosting artifacts caused by inconsistencies. This feature can be activated by checking Loose Cross-frame attention in the Advanced options for the key frame translation in the WebUI, or by setting loose_cfattn for the script (see config/real2sculpture_loose_cfattn.json).

FreeU

FreeU is a method that improves diffusion model sample quality at no cost. We find that, with FreeU, our results have higher contrast and saturation, richer details, and more vivid colors. This feature can be used by setting the FreeU backbone factors and skip factors in the Advanced options for the 1st frame translation in the WebUI, or by setting freeu_args for the script (see config/real2sculpture_freeu.json).

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yang2023rerender,
 title = {Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation},
 author = {Yang, Shuai and Zhou, Yifan and Liu, Ziwei and Loy, Chen Change},
 booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
 year = {2023},
}

Acknowledgments

The code is mainly developed based on ControlNet, Stable Diffusion, GMFlow and Ebsynth.
