controlnetvideo

Stable Diffusion Video2Video with Feedback

This Python script is a command-line tool for re-rendering videos with Stable Diffusion models, built on the Hugging Face diffusers library and various other open source projects. It attempts to improve frame-to-frame consistency by several methods, primarily motion transfer: dense optical flow estimated on the input is applied to the previous output, which is then fed back using either ControlNets or reference-only attention coupling.

This is clearly not the right way to do this. The models need to be extended to pass temporal information between frames and trained on video datasets, an approach which, in the time since this script's first incarnation, has proven very effective. I maintain this script and the techniques it uses as a curious and aesthetically interesting aside. Enjoy!

by Victor Condino un1tz3r0@gmail.com, Oct 17 2024

New! Features

Supports SDXL Reference Only (ADAIN) (best results) and ControlNet (experimental)
Supports SDXL ControlNets
Music video beat-synced animation
Animation with arbitrary piecewise cubic spline curves
Flux.1 (initial support, works, only canny controlnet supported)

I've added some support for Flux.1 and extensive SDXL support, covering both ControlNet and reference-only control schemes.

Also new is an animation system that can sync parameter modulations to the beat of an audio file. It uses the madmom library to analyze the audio, and lets you specify piecewise cubic Bezier curves that modulate various parameters of the video processing in time with the downbeats detected in the audio track. (Examples coming soon.)
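
As a rough illustration of how such a curve string could be interpreted, the sketch below parses only the linear `L x y` segments of an `--audio-animate` value and interpolates between them. The function names `parse_linear_curves` and `evaluate_curve` are hypothetical, and treating each `L x y` as a keyframe at bar position x (0 is the downbeat, 1 is the next downbeat) is an assumption. The script's own parser, including its handling of `C` segments, may well differ.

```python
from fractions import Fraction


def parse_linear_curves(spec):
    """Parse 'name=L x y L x y ...; name=...' into {name: [(x, y), ...]}.

    Only 'L' (linear) segments are handled in this sketch.
    """
    curves = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, body = part.partition("=")
        tokens = body.split()
        points = []
        for i in range(0, len(tokens), 3):
            if tokens[i] != "L":
                raise ValueError(f"unsupported segment type: {tokens[i]!r}")
            # x may be written as a fraction of a bar, e.g. '1/32'
            x = float(Fraction(tokens[i + 1]))
            y = float(tokens[i + 2])
            points.append((x, y))
        curves[name.strip()] = points
    return curves


def evaluate_curve(points, bar_position):
    """Linearly interpolate the parameter value at a position within the bar."""
    if bar_position <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= bar_position <= x1:
            if x1 == x0:
                return y1
            t = (bar_position - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[-1][1]


curves = parse_linear_curves(
    "feedthrough=L 0 0.8 L 1/32 0.8 L 1/16 0.0 L 1 0.0; "
    "denoise=L 0 0.25 L 1/32 0.25 L 4/16 0.75 L 1 0.50"
)
halfway_feedthrough = evaluate_curve(curves["feedthrough"], 0.5)
```

Under this reading, the feedthrough curve holds 0.8 for the first 1/32 of each bar, ramps down to 0.0 by 1/16, and stays there until the next downbeat.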

Examples of `--audio-animate ...` and `--audio-from ...`

controlnetvideo.py \
	~/Downloads/PXL_20240922_094238715.TS.mp4 \
	outw.mp4 \
	--prompt "by takashi murakami" \
	--dump-frames progress.png \
	--show-input \
	--show-output \
	--show-motion \
	--color-info \
	--motion-sigma 3.0 \
	--motion-alpha 0.5 \
	--color-fix none \
	--feedthrough-strength 0.00 \
	--init-image-strength 0.35 \
	--controlnet refxl \
	--swap-images \
	--audio-from "laundry shuffle short vers 2024-10-17 0433.flac" \
	--audio-animate "feedthrough=L 0 0.8 L 1/32 0.8 L 1/16 0.0 L 1 0.0; \
									 denoise=L 0 0.25 L 1/32 0.25 L 4/16 0.75 L 1 0.50; \
									 guidance=L 0 3.0 L 1/16 3.0 L 2/16 9.0 L 1 7.0"

Examples of `--controlnet refxl` mode

./venv/bin/python3 controlnetvideo.py examples/PXL_20240827_063831973.TS.mp4 examples/outh-2.mp4 --prompt kowloon\ walled\ city\ manifold\ garden\ pixel\ perfect\ anton\ fadeev\ studio\ ghibli\ miyazaki\ city\ streets --dump-frames progress.png --show-input --show-output --show-motion --color-info --motion-sigma 0.1 --motion-alpha 0.1 --color-fix none --feedthrough-strength 0.08 --swap-images --init-image-strength 0.60 --controlnet refxl

./venv/bin/python3 controlnetvideo.py examples/PXL_20240827_063831973.TS.mp4 examples/outh-3.mp4 --prompt manifold\ garden\ cityscape\ billowing\ clouds\ of\ thick\ clored\ smoke\ anton\ fadeev\ studio\ ghibli\ miyazaki\ city\ streets --dump-frames progress.png --show-input --show-output --show-motion --color-info --motion-sigma 0.1 --motion-alpha 0.1 --color-fix none --feedthrough-strength 0.2 --swap-images --init-image-strength 0.53 --controlnet refxl

These videos were made with the --controlnet refxl option, which is an implementation of reference-only control for SDXL img2img. The effects are interesting, but this mode needs more experimentation.
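
The feature list above mentions AdaIN in connection with reference-only mode. As general background only, adaptive instance normalization rescales one set of features so that its mean and standard deviation match those of a reference. The sketch below shows that textbook operation on a flat list of numbers. The function name `adain` and the `eps` value are illustrative assumptions, and this is not taken from the script, which works on feature tensors inside the diffusion model.

```python
from statistics import fmean, pstdev


def adain(content, reference, eps=1e-5):
    """Shift and scale `content` so its mean/std match those of `reference`."""
    c_mean, c_std = fmean(content), pstdev(content)
    r_mean, r_std = fmean(reference), pstdev(reference)
    # normalize the content statistics, then apply the reference statistics
    return [(x - c_mean) / (c_std + eps) * r_std + r_mean for x in content]


restyled = adain([0.0, 1.0, 2.0], [10.0, 20.0, 30.0])
```

The output keeps the relative ordering of the content values while taking on the reference's overall level and spread.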

Installation

Pre-requisites

First, clone this repo using git

git clone https://github.com/un1tz3r0/controlnetvideo.git
cd controlnetvideo

You may wish to set up a venv. This is not strictly necessary, but if you skip this step and install into user or system-wide packages, you do so at your own risk: it may break other projects whose dependencies are out of date.

python3 -m venv venv
source venv/bin/activate

Dependencies

Now, install the dependencies using pip3:

pip3 install -r requirements.txt

You should now be ready to run the script and process video files. If you are having trouble getting it working, open an issue or reach out on twitter or discord...

Example 1

To process a video using Stable Diffusion 2.1 and a ControlNet trained for depth-to-image generation:

python3 controlnetvideo.py \
	PXL_20230422_013745844.TS.mp4 \
	--controlnet depth21 \
	--prompt 'graffuturism colorful intricate heavy detailed outlines' \
	--prompt-strength 9 \
	--show-input \
	--show-detector \
	--show-motion \
	--dump-frames '{instem}_frames/{n:08d}.png' \
	--init-image-strength 0.4 \
	--color-amount 0.3 \
	--feedthrough-strength 0.001 \
	--show-output \
	--num-inference-steps 15 \
	--duration 60.0 \
	--start-time 10.0 \
	--skip-dumped-frames \
	'{instem}_out.mp4'

This will process the file PXL_20230422_013745844.TS.mp4, starting at 10 seconds for a duration of 60 seconds. Each input frame first goes through some preprocessing (motion transfer/compensation of the output feedback), followed by a detector and a diffusion model in a pipeline configured by the --controlnet option. Here we are using depth21, which selects the MiDaS depth estimator as the detector, the Stable Diffusion 2.1 model, and the matching pretrained ControlNet model, in this case courtesy of thibaud. The pipeline runs for 15 steps on the first frame and (1.0-0.4)*15 => 9 steps on the remaining frames, because img2img skips initial denoising steps according to the init-image strength. The diffusion pipeline is run with the prompt 'graffuturism colorful intricate heavy detailed outlines' with a guidance strength of 9, and full controlnet influence.
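
The step arithmetic described above can be sketched as a small helper. The function name `denoising_steps` is hypothetical, and rounding to the nearest integer is an assumption; the script may truncate or round differently for values that do not divide evenly.

```python
def denoising_steps(num_inference_steps, init_image_strength, first_frame):
    """Estimate how many denoising steps actually run for a frame."""
    if first_frame:
        # no previous output to start from, so the full schedule runs
        return num_inference_steps
    # img2img skips the initial steps in proportion to the init-image strength
    return round((1.0 - init_image_strength) * num_inference_steps)


first = denoising_steps(15, 0.4, first_frame=True)
later = denoising_steps(15, 0.4, first_frame=False)
```

With the values from Example 1 this gives 15 steps for the first frame and 9 for each later frame, so raising --init-image-strength also shortens the per-frame processing time.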

During processing, it will show the input, the detector output, the motion estimate, and the output of each frame by writing them to numbered .png image files in the directory PXL_20230422_013745844.TS_frames/, which will be created if it does not exist. If you just want a single image file that you can watch in a viewer which auto-refreshes when the file changes on disk, pass --dump-frames a filename without a {n} substitution, which causes it to continually overwrite the same file. This is useful for watching the progress of the video processing in real time.
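
To make the placeholder behaviour concrete, here is a minimal sketch of how a --dump-frames template could expand, using Python's standard `str.format`. The helper name `expand_dump_path` is hypothetical, and it assumes `{instem}` is the input filename without its final extension and `{n}` is the frame number; the script may support other placeholders too.

```python
from pathlib import Path


def expand_dump_path(template, input_video, frame_number):
    """Fill in the {instem} and {n} placeholders of a dump-frames template."""
    instem = Path(input_video).stem  # drops only the last suffix (.mp4)
    return template.format(instem=instem, n=frame_number)


numbered = expand_dump_path(
    "{instem}_frames/{n:08d}.png", "PXL_20230422_013745844.TS.mp4", 42
)
single = expand_dump_path("progress.png", "PXL_20230422_013745844.TS.mp4", 42)
```

A template without `{n}` expands to the same path for every frame, which is why the file is overwritten each time.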

Finally, it will also encode and write the output to a video file PXL_20230422_013745844.TS_out.mp4.

PXL_20230422_013745844.TSb_out.mp4

Example 2

Here's another example, this time with a different input video, a different prompt, and different parameters:

python3 controlnetvideo.py \
        PXL_20230419_205030311.TS.mp4 \
        --controlnet depth21 \
        --prompt 'mirrorverse colorful intricate heavy detailed outlines' \
        --prompt-strength 10 \
        --show-input \
        --show-detector \
        --show-motion \
        --dump-frames '{instem}_frames/{n:08d}.png' \
        --init-image-strength 0.525 \
        --color-amount 0.2 \
        --feedthrough-strength 0.0001 \
        --show-output \
        --num-inference-steps 16 \
        --skip-dumped-frames \
        --start-time 0.0 \
        '{instem}_out.mp4'

The frames dumped will look like this:

And the resulting output video:

PXL_20230419_205030311.TS_f_out.mp4

Tips

If your video comes out squashed or with the wrong aspect ratio, try --no-fix-orientation or --fix-orientation. You can also adjust the scaling using --max-dimension, --min-dimension and --round-dims-to, although the defaults should be sane and just work with most sources.
Feedback strength, set with --init-image-strength, controls frame-to-frame consistency by changing how much the motion-compensated previous output frame is fed into the next frame's diffusion pipeline in place of initial latent noise, a la img2img latent diffusion (citation needed). Values around 0.3 to 0.5 tend to work, and sometimes much higher values do too (closer to 1.0, the maximum, which means no noise is added and no denoising steps are run).
See --help (below) for more options. There are many features not covered here, such as:
- detector kwargs
- original input frame feedthrough strength
- motion estimate spatiotemporal smoothing (crude, simple exponential and gaussian filters, but it can help in some situations give better results if the motion estimate is noisy)
- color drift correction, and more.
- more things I forgot to mention
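
Two of the options listed above boil down to simple weighted averages. The sketch below is one plausible reading of the help text for --motion-alpha and --feedthrough-strength, using flat lists of numbers in place of real images and flow fields. The function names and the exact blend directions are assumptions, not taken from the script.

```python
def smooth_motion(previous_smoothed, current, alpha):
    """Exponential smoothing over time: alpha=0.0 keeps only the current estimate."""
    return [alpha * p + (1.0 - alpha) * c
            for p, c in zip(previous_smoothed, current)]


def blend_feedthrough(input_frame, warped_previous_output, strength):
    """Mix the raw input frame into the motion-compensated previous output."""
    return [strength * i + (1.0 - strength) * w
            for i, w in zip(input_frame, warped_previous_output)]


flow = smooth_motion([1.0, 2.0], [3.0, 4.0], alpha=0.5)
init = blend_feedthrough([10.0, 10.0], [20.0, 20.0], strength=0.0)
```

A higher alpha makes the motion estimate respond more slowly to change, which can help with noisy flow at the cost of lagging behind fast movement.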

If there is interest, I will write up a more detailed guide to the options and how to use them.

Usage

Usage: controlnetvideo.py [OPTIONS] INPUT_VIDEO OUTPUT_VIDEO

Options:
  --overwrite / --no-overwrite    don't overwrite existing output file -- add
                                  a numeric suffix to get a unique filename.
                                  default: --no-overwrite.
  --start-time FLOAT              start time in seconds
  --end-time FLOAT                end time in seconds
  --duration FLOAT                duration in seconds
  --output-bitrate TEXT           output bitrate for the video, e.g. '16M'
  --output-codec TEXT             output codec for the video, e.g. 'libx264'
  --max-dimension INTEGER         maximum dimension of the video
  --min-dimension INTEGER         minimum dimension of the video
  --round-dims-to INTEGER         round the dimensions to the nearest multiple
                                  of this number
  --fix-orientation / --no-fix-orientation
                                  resize videos shot in portrait mode on some
                                  devices to fix incorrect aspect ratio bug
  --no-audio                      don't include audio in the output video,
                                  even if the input video has audio
  --audio-from PATH               audio file to use for the output video,
                                  replaces the audio from the input video,
                                  will be truncated to duration of input or
                                  --duration if given. tempo and bars are
                                  analyzed and can be used to drive animation
                                  with the --audio-animate parameter.
  --audio-offset FLOAT            offset in seconds to start the audio from,
                                  when used with --audio-from
  --audio-animate TEXT            specify parameters and curves which should
                                  be animated according to the rhythm
                                  information detected in the soundtrack.
                                  format is: 'name=L x y C x y a b c d ...;
                                  name=...; ...', where name is an animatable
                                  parameter, L is a linear transition, and C
                                  is a cubic bezier curve, x is the position
                                  within a bar (four beats, starting on the
                                  downbeat) of the audio, and y is the
                                  parameter value at that point.
  --prompt TEXT                   prompt used to guide the denoising process
  --negative-prompt TEXT          negative prompt, can be used to prevent the
                                  model from generating certain words
  --prompt-strength FLOAT         how much influence the prompt has on the
                                  output
  --num-inference-steps, --steps INTEGER
                                  number of inference steps, depends on the
                                  scheduler, trades off speed for quality.
                                  20-50 is a good range from fastest to best.
  --controlnet [refxl|fluxcanny|depthxl|aesthetic|lineart21|hed|hed21|canny|canny21|openpose|openpose21|depth|depth21|normal|mlsd]
                                  which pretrained model and controlnet type
                                  to use. the default, depthxl, uses the dpt
                                  depth estimator and controlnet with the sdxl
                                  base model
  --controlnet-strength FLOAT     how much influence the controlnet
                                  annotator's output is used to guide the
                                  denoising process
  --init-image-strength FLOAT     the init-image strength, or how much of the
                                  prompt-guided denoising process to skip in
                                  favor of starting with an existing image
  --feedthrough-strength FLOAT    the ratio of input to motion compensated
                                  prior output to feed through to the next
                                  frame
  --motion-alpha FLOAT            smooth the motion vectors over time, 0.0 is
                                  no smoothing, 1.0 is maximum smoothing
  --motion-sigma FLOAT            smooth the motion estimate spatially, 0.0 is
                                  no smoothing, used as sigma for gaussian
                                  blur
  --show-detector / --no-show-detector
                                  show the controlnet detector output
  --show-input / --no-show-input  show the input frame
  --show-output / --no-show-output
                                  show the output frame
  --show-motion / --no-show-motion
                                  show the motion transfer (not implemented
                                  yet)
  --dump-frames PATH              write intermediate frame images to a
                                  file/files during processing to visualise
                                  progress. may contain various {}
                                  placeholders
  --skip-dumped-frames            read dumped frames from a previous run
                                  instead of processing the input video
  --dump-video                    write intermediate dump images to the final
                                  video instead of just the final output image
  --color-fix [none|rgb|hsv|lab]  prevent color from drifting due to feedback
                                  and model bias by fixing the histogram to
                                  the first frame. specify colorspace for
                                  histogram matching, e.g. 'rgb' or 'hsv' or
                                  'lab', or 'none' to disable.
  --color-amount FLOAT            blend between the original color and the
                                  color matched version, 0.0-1.0
  --color-info                    print extra stats about the color content of
                                  the output to help debug color drift issues
  --canny-low-thr FLOAT           canny edge detector lower threshold
  --canny-high-thr FLOAT          canny edge detector higher threshold
  --mlsd-score-thr FLOAT          mlsd line detector v threshold
  --mlsd-dist-thr FLOAT           mlsd line detector d threshold
  --swap-images                   Switch the init and reference images when
                                  using reference-only controlnet
  --help                          Show this message and exit.
