This Python script is a command-line tool for re-rendering videos with Stable Diffusion models, built on the Hugging Face diffusers library and various other open-source projects. It attempts to address the problem of frame-to-frame consistency with several methods, primarily motion transfer: dense optical flow computed from the input is used to warp the previous output frame, which is fed back into the pipeline either through ControlNets or reference-only attention coupling.
This is not the right way to do this; clearly the models need to be extended to pass temporal information between frames and be trained on video datasets, an approach which, in the time since this script's first incarnation, has proven very effective. I maintain this script and the techniques it uses as a curious and aesthetically interesting aside. Enjoy!
by Victor Condino un1tz3r0@gmail.com, Oct 17 2024
- Supports SDXL Reference Only (ADAIN) (best results) and ControlNet (experimental)
- Supports SDXL ControlNets
- Music video beat-synced animation
- Animation with arbitrary piecewise cubic spline curves
- Flux.1 (initial support; works, but only the Canny ControlNet is supported)
I've added initial support for Flux.1 and extensive SDXL support, covering both ControlNet and reference-only control schemes.
Also new is an animation system that can sync parameter modulations to the beat of an audio file. It uses the madmom library to analyze the audio, and lets you specify piecewise cubic Bézier curves that modulate various parameters of the video processing in time with the downbeats detected in the audio track. (Examples coming soon.)
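To make the mechanics concrete, here is a minimal, illustrative sketch (not the script's actual internals) of the two pieces involved: detecting downbeats with madmom, and evaluating a piecewise-linear parameter curve (the `L x y` segments accepted by --audio-animate) at a position within a bar. The function names and the hard-coded curve below are assumptions for the example only.

```python
import numpy as np
from madmom.features.downbeats import RNNDownBeatProcessor, DBNDownBeatTrackingProcessor

def detect_downbeats(audio_path, beats_per_bar=4):
    """Return the times (seconds) of the downbeats (beat 1 of each bar)."""
    activations = RNNDownBeatProcessor()(audio_path)
    tracker = DBNDownBeatTrackingProcessor(beats_per_bar=[beats_per_bar], fps=100)
    beats = tracker(activations)            # rows of (time, beat number within the bar)
    return beats[beats[:, 1] == 1][:, 0]    # keep only the downbeats

def eval_linear_curve(points, x):
    """points: [(x0, y0), (x1, y1), ...] taken from 'L x y' segments; x is the
    position within the current bar, 0.0 at the downbeat and 1.0 at the next."""
    xs, ys = zip(*points)
    return float(np.interp(x, xs, ys))

# 'denoise=L 0 0.25 L 1/32 0.25 L 4/16 0.75 L 1 0.50' corresponds to:
denoise_curve = [(0.0, 0.25), (1 / 32, 0.25), (4 / 16, 0.75), (1.0, 0.50)]
print(eval_linear_curve(denoise_curve, 0.125))  # value one eighth of the way through a bar
```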
controlnetvideo.py \
~/Downloads/PXL_20240922_094238715.TS.mp4 \
outw.mp4 \
--prompt "by takashi murakami" \
--dump-frames progress.png \
--show-input \
--show-output \
--show-motion \
--color-info \
--motion-sigma 3.0 \
--motion-alpha 0.5 \
--color-fix none \
--feedthrough-strength 0.00 \
--init-image-strength 0.35 \
--controlnet refxl \
--swap-images \
--audio-from "laundry shuffle short vers 2024-10-17 0433.flac" \
--audio-animate "feedthrough=L 0 0.8 L 1/32 0.8 L 1/16 0.0 L 1 0.0; \
denoise=L 0 0.25 L 1/32 0.25 L 4/16 0.75 L 1 0.50; \
guidance=L 0 3.0 L 1/16 3.0 L 2/16 9.0 L 1 7.0"
./venv/bin/python3 controlnetvideo.py examples/PXL_20240827_063831973.TS.mp4 examples/outh-2.mp4 --prompt kowloon\ walled\ city\ manifold\ garden\ pixel\ perfect\ anton\ fadeev\ studio\ ghibli\ miyazaki\ city\ streets --dump-frames progress.png --show-input --show-output --show-motion --color-info --motion-sigma 0.1 --motion-alpha 0.1 --color-fix none --feedthrough-strength 0.08 --swap-images --init-image-strength 0.60 --controlnet refxl
./venv/bin/python3 controlnetvideo.py examples/PXL_20240827_063831973.TS.mp4 examples/outh-3.mp4 --prompt manifold\ garden\ cityscape\ billowing\ clouds\ of\ thick\ clored\ smoke\ anton\ fadeev\ studio\ ghibli\ miyazaki\ city\ streets --dump-frames progress.png --show-input --show-output --show-motion --color-info --motion-sigma 0.1 --motion-alpha 0.1 --color-fix none --feedthrough-strength 0.2 --swap-images --init-image-strength 0.53 --controlnet refxl
These videos were made with the --controlnet refxl
option, which is an implementation of reference-only control for SDXL img2img. The effects are interesting and need more experimentation.
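For context, the AdaIN part of reference-only coupling boils down to a per-channel statistics swap: keep the current frame's spatial structure but borrow the reference frame's feature statistics. A conceptual sketch follows (the script's real implementation hooks into the pipeline's attention layers, which this does not show):

```python
import torch

def adain(content: torch.Tensor, reference: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization over (B, C, H, W) feature maps:
    normalize the content features, then rescale them with the reference's
    per-channel mean and standard deviation."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    r_mean = reference.mean(dim=(2, 3), keepdim=True)
    r_std = reference.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * r_std + r_mean
```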
First, clone this repo using git
git clone https://github.com/un1tz3r0/controlnetvideo.git
cd controlnetvideo
You may wish to set up a venv. This is not strictly necessary, but if you skip this step you will be using user- or system-wide packages at your own risk, which may break other projects whose dependencies are out of date.
python3 -m venv venv
source venv/bin/activate
Now, install the dependencies using pip3:
pip3 install -r requirements.txt
You should now be ready to run the script and process video files. If you have trouble getting it working, open an issue or reach out on Twitter or Discord...
To process a video using Stable Diffusion 2.1 and a ControlNet trained for depth-to-image generation:
python3 controlnetvideo.py \
PXL_20230422_013745844.TS.mp4 \
--controlnet depth21 \
--prompt 'graffuturism colorful intricate heavy detailed outlines' \
--prompt-strength 9 \
--show-input \
--show-detector \
--show-motion \
--dump-frames '{instem}_frames/{n:08d}.png' \
--init-image-strength 0.4 \
--color-amount 0.3 \
--feedthrough-strength 0.001 \
--show-output \
--num-inference-steps 15 \
--duration 60.0 \
--start-time 10.0 \
--skip-dumped-frames \
'{instem}_out.mp4'
This will process the file PXL_20230422_013745844.TS.mp4, starting at 10 seconds and continuing for a duration of 60 seconds. Each input frame gets some preprocessing (motion transfer/compensation of the output feedback), followed by a detector and diffusion models in a pipeline configured by the --controlnet option. Here we are using depth21, which selects the MiDaS depth estimator as the detector, the Stable Diffusion 2.1 model, and the matching pretrained ControlNet (in this case courtesy of thibaud). The pipeline runs 15 steps for the first frame and (1.0 - 0.4) * 15 => 9 steps for the remaining frames, because img2img skips the initial denoising steps according to the init-image strength. The diffusion pipeline is run with the prompt 'graffuturism colorful intricate heavy detailed outlines', with a guidance strength of 9 and full ControlNet influence.
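For reference, the per-frame call is roughly equivalent to the diffusers sketch below. The model repo IDs, the placeholder input images, and the exact wiring are assumptions for illustration; the script's actual setup lives in its --controlnet handling.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Placeholder inputs standing in for the script's motion-compensated feedback
# frame and the MiDaS depth estimate of the current input frame.
warped_previous_output = Image.open("previous_output_warped.png")
depth_map = Image.open("depth_estimate.png")

controlnet = ControlNetModel.from_pretrained(
    "thibaud/controlnet-sd21-depth-diffusers", torch_dtype=torch.float16)  # assumed repo id
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

init_image_strength = 0.4
result = pipe(
    prompt="graffuturism colorful intricate heavy detailed outlines",
    image=warped_previous_output,
    control_image=depth_map,
    strength=1.0 - init_image_strength,   # img2img runs int(15 * 0.6) == 9 denoising steps
    num_inference_steps=15,
    guidance_scale=9.0,
    controlnet_conditioning_scale=1.0,
).images[0]
```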
During processing, it will show the input, the detector output, the motion estimate, and the output for each frame by writing them to numbered .png image files in a directory PXL_20230422_013745844.TS_frames/, which will be created if it does not exist. If you just want a single image file that you can watch with a viewer which auto-refreshes when the file changes on disk, specify a filename to --dump-frames without an {n} substitution, so that the same file is continually overwritten. This is useful for watching the progress of the video processing in real time.
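The placeholders use ordinary Python str.format substitution, so the numbered-frame template from the example expands like this (check --help for the full set of placeholders the script supports):

```python
template = "{instem}_frames/{n:08d}.png"
print(template.format(instem="PXL_20230422_013745844.TS", n=42))
# -> PXL_20230422_013745844.TS_frames/00000042.png
```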
Finally, it will also encode and write the output to a video file PXL_20230422_013745844.TS_out.mp4.
PXL_20230422_013745844.TSb_out.mp4
Here's another example, this time with a different prompt and different parameters:
python3 controlnetvideo.py \
PXL_20230419_205030311.TS.mp4 \
--controlnet depth21 \
--prompt 'mirrorverse colorful intricate heavy detailed outlines' \
--prompt-strength 10 \
--show-input \
--show-detector \
--show-motion \
--dump-frames '{instem}_frames/{n:08d}.png' \
--init-image-strength 0.525 \
--color-amount 0.2 \
--feedthrough-strength 0.0001 \
--show-output \
--num-inference-steps 16 \
--skip-dumped-frames \
--start-time 0.0 \
'{instem}_out.mp4'
The frames dumped will look like this:
And the resulting output video:
PXL_20230419_205030311.TS_f_out.mp4
- If your video comes out squashed or at the wrong aspect ratio, try --no-fix-orientation or --fix-orientation. You can also adjust the scaling with --max-dimension, --min-dimension, and --round-dims-to, although these have sane defaults that should just work with most sources.
- Feedback strength, set with --init-image-strength, controls frame-to-frame consistency by changing how much of the motion-compensated previous output frame is fed into the next frame's diffusion pipeline in place of the initial latent noise, a la img2img latent diffusion (citation needed). Values around 0.3 to 0.5 work well, and sometimes much higher, closer to 1.0, the maximum, at which no noise is added and no denoising steps are run. (See the sketch just after this list for how the feedback frame is assembled.)
- See --help (below) for more options; there are many features not covered here, such as:
  - detector kwargs
  - original input frame feedthrough strength
  - motion estimate spatiotemporal smoothing (crude, simple exponential and Gaussian filters, but they can help give better results when the motion estimate is noisy)
  - color drift correction
  - and more things I forgot to mention
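Here is the sketch referenced above: a minimal illustration of how a motion-compensated feedback frame can be built from dense optical flow and blended with the raw input according to the feedthrough strength. It uses OpenCV's Farnebäck flow and remap; the script's actual implementation may differ in flow estimator, smoothing, and color handling.

```python
import cv2
import numpy as np

def motion_compensated_feedback(prev_input, cur_input, prev_output, feedthrough_strength=0.0):
    """Warp the previous output frame so it lines up with the current input frame,
    then blend in a little of the raw input according to the feedthrough strength."""
    prev_gray = cv2.cvtColor(prev_input, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_input, cv2.COLOR_BGR2GRAY)
    # Dense flow from the current input back to the previous input: for each pixel
    # in the current frame, where did it come from in the previous frame?
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w = cur_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(prev_output, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                       borderMode=cv2.BORDER_REPLICATE)
    # feedthrough 0.0 feeds back only the warped previous output; 1.0 feeds the raw input.
    return cv2.addWeighted(cur_input, feedthrough_strength,
                           warped, 1.0 - feedthrough_strength, 0)
```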
If there is interest, I will write up a more detailed guide to the options and how to use them.

Usage: controlnetvideo.py [OPTIONS] INPUT_VIDEO OUTPUT_VIDEO
Options:
--overwrite / --no-overwrite don't overwrite existing output file -- add
a numeric suffix to get a unique filename.
default: --no-overwrite.
--start-time FLOAT start time in seconds
--end-time FLOAT end time in seconds
--duration FLOAT duration in seconds
--output-bitrate TEXT output bitrate for the video, e.g. '16M'
--output-codec TEXT output codec for the video, e.g. 'libx264'
--max-dimension INTEGER maximum dimension of the video
--min-dimension INTEGER minimum dimension of the video
--round-dims-to INTEGER round the dimensions to the nearest multiple
of this number
--fix-orientation / --no-fix-orientation
resize videos shot in portrait mode on some
devices to fix incorrect aspect ratio bug
--no-audio don't include audio in the output video,
even if the input video has audio
--audio-from PATH audio file to use for the output video,
replaces the audio from the input video,
will be truncated to duration of input or
--duration if given. tempo and bars are
analyzed and can be used to drive animation
with the --audio-animate parameter.
--audio-offset FLOAT offset in seconds to start the audio from,
when used with --audio-from
--audio-animate TEXT specify parameters and curves which should
be animated according to the rhythm
information detected in the soundtrack.
format is: 'name=L x y C x y a b c d ...;
name=...; ...', where name is an animatable
parameter, L is a linear transition, and C
is a cubic bezier curve, x is the position
within a bar (four beats, starting on the
downbeat) of the audio, and y is the
parameter value at that point.
--prompt TEXT prompt used to guide the denoising process
--negative-prompt TEXT negative prompt, can be used to prevent the
model from generating certain words
--prompt-strength FLOAT how much influence the prompt has on the
output
--num-inference-steps, --steps INTEGER
number of inference steps, depends on the
scheduler, trades off speed for quality.
20-50 is a good range from fastest to best.
--controlnet [refxl|fluxcanny|depthxl|aesthetic|lineart21|hed|hed21|canny|canny21|openpose|openpose21|depth|depth21|normal|mlsd]
which pretrained model and controlnet type
to use. the default, depthxl, uses the dpt
depth estimator and controlnet with the sdxl
base model
--controlnet-strength FLOAT how much influence the controlnet
annotator's output is used to guide the
denoising process
--init-image-strength FLOAT the init-image strength, or how much of the
prompt-guided denoising process to skip in
favor of starting with an existing image
--feedthrough-strength FLOAT the ratio of input to motion compensated
prior output to feed through to the next
frame
--motion-alpha FLOAT smooth the motion vectors over time, 0.0 is
no smoothing, 1.0 is maximum smoothing
--motion-sigma FLOAT smooth the motion estimate spatially, 0.0 is
no smoothing, used as sigma for gaussian
blur
--show-detector / --no-show-detector
show the controlnet detector output
--show-input / --no-show-input show the input frame
--show-output / --no-show-output
show the output frame
--show-motion / --no-show-motion
show the motion transfer (not implemented
yet)
--dump-frames PATH write intermediate frame images to a
file/files during processing to visualise
progress. may contain various {}
placeholders
--skip-dumped-frames read dumped frames from a previous run
instead of processing the input video
--dump-video write intermediate dump images to the final
video instead of just the final output image
--color-fix [none|rgb|hsv|lab] prevent color from drifting due to feedback
and model bias by fixing the histogram to
the first frame. specify colorspace for
histogram matching, e.g. 'rgb' or 'hsv' or
'lab', or 'none' to disable.
--color-amount FLOAT blend between the original color and the
color matched version, 0.0-1.0
--color-info print extra stats about the color content of
the output to help debug color drift issues
--canny-low-thr FLOAT canny edge detector lower threshold
--canny-high-thr FLOAT canny edge detector higher threshold
--mlsd-score-thr FLOAT mlsd line detector v threshold
--mlsd-dist-thr FLOAT mlsd line detector d threshold
--swap-images Switch the init and reference images when
using reference-only controlnet
--help Show this message and exit.