The existing T2V model fails to estimate an accurate attention map for the motion prompt {Floating}, which reduces editability, as shown in the top row of (b). (a) compares the editability of Video-P2P with and without the proposed module on the input video. The proposed module improves the editability of existing video editing models through accurate estimation of attention maps. (b) briefly illustrates how the proposed Motion-to-Attention module enhances the attention map, addressing the limitation that the existing T2V model cannot estimate it accurately.
Recent text-guided video editing research attempts to extend text-guided image editing models from images to videos. Accordingly, most studies treat temporal consistency between frames as the primary challenge in text-guided video editing. However, despite these efforts, editability remains limited when the prompt contains a motion word such as "floating". In our experiments, we found that this is caused by an inaccurate attention map for the motion prompt. In this paper, we propose the Motion-to-Attention (M2A) module to perform precise video editing by explicitly taking motion into account. First, we convert the optical flow extracted from the video into a motion map. During this conversion, users can optionally apply direction information to extract the motion map. The proposed M2A module uses two methods: "Attention-Motion Swap", which directly replaces the attention map with the motion map, and "Attention-Motion Fusion", which uses the association between the motion map and the attention map, measured by a fusion metric, as a weight to enhance the attention map with the motion map. A Text-to-Video editing model equipped with the proposed M2A module shows better quantitative and qualitative results than the existing model.
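As a rough illustration of the motion-map conversion described above, the sketch below turns a per-pixel optical flow field into a motion map; the (H, W, 2) flow layout, the 45-degree tolerance window, and the function name are assumptions for illustration only, not our exact implementation.

# Illustrative sketch only; the flow layout (H, W, 2), the angular tolerance,
# and the normalization are assumptions rather than the exact implementation.
from typing import Optional
import numpy as np

def flow_to_motion_map(flow: np.ndarray, direction_deg: Optional[float] = None) -> np.ndarray:
    """Convert optical flow into a motion map. By default only the magnitude is
    used; if a direction (in degrees) is given, only motion along it is kept."""
    magnitude = np.linalg.norm(flow, axis=-1)  # per-pixel motion strength
    if direction_deg is not None:
        angle = np.degrees(np.arctan2(flow[..., 1], flow[..., 0]))
        # Keep pixels whose flow direction lies within +/-45 degrees of the request.
        diff = np.abs((angle - direction_deg + 180.0) % 360.0 - 180.0)
        magnitude = np.where(diff <= 45.0, magnitude, 0.0)
    # Normalize to [0, 1] so the motion map can be injected into an attention map.
    return magnitude / max(float(magnitude.max()), 1e-8)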
The left side of the figure shows the overall framework of video editing with an enhanced attention map. First, the Text-to-Video (T2V) model generates an attention map from the input video and prompt. Simultaneously, an optical flow estimation model estimates the optical flow from the input video frames. By default, the estimated optical flow is converted into a motion map using only magnitude information. Optionally, when the user provides direction information, Direction Control converts the optical flow into a motion map that only reflects movement in the user-specified direction. If the user marks directional words with [ ], the model captures the direction information and performs Direction Control. Then, the M2A module injects the motion map into the attention map of the T2V model in two ways: Attention-Motion Swap and Attention-Motion Fusion. After that, text-to-video editing is performed using the attention map enhanced by the motion map. The right side of the figure shows how the Attention-Motion Swap and Attention-Motion Fusion of the M2A module enhance the attention map with the motion map.

You can find more experimental results on our project page.
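The sketch below shows one way the two M2A variants could be realized in code, assuming the attention map and motion map are already normalized to [0, 1] and share the same spatial resolution; the cosine-similarity fusion metric is an illustrative assumption, since the description above only states that a fusion metric measures their association.

# Illustrative sketch only; tensor names/shapes and the cosine-similarity
# fusion metric are assumptions, not the exact implementation.
import torch
import torch.nn.functional as F

def attention_motion_swap(attention_map: torch.Tensor,
                          motion_map: torch.Tensor) -> torch.Tensor:
    """Directly replace the motion prompt's attention map with the motion map."""
    return motion_map

def attention_motion_fusion(attention_map: torch.Tensor,
                            motion_map: torch.Tensor) -> torch.Tensor:
    """Enhance the attention map with the motion map, weighted by their association."""
    # Association between the two maps (assumed to be cosine similarity here).
    weight = F.cosine_similarity(attention_map.flatten(), motion_map.flatten(), dim=0)
    fused = attention_map + weight * motion_map
    # Re-normalize so the enhanced map stays in [0, 1].
    return fused / fused.max().clamp(min=1e-8)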
Our environment is very similar to that of Video-P2P.
The versions of the packages we installed are:
torch: 1.12.1
xformers: 0.0.15.dev0+0bad001.d20230712
For xformers, we installed it through the link provided by Video-P2P.
pip install -r requirements.txt

We use the pre-trained Stable Diffusion model. You can download it here.
Since our code is based on the Video-P2P code, you can refer to their GitHub repository if needed.
Please replace pretrained_model_path with the path to your stable-diffusion.
To download the pre-trained model, please refer to diffusers.
# Stage 1: Tuning to do model initialization.
# You can minimize the tuning epochs to speed up.
python run_tuning.py --config="configs/cloud-1-tune.yaml"

# Stage 2: Attention Control
python run_motion_to_attention.py --config="configs/cloud-1-p2p.yaml" --motion_prompt "Please enter motion prompt"
# If the prompt is "clouds flowing under a skyscraper", the motion prompt is "flowing".
# You can input the motion prompt as below.
python run_motion_to_attention.py --config="configs/cloud-1-p2p.yaml" --motion_prompt "flowing"

Find your results in Video-P2P/outputs/xxx/results.
This repository borrows heavily from Video-P2P. Thanks to the authors for sharing their code and models.










