AutoMV is a training-free, multi-agent system that automatically generates coherent, long-form music videos (MVs) directly from a full-length song.
The pipeline integrates music signal analysis, scriptwriting, character management, adaptive video generation, and multimodal verification, aiming to make high-quality MV production accessible and scalable.
This repository corresponds to the paper:
AutoMV: An Automatic Multi-Agent System for Music Video Generation
AutoMV is designed as a full music-to-video (M2V) production workflow with strong music-aware reasoning abilities.
- Beat tracking, structure segmentation (SongFormer)
- Vocal/accompaniment separation (htdemucs)
- Automatic lyrics transcription with timestamps (Whisper)
- Music captioning (genre, mood, vocalist attributes) using Qwen2.5-Omni
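The preprocessing stage can be approximated with the same off-the-shelf tools. The sketch below is not the repository's actual code; the file paths and demucs output layout are assumptions. It separates vocals with htdemucs and transcribes time-stamped lyrics with Whisper:

```python
# Illustrative preprocessing sketch, not AutoMV's actual code.
# Assumes the `demucs` and `openai-whisper` packages are installed.
import subprocess
import whisper

song = "result/my_song/my_song.mp3"  # hypothetical input path

# 1. Separate vocals from accompaniment with htdemucs (two-stem mode).
subprocess.run(
    ["python", "-m", "demucs", "--two-stems=vocals", "-n", "htdemucs", song],
    check=True,
)

# 2. Transcribe the vocal stem with word-level timestamps.
#    "separated/htdemucs/my_song/vocals.wav" is demucs' default output layout.
model = whisper.load_model("large-v2")
result = model.transcribe("separated/htdemucs/my_song/vocals.wav",
                          word_timestamps=True)
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['text'].strip()}")
```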
- Screenwriter Agent: creates narrative descriptions, scene summaries, character settings
- Director Agent: produces shot-level scripts, camera instructions, and prompts
- Verifier Agent: checks physical realism, instruction following, and character consistency
- A structured database describing each character's face, hair, skin tone, clothing, gender, age, etc.
- Ensures stable identity across multiple shots and scenes
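For illustration only, one entry of such a character bank (saved as label.json in the output directory) could look like the sketch below; the exact field names and schema are assumptions, not the repository's format.

```python
# Hypothetical character-bank entry; field names are illustrative only,
# not the exact schema AutoMV stores in label.json.
character_bank = {
    "lead_singer": {
        "face": "oval face, sharp jawline",
        "hair": "short black hair",
        "skin_tone": "light tan",
        "clothing": "white shirt, dark denim jacket",
        "gender": "male",
        "age": "mid-20s",
    }
}
```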
- Doubao Video API: general cinematic shots
- Qwen-Wan 2.2: lip-sync shots using vocal stems
- Keyframe-guided generation with cross-shot continuity
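As a rough sketch of how shots might be routed between the two backends (all function and field names below are hypothetical stand-ins, not AutoMV's actual API wrappers):

```python
# Hypothetical routing between the two generation backends; the wrapper
# functions are stubs standing in for the real Doubao / Wan 2.2 calls.
def call_doubao_video(keyframe: str, prompt: str, camera: str) -> str:
    raise NotImplementedError("stand-in for the Doubao video API wrapper")

def call_wan_s2v(keyframe: str, vocal_stem: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for the Wan 2.2 S2V wrapper")

def generate_clip(shot: dict, vocal_stem_path: str) -> str:
    """Route one shot to the appropriate backend and return the clip path."""
    if shot.get("lip_sync"):
        # Lip-sync shots are driven by the separated vocal stem.
        return call_wan_s2v(shot["keyframe"], vocal_stem_path, shot["prompt"])
    # General cinematic shots are conditioned on the keyframe for continuity.
    return call_doubao_video(shot["keyframe"], shot["prompt"], shot["camera"])
```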
Includes 12 fine-grained criteria under 4 categories:
- Technical
- Post-production
- Content
- Art
Evaluated via LLM judges (Gemini-2.5-Pro/Flash) and human experts.
AutoMV consists of four main stages:
- Music Preprocessing
- Screenwriter & Director Agents
- Keyframe + Video Clip Generation
- Gemini Verifier & Final Assembly
A detailed architecture diagram is available in the paper.
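Read as code, the four stages form a simple sequential flow; every function name below is an illustrative stand-in that only mirrors the stage descriptions above, not the repository's actual entry points.

```python
# Illustrative end-to-end flow of the four stages; all functions here are
# hypothetical stand-ins for the stages described above.
def preprocess_music(song_path): ...            # beats, structure, lyrics, caption
def write_script(analysis): ...                 # Screenwriter + Director agents
def generate_clips(script, analysis): ...       # keyframes + video clips
def verify_and_assemble(clips, song_path): ...  # Gemini verifier + final cut

def make_music_video(song_path: str) -> str:
    analysis = preprocess_music(song_path)
    script = write_script(analysis)
    clips = generate_clips(script, analysis)
    return verify_and_assemble(clips, song_path)
```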
AutoMV is a training-free system, relying on MIR tools and LLM/VLM APIs.
```bash
git clone https://github.com/multimodal-art-projection/AutoMV.git
cd AutoMV
pip install -r SongFormer_requirements.txt
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
```

Dependencies include:
- ffmpeg
- htdemucs
- whisper
- pydub
- SDKs for Gemini, Doubao, Qwen, etc.
Export the following environment variables in your shell profile (e.g., .bashrc, .zshrc), or set them in your shell before running the project:
```bash
GEMINI_API_KEY=xxx
DOUBAO_API_KEY=xxx
ALIYUN_OSS_ACCESS_KEY_ID=xxx      # Aliyun OSS Access Key ID
ALIYUN_OSS_ACCESS_KEY_SECRET=xxx  # Aliyun OSS Access Key Secret
ALIYUN_OSS_BUCKET_NAME=xxx        # Aliyun OSS Bucket Name
HUOSHAN_ACCESS_KEY=xxx            # Huoshan Engine ACCESS KEY
HUOSHAN_SECRET_KEY=xxx            # Huoshan Engine SECRET KEY
GPU_ID=xxx                        # Optional
WHISPER_MODEL=xxx
QWEN_OMNI_MODEL=xxx
```

Before running the project, download the following pretrained models:
- Qwen2.5-Omni-7B
  - Download Source: ModelScope
  - Link: https://modelscope.cn/models/qwen/Qwen2.5-Omni-7B
- Whisper Large-v2
  - Installation & Usage Instructions: https://github.com/openai/whisper
- Wan2.2-s2v (Optional)
  - Note: This model is for local lip-synced video generation. Processing a single song typically requires 4-5 hours on an A800 GPU, but it is significantly cheaper than using API calls.
- Model Setup:
  - Navigate to the lip-sync directory:
    ```bash
    cd generate_lip_video
    ```
  - Clone the model repository:
    ```bash
    git clone https://huggingface.co/Wan-AI/Wan2.2-S2V-14B
    ```
  - Environment setup (mandatory): a new environment is required for the local model due to potential package conflicts.
    ```bash
    conda create -n gen_lip python=3.10
    conda activate gen_lip
    pip install -r requirements.txt
    pip install -r requirements_s2v.txt
    ```
  - Code modification: comment out the function call `gen_lip_sync_video_jimeng(music_video_name, config=Config)` within the file `generate_pipeline.py`.
- Testing/execution steps (once the config setup is complete):
  ```bash
  # 1. Navigate to the picture generation directory:
  cd picture_generate
  # 2. Run the picture generation script:
  python picture.py
  # 3. Run the lip-sync generation script:
  python generate_lip_video/gen_lip_sycn_video.py
  # 4. Run the main pipeline:
  python generate_pipeline.py
  ```
After downloading the models, specify their paths in `config.py`:

```python
MODEL_PATH_QWEN = "/path/to/Qwen2.5-Omni-7B"
WHISPER_MODEL_PATH = "/path/to/whisper-large-v2"
```

Then download the SongFormer pre-trained models:
```bash
cd picture_generate/SongFormer/src/SongFormer
# For users in mainland China, you may need: export HF_ENDPOINT=https://hf-mirror.com
python utils/fetch_pretrained.py
```

Place your .mp3 or .wav file into `./result/{music_name}/{music_name}.mp3`.

In `config.py`, replace {music_name} with the identifier of your music project.
This name will be used as the directory name for storing all intermediate and final outputs.
Please use only English letters, numbers, or underscores in the name.
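Putting the settings above together, a minimal `config.py` might look like the sketch below; the `music_name` variable name is an assumption for the {music_name} placeholder, while the two model-path variables are the ones shown earlier.

```python
# Sketch of config.py; `music_name` is an assumed variable name for the
# {music_name} placeholder described above.
MODEL_PATH_QWEN = "/path/to/Qwen2.5-Omni-7B"
WHISPER_MODEL_PATH = "/path/to/whisper-large-v2"
music_name = "my_song"  # English letters, numbers, or underscores only
```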
For users in mainland China, you may need to `export HF_ENDPOINT=https://hf-mirror.com`.
(1) Generate the first-frame images for each MV segment
```bash
python -m picture_generate.main
```

This step:
- Generates visual prompts for each segment
- Produces keyframe images
- Saves results under result/{music_name}/picture/
(2) Generate the complete music video
```bash
python generate_pipeline.py
```

This step:
- Generates all video clips using storyboard + camera scripts + keyframes
- Merges clips into a final MV
- Saves the result as result/{music_name}/mv_{music_name}.mp4
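Conceptually, the final assembly is a concatenation of the generated clips plus a mux of the original audio track. The sketch below shows one way to do this with ffmpeg's concat demuxer; it is not the repository's assembly code, and the paths are illustrative.

```python
# Illustrative clip assembly with ffmpeg's concat demuxer; this is a sketch,
# not AutoMV's actual assembly code. Paths are hypothetical.
import subprocess
from pathlib import Path

clips_dir = Path("result/my_song/output")
concat_list = Path("result/my_song/concat.txt")
concat_list.write_text(
    "\n".join(f"file '{p.resolve()}'" for p in sorted(clips_dir.glob("*.mp4")))
)

subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(concat_list),
     "-i", "result/my_song/my_song.mp3",   # mux the original audio track
     "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
     "result/my_song/mv_my_song.mp4"],
    check=True,
)
```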
After running the full pipeline, the output directory will contain:
```
result/{music_name}/
├── camera/                    # Camera directions for each MV segment
├── output/                    # Generated video clips for each segment
├── picture/                   # First-frame images of each MV segment
├── piece/                     # Audio segments cut from the original song
├── {music_name}_vocals.wav    # Separated vocal audio (optional)
├── {music_name}.mp3           # The full original audio
├── label.json                 # Character Bank
├── mv_{music_name}.mp4        # The final generated music video
├── name.txt                   # Full name of the song
└── story.json                 # Complete MV storyboard
```

We evaluate AutoMV with:
- ImageBind Score (IB): cross-modal similarity between audio and visual content. The relevant code is in evaluate/IB.
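For reference, an ImageBind-style audio-visual similarity can be computed roughly as in the sketch below, using the open-source imagebind package; this is illustrative and not necessarily identical to the code in evaluate/IB, and the frame/segment paths are hypothetical.

```python
# Sketch of an ImageBind audio-visual similarity; illustrative only and not
# necessarily identical to the code in evaluate/IB.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    # Sampled MV frames and the matching audio segment (hypothetical paths).
    ModalityType.VISION: data.load_and_transform_vision_data(["frame_0001.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["segment_0001.wav"], device),
}
with torch.no_grad():
    emb = model(inputs)

score = torch.nn.functional.cosine_similarity(
    emb[ModalityType.VISION], emb[ModalityType.AUDIO]).mean().item()
print(f"ImageBind audio-visual similarity: {score:.3f}")
```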
Using multimodal LLMs (Gemini-2.5-Pro/Flash) to score:
- Technical quality
- Post-production
- Music content alignment
- Artistic quality
The relevant code is in evaluate/LLM.
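A hedged sketch of such an LLM-judge call with the google-generativeai SDK is shown below; the prompt wording and file handling are simplified and illustrative, so see evaluate/LLM for the actual scripts.

```python
# Illustrative LLM-judge call; prompt wording and handling are simplified
# relative to the actual scripts in evaluate/LLM.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

video = genai.upload_file("result/my_song/mv_my_song.mp4")
while video.state.name == "PROCESSING":   # wait for the upload to be processed
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")
prompt = (
    "Score this music video from 1 to 5 on each of: technical quality, "
    "post-production, music content alignment, and artistic quality. "
    "Return JSON with one score and a short justification per category."
)
response = model.generate_content([video, prompt])
print(response.text)
```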
Music producers, MV directors, and industry practitioners scored each sub-criterion (1-5).
On a benchmark of 30 professionally released songs, AutoMV outperforms existing commercial systems:
| Method | Cost | Time | IB ↑ | Human Score ↑ |
|---|---|---|---|---|
| Revid.ai-base | ~$10 | 5-10 min | 19.9 | 1.06 |
| OpenArt-story | $20-40 | 10-20 min | 18.5 | 1.45 |
| AutoMV (ours) | $10-20 | ~30 min | 24.4 | 2.42 |
| Human (experts) | ≥$10k | Weeks | 24.1 | 2.90 |
AutoMV greatly improves:
- Character consistency
- Shot continuity
- Audio-visual correlation
- Storytelling & theme relevance
- Overall coherence of long-form MVs
If you use AutoMV in your research, please cite:
```bibtex
@misc{tang2025automv,
  title={AutoMV: An Automatic Multi-Agent System for Music Video Generation},
  author={Tang, Xiaoxuan and Lei, Xinping and Zhu, Chaoran and Chen, Shiyun and Yuan, Ruibin and Li, Yizhi and Oh, Changjae and Zhang, Ge and Huang, Wenhao and Benetos, Emmanouil and Liu, Yang and Liu, Jiaheng and Ma, Yinghao},
  year={2025},
  eprint={2512.12196},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2512.12196},
}
```

This project is released under the Apache 2.0 License.
AutoMV builds on: