AutoMV is a training-free, multi-agent system that automatically generates coherent, long-form music videos (MVs) directly from a full-length song.
The pipeline integrates music signal analysis, scriptwriting, character management, adaptive video generation, and multimodal verification, aiming to make high-quality MV production accessible and scalable.
This repository corresponds to the paper:
AutoMV: An Automatic Multi-Agent System for Music Video Generation
AutoMV is designed as a full music-to-video (M2V) production workflow with strong music-aware reasoning abilities.
- Beat tracking, structure segmentation (SongFormer)
- Vocal/accompaniment separation (htdemucs)
- Automatic lyrics transcription with timestamps (Whisper)
- Music captioning (genre, mood, vocalist attributes) using Qwen2.5-Omni
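The preprocessing stage can be approximated with the same off-the-shelf tools. The sketch below is not the repository's actual code; the file paths and demucs output layout are assumptions. It separates vocals with htdemucs and transcribes time-stamped lyrics with Whisper:

```python
# Illustrative preprocessing sketch, not AutoMV's actual code.
# Assumes the `demucs` and `openai-whisper` packages are installed.
import subprocess
import whisper

song = "result/my_song/my_song.mp3"  # hypothetical input path

# 1. Separate vocals from accompaniment with htdemucs (two-stem mode).
subprocess.run(
    ["python", "-m", "demucs", "--two-stems=vocals", "-n", "htdemucs", song],
    check=True,
)

# 2. Transcribe the vocal stem with word-level timestamps.
#    "separated/htdemucs/my_song/vocals.wav" is demucs' default output layout.
model = whisper.load_model("large-v2")
result = model.transcribe("separated/htdemucs/my_song/vocals.wav",
                          word_timestamps=True)
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['text'].strip()}")
```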
- Screenwriter Agent: creates narrative descriptions, scene summaries, character settings
- Director Agent: produces shot-level scripts, camera instructions, and prompts
- Verifier Agent: checks physical realism, instruction following, and character consistency
- A structured database describing each character's face, hair, skin tone, clothing, gender, age, etc.
- Ensures stable identity across multiple shots and scenes
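For illustration only, one entry of such a character bank (saved as label.json in the output directory) could look like the sketch below; the exact field names and schema are assumptions, not the repository's format.

```python
# Hypothetical character-bank entry; field names are illustrative only,
# not the exact schema AutoMV stores in label.json.
character_bank = {
    "lead_singer": {
        "face": "oval face, sharp jawline",
        "hair": "short black hair",
        "skin_tone": "light tan",
        "clothing": "white shirt, dark denim jacket",
        "gender": "male",
        "age": "mid-20s",
    }
}
```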
- Doubao Video API: general cinematic shots
- Qwen-Wan 2.2: lip-sync shots using vocal stems
- Keyframe-guided generation with cross-shot continuity
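As a rough sketch of how shots might be routed between the two backends (all function and field names below are hypothetical stand-ins, not AutoMV's actual API wrappers):

```python
# Hypothetical routing between the two generation backends; the wrapper
# functions are stubs standing in for the real Doubao / Wan 2.2 calls.
def call_doubao_video(keyframe: str, prompt: str, camera: str) -> str:
    raise NotImplementedError("stand-in for the Doubao video API wrapper")

def call_wan_s2v(keyframe: str, vocal_stem: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for the Wan 2.2 S2V wrapper")

def generate_clip(shot: dict, vocal_stem_path: str) -> str:
    """Route one shot to the appropriate backend and return the clip path."""
    if shot.get("lip_sync"):
        # Lip-sync shots are driven by the separated vocal stem.
        return call_wan_s2v(shot["keyframe"], vocal_stem_path, shot["prompt"])
    # General cinematic shots are conditioned on the keyframe for continuity.
    return call_doubao_video(shot["keyframe"], shot["prompt"], shot["camera"])
```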
Includes 12 fine-grained criteria under 4 categories:
- Technical
- Post-production
- Content
- Art
Evaluated via LLM judges (Gemini-2.5-Pro/Flash) and human experts.
AutoMV consists of four main stages:
- Music Preprocessing
- Screenwriter & Director Agents
- Keyframe + Video Clip Generation
- Gemini Verifier & Final Assembly
A detailed architecture diagram is available in the paper.
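Read as code, the four stages form a simple sequential flow; every function name below is an illustrative stand-in that only mirrors the stage descriptions above, not the repository's actual entry points.

```python
# Illustrative end-to-end flow of the four stages; all functions here are
# hypothetical stand-ins for the stages described above.
def preprocess_music(song_path): ...            # beats, structure, lyrics, caption
def write_script(analysis): ...                 # Screenwriter + Director agents
def generate_clips(script, analysis): ...       # keyframes + video clips
def verify_and_assemble(clips, song_path): ...  # Gemini verifier + final cut

def make_music_video(song_path: str) -> str:
    analysis = preprocess_music(song_path)
    script = write_script(analysis)
    clips = generate_clips(script, analysis)
    return verify_and_assemble(clips, song_path)
```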
AutoMV is a training-free system, relying on MIR tools and LLM/VLM APIs.
```bash
git clone https://github.com/multimodal-art-projection/AutoMV.git
cd AutoMV
pip install -r SongFormer_requirements.txt
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
```

Dependencies include:
- ffmpeg
- htdemucs
- whisper
- pydub
- SDKs for Gemini, Doubao, Qwen, etc.
Export the following environment variables in your shell profile (e.g., .bashrc, .zshrc), or set them in your shell before running the project:
```bash
GEMINI_API_KEY=xxx
DOUBAO_API_KEY=xxx
ALIYUN_OSS_ACCESS_KEY_ID=xxx      # Aliyun OSS Access Key ID
ALIYUN_OSS_ACCESS_KEY_SECRET=xxx  # Aliyun OSS Access Key Secret
ALIYUN_OSS_BUCKET_NAME=xxx        # Aliyun OSS Bucket Name
HUOSHAN_ACCESS_KEY=xxx            # Huoshan Engine ACCESS KEY
HUOSHAN_SECRET_KEY=xxx            # Huoshan Engine SECRET KEY
GPU_ID=xxx                        # Optional
WHISPER_MODEL=xxx
QWEN_OMNI_MODEL=xxx
```

Before running the project, download the following pretrained models:
- Qwen2.5-Omni-7B
  - Download Source: ModelScope
  - Link: https://modelscope.cn/models/qwen/Qwen2.5-Omni-7B
- Whisper Large-v2
  - Installation & Usage Instructions: https://github.com/openai/whisper
- Wan2.2-s2v (Optional)
  - Note: This model is for local lip-synced video generation. Processing a single song typically requires 4-5 hours on an A800 GPU, but it is significantly cheaper than using API calls.
- Model Setup:
  - Navigate to the lip-sync directory:
    ```bash
    cd generate_lip_video
    ```
  - Clone the model repository:
    ```bash
    git clone https://huggingface.co/Wan-AI/Wan2.2-S2V-14B
    ```
  - Environment setup (mandatory): a new environment is required for the local model due to potential package conflicts.
    ```bash
    conda create -n gen_lip python=3.10
    conda activate gen_lip
    pip install -r requirements.txt
    pip install -r requirements_s2v.txt
    ```
  - Code modification: comment out the function call `gen_lip_sync_video_jimeng(music_video_name, config=Config)` within the file `generate_pipeline.py`.
- Testing/execution steps (once the config setup is complete):
  ```bash
  # 1. Navigate to the picture generation directory:
  cd picture_generate
  # 2. Run the picture generation script:
  python picture.py
  # 3. Run the lip-sync generation script:
  python generate_lip_video/gen_lip_sycn_video.py
  # 4. Run the main pipeline:
  python generate_pipeline.py
  ```
After downloading the models, specify their paths in `config.py`:

```python
MODEL_PATH_QWEN = "/path/to/Qwen2.5-Omni-7B"
WHISPER_MODEL_PATH = "/path/to/whisper-large-v2"
```

Then download the SongFormer pre-trained models:
```bash
cd picture_generate/SongFormer/src/SongFormer
# For users in mainland China, you may need: export HF_ENDPOINT=https://hf-mirror.com
python utils/fetch_pretrained.py
```

Place your .mp3 or .wav file into `./result/{music_name}/{music_name}.mp3`.

In `config.py`, replace {music_name} with the identifier of your music project.
This name will be used as the directory name for storing all intermediate and final outputs.
Please use only English letters, numbers, or underscores in the name.
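Putting the settings above together, a minimal `config.py` might look like the sketch below; the `music_name` variable name is an assumption for the {music_name} placeholder, while the two model-path variables are the ones shown earlier.

```python
# Sketch of config.py; `music_name` is an assumed variable name for the
# {music_name} placeholder described above.
MODEL_PATH_QWEN = "/path/to/Qwen2.5-Omni-7B"
WHISPER_MODEL_PATH = "/path/to/whisper-large-v2"
music_name = "my_song"  # English letters, numbers, or underscores only
```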
For users in mainland China, you may need to `export HF_ENDPOINT=https://hf-mirror.com`.
(1) Generate the first-frame images for each MV segment
```bash
python -m picture_generate.main
```

This step:
- Generates visual prompts for each segment
- Produces keyframe images
- Saves results under result/{music_name}/picture/
(2) Generate the complete music video
```bash
python generate_pipeline.py
```

This step:
- Generates all video clips using storyboard + camera scripts + keyframes
- Merges clips into a final MV
- Saves the result as result/{music_name}/mv_{music_name}.mp4
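Conceptually, the final assembly is a concatenation of the generated clips plus a mux of the original audio track. The sketch below shows one way to do this with ffmpeg's concat demuxer; it is not the repository's assembly code, and the paths are illustrative.

```python
# Illustrative clip assembly with ffmpeg's concat demuxer; this is a sketch,
# not AutoMV's actual assembly code. Paths are hypothetical.
import subprocess
from pathlib import Path

clips_dir = Path("result/my_song/output")
concat_list = Path("result/my_song/concat.txt")
concat_list.write_text(
    "\n".join(f"file '{p.resolve()}'" for p in sorted(clips_dir.glob("*.mp4")))
)

subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(concat_list),
     "-i", "result/my_song/my_song.mp3",   # mux the original audio track
     "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
     "result/my_song/mv_my_song.mp4"],
    check=True,
)
```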
After running the full pipeline, the output directory will contain:
```
result/{music_name}/
├── camera/                    # Camera directions for each MV segment
├── output/                    # Generated video clips for each segment
├── picture/                   # First-frame images of each MV segment
├── piece/                     # Audio segments cut from the original song
├── {music_name}_vocals.wav    # Separated vocal audio (optional)
├── {music_name}.mp3           # The full original audio
├── label.json                 # Character Bank
├── mv_{music_name}.mp4        # The final generated music video
├── name.txt                   # Full name of the song
└── story.json                 # Complete MV storyboard
```

We evaluate AutoMV with:
- ImageBind Score (IB): cross-modal similarity between audio and visual content. The relevant code is in evaluate/IB.
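For reference, an ImageBind-style audio-visual similarity can be computed roughly as in the sketch below, using the open-source imagebind package; this is illustrative and not necessarily identical to the code in evaluate/IB, and the frame/segment paths are hypothetical.

```python
# Sketch of an ImageBind audio-visual similarity; illustrative only and not
# necessarily identical to the code in evaluate/IB.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    # Sampled MV frames and the matching audio segment (hypothetical paths).
    ModalityType.VISION: data.load_and_transform_vision_data(["frame_0001.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["segment_0001.wav"], device),
}
with torch.no_grad():
    emb = model(inputs)

score = torch.nn.functional.cosine_similarity(
    emb[ModalityType.VISION], emb[ModalityType.AUDIO]).mean().item()
print(f"ImageBind audio-visual similarity: {score:.3f}")
```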
Using multimodal LLMs (Gemini-2.5-Pro/Flash) to score:
- Technical quality
- Post-production
- Music content alignment
- Artistic quality
The relevant code is in evaluate/LLM.
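A hedged sketch of such an LLM-judge call with the google-generativeai SDK is shown below; the prompt wording and file handling are simplified and illustrative, so see evaluate/LLM for the actual scripts.

```python
# Illustrative LLM-judge call; prompt wording and handling are simplified
# relative to the actual scripts in evaluate/LLM.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

video = genai.upload_file("result/my_song/mv_my_song.mp4")
while video.state.name == "PROCESSING":   # wait for the upload to be processed
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")
prompt = (
    "Score this music video from 1 to 5 on each of: technical quality, "
    "post-production, music content alignment, and artistic quality. "
    "Return JSON with one score and a short justification per category."
)
response = model.generate_content([video, prompt])
print(response.text)
```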
Music producers, MV directors, and industry practitioners scored each sub-criterion (1-5).
On a benchmark of 30 professionally released songs, AutoMV outperforms existing commercial systems:
| Method | Cost | Time | IB ↑ | Human Score ↑ |
|---|---|---|---|---|
| Revid.ai-base | ~$10 | 5-10 min | 19.9 | 1.06 |
| OpenArt-story | $20-40 | 10-20 min | 18.5 | 1.45 |
| AutoMV (ours) | $10-20 | ~30 min | 24.4 | 2.42 |
| Human (experts) | ≥$10k | Weeks | 24.1 | 2.90 |
AutoMV greatly improves:
- Character consistency
- Shot continuity
- Audio-visual correlation
- Storytelling & theme relevance
- Overall coherence of long-form MVs
If you use AutoMV in your research, please cite:
```bibtex
@misc{tang2025automv,
  title={AutoMV: An Automatic Multi-Agent System for Music Video Generation},
  author={Tang, Xiaoxuan and Lei, Xinping and Zhu, Chaoran and Chen, Shiyun and Yuan, Ruibin and Li, Yizhi and Oh, Changjae and Zhang, Ge and Huang, Wenhao and Benetos, Emmanouil and Liu, Yang and Liu, Jiaheng and Ma, Yinghao},
  year={2025},
  eprint={2512.12196},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2512.12196},
}
```

This project is released under the Apache 2.0 License.
AutoMV builds on: