This repository contains the official implementation of the paper Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding, which achieves state-of-the-art performance by a large margin on multiple long-video benchmarks, including the challenging LVBench.
- 2025/08/04: Uploaded benchmark captions for reproduction: LVBench, LVBench w/ transcripts, Video-MME, LongVideoBench, and EgoSchema.
- 2025/08/02: Added automatic subtitle support in the demo.
- 2025/07/17: Added a gradio demo.
- 2025/07/16: Added `lite_mode`, which enables a lightweight version of the agent that uses only subtitles. Good for YouTube podcast analysis!
- 2025/07/14: Added support for the OpenAI API and Azure OpenAI API.
- 2025/07/08: Initial release of the Deep Video Discovery codebase.
Deep Video Discovery (DVD) is a deep-research style question answering agent designed for understanding extra-long videos. Leveraging the powerful capabilities of large language models (LLMs), DVD effectively interprets and processes extensive video content to answer complex user queries.
demo.mp4
DVD achieves state-of-the-art performance by a large margin on multiple long-video benchmarks using OpenAI o3.
The core design of DVD includes:
- 🎬 Treating segmented video clips as exploration environments
- 🤖 Autonomous planning and reasoning, dynamically formulating strategies to solve problems efficiently
- 🛠️ Selecting appropriate multi-granular tools, iteratively extracting relevant information from the video environment
- 📝 Summarizing and reflecting on observations, to provide comprehensive and accurate answers to user questions
- Clone the repository:

  ```bash
  git clone https://github.com/microsoft/deepvideodiscovery.git
  cd DeepVideoDiscovery
  ```

- Create a virtual environment and install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install gradio for the demo:

  ```bash
  pip install gradio
  ```

Note: Set up your configuration by updating the variables in `config.py`.
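The exact variables in `config.py` depend on the repository version, so check the file itself for the real names. As a hedged sketch only (every variable name below is an illustrative assumption, not the project's actual settings), a configuration module for an LLM-backed agent that supports both the OpenAI API and Azure OpenAI typically looks like this:

```python
# Illustrative sketch of an LLM configuration module. The variable names
# below are assumptions for illustration only -- consult config.py in the
# repository for the settings it actually expects.
import os

# Read credentials from the environment so they never land in source control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

# Azure OpenAI users point the client at their own endpoint instead.
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")

# Model used by the agent (the README reports results with OpenAI o3).
MODEL_NAME = os.environ.get("DVD_MODEL_NAME", "o3")

# A simple sanity check: exactly one backend should be configured.
USE_AZURE = bool(AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY)
```

Keeping secrets in environment variables rather than hard-coding them in `config.py` avoids accidentally committing API keys.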
The `local_run.py` script provides an example of how to run the Deep Video Discovery agent given a YouTube URL and a question about the video.
```bash
python local_run.py https://www.youtube.com/watch?v=PQFQ-3d2J-8 "what did the main speaker talk about in the last part of video?"
```
- Support OpenAI API key configuration.
- Implement MCP server.
- Release evaluation trajectory data on long video benchmarks.
Compared to the original implementation, we have made the following changes:
- Refactored the code for better readability and maintainability.
- In `global_browse_tool`, we leverage the textual descriptions (rather than the original video pixels) of multiple video clips to provide a global overview of the video content, which improves efficiency.
If you find our work useful, please consider citing:
```bibtex
@article{zhang2025deep,
  title={Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding},
  author={Zhang, Xiaoyi and Jia, Zhaoyang and Guo, Zongyu and Li, Jiahao and Li, Bin and Li, Houqiang and Lu, Yan},
  journal={arXiv preprint arXiv:2505.18079},
  year={2025}
}
```