🤗 Omnimodal-Agent-SFT-2K | 🤗 OmniAtlas-3B | 🤗 OmniAtlas-7B | 🤗 OmniAtlas-30B-A3B
- [Feb 27, 2026]: Our paper is now available on arXiv and Hugging Face.
- [Feb 27, 2026]: Our OmniGAIA benchmark and OmniAtlas models are now available on Hugging Face.
- [Feb 27, 2026]: Full codebase released. You can now deploy omni-modal AI agents for images, audio, and video, along with research tools.
Demo videos: `demo_image_audio.mp4`, `demo_video.mp4`
OmniGAIA is a comprehensive benchmark designed to evaluate the capabilities of omni-modal general AI assistants. Unlike existing benchmarks that focus on a single modality, OmniGAIA requires agents to jointly reason over video, audio, and image inputs while leveraging external tools such as web search and code execution.
We also introduce OmniAtlas, an agentic reasoning system that extends a base LLM with active perception tools, enabling the model to request and examine additional media segments during multi-step reasoning.
The OmniGAIA construction pipeline consists of four stages:
- Data Collection – Curating video (with audio) and image+audio sources from FineVideo, LongVideoBench, LongVideo-Reason, COCO 2017, and HuggingFace, covering 100+ diverse domains.
- Valuable Information Discovery – Using Gemini-3-Flash to extract events, environmental analysis, audio analysis (ASR, speaker ID), and image understanding (OCR, objects, faces).
- Agentic Omni-Modal Event Graph Construction – DeepSeek-V3.2 iteratively expands an initial event graph by planning next steps, acquiring new information via tools, and verifying factual correctness with LLM self-reflection and human review.
- QA Generation & Quality Review – Generating difficult, multi-hop QA pairs through event fuzzification, followed by LLM and human verification of correctness, task difficulty, and answer uniqueness.
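For illustration, a single node in the omni-modal event graph produced by stage three might be represented as below. The field names and helper are our own assumptions for exposition, not the pipeline's actual schema:

```python
# Hypothetical event-graph node linking evidence across modalities.
# Field names are illustrative assumptions, not the pipeline's real schema.

def make_event_node(event_id, description, modality, source, timestamp=None):
    """Build one node of an omni-modal event graph."""
    return {
        "id": event_id,
        "description": description,  # natural-language event summary
        "modality": modality,        # "video" | "audio" | "image" | "web"
        "source": source,            # media file or URL the event came from
        "timestamp": timestamp,      # seconds into the media, if applicable
        "links": [],                 # ids of related events (graph edges)
    }

# Two events discovered in different modalities, then linked by an edge:
seen = make_event_node("e1", "A rocket lifts off from a coastal pad", "video",
                       "videos/launch.mp4", timestamp=42.0)
heard = make_event_node("e2", "Announcer names the launch site", "audio",
                        "videos/launch.mp4", timestamp=40.5)
seen["links"].append(heard["id"])
```

Multi-hop QA pairs are then grounded in paths through such cross-modal links.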
Core statistics:
- 360 QA pairs across 9 domains (Geography, History, Technology, Sports, Arts, Movies, Science, Finance, Food)
- 3 difficulty levels – Easy (33.9%), Medium (44.4%), Hard (21.7%)
- Median video duration: 242.2s | Median audio duration: 197.0s
- 98.6% require web search; 74.4% require code / computation
OmniAtlas is trained in two stages:
- Trajectory Synthesis & Supervised Learning – Gemini-3 provides step supervision while DeepSeek-V3.2 performs tool-augmented reasoning. Successful trajectories are used for SFT.
- OmniDPO: Fine-Grained Error Correction – Gemini-3 identifies and corrects errors in failed trajectories across perception, reasoning, and tool-use dimensions, producing preference pairs for DPO training.
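A preference pair built from one corrected trajectory step could look like the minimal sketch below. The record layout is assumed for illustration and is not taken from the released training code:

```python
# Hypothetical layout of one OmniDPO preference record: the failed step is
# the rejected response, the correction is the chosen one.
def make_preference_pair(context, rejected_step, corrected_step, error_type):
    """Assemble a DPO training record from a localized error correction."""
    assert error_type in {"perception", "reasoning", "tool-use"}
    return {
        "prompt": context,          # conversation up to the failing step
        "chosen": corrected_step,   # corrected action/response
        "rejected": rejected_step,  # original erroneous action/response
        "error_type": error_type,   # fine-grained error dimension
    }

pair = make_preference_pair(
    context="User: What city is shown at 0:42? ...",
    rejected_step="The skyline belongs to Sydney.",
    corrected_step="<tool_call>read_video(start=40, end=45)</tool_call>",
    error_type="perception",
)
```

Training on such pairs steers the policy toward re-examining the media rather than guessing.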
```bash
# Create conda environment
conda create -n omnigaia python=3.10
conda activate omnigaia

# Clone the repository
git clone https://github.com/RUC-NLPIR/OmniGAIA.git
cd OmniGAIA

# Install dependencies
pip install -r requirements.txt
```

ffmpeg is required for video/audio processing in OmniAtlas:

```bash
# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows (via Chocolatey)
choco install ffmpeg
```
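A missing ffmpeg binary tends to surface only later as media-processing timeouts, so a quick sanity check can save a wasted run. This is a generic snippet, not a script shipped with the repo:

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is on PATH and actually runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    # `ffmpeg -version` exits with status 0 when the install is functional.
    return subprocess.run([path, "-version"],
                          capture_output=True).returncode == 0

print("ffmpeg OK" if ffmpeg_available() else "ffmpeg missing")
```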
All runtime configuration is managed via `config/config.json`, where you need to set:

- Main agent endpoint (`agent.api_base_url`, `agent.api_key`, `agent.model_name`)
- Evaluation LLM endpoint (`evaluation.base_url`, `evaluation.api_key`, `evaluation.model`)
- Web search API key (`web_tools.serper_api_key`)
- Jina API key (`web_tools.jina_api_key`)
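Putting the four settings together, a filled-in `config/config.json` would look roughly like this; the endpoint URLs and model names below are placeholders for illustration, not defaults shipped with the repo:

```json
{
  "agent": {
    "api_base_url": "http://localhost:8000/v1",
    "api_key": "empty",
    "model_name": "omniatlas-30b"
  },
  "evaluation": {
    "base_url": "https://your-eval-endpoint/v1",
    "api_key": "YOUR_EVAL_API_KEY",
    "model": "deepseek-chat"
  },
  "web_tools": {
    "serper_api_key": "YOUR_SERPER_API_KEY",
    "jina_api_key": "YOUR_JINA_API_KEY"
  }
}
```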
Before running agents, ensure your LLM and auxiliary models are served via an OpenAI-compatible API (e.g. using vLLM, SGLang, or a cloud API):
```bash
# Example: serve OmniAtlas-Qwen3-30B-A3B with vLLM
vllm serve /path/to/your/OmniAtlas-Qwen3-30B-A3B \
    --served-model-name omniatlas-30b \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --uvicorn-log-level debug \
    --max-model-len 65536
```

Place the benchmark JSON and media files under the `data/` directory:
```
data/
├── test_metadata.json   # Benchmark questions
├── videos/              # Video files referenced in questions
├── audios/              # Audio files referenced in questions
└── images/              # Image files referenced in questions
```

`test_metadata.json` can be downloaded from: https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA/blob/main/raw/test_metadata.json

`videos/`, `audios/`, and `images/` can be downloaded from: https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA/tree/main/data_media_test
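Before launching a long run, it can be worth verifying this layout locally. The helper below is a generic sketch, not a script shipped with the repo:

```python
import os

def check_data_dir(root: str) -> list:
    """Return the expected entries missing under `root`."""
    expected = ["test_metadata.json", "videos", "audios", "images"]
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]

missing = check_data_dir("./data")
if missing:
    print("Missing under ./data:", ", ".join(missing))
else:
    print("data/ layout looks complete")
```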
The baseline agent supports both Gemini and Qwen model families. The model family is auto-detected from the --model_name argument.
```bash
# ── Run with Gemini ──────────────────────────────────────────────
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "https://your-gemini-endpoint/v1" \
    --model_name "gemini-3-flash" \
    --api_key "YOUR_API_KEY" \
    --concurrent_limit 16

# ── Run with Qwen (OpenAI-compatible endpoint) ──────────────────
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "http://localhost:8000/v1" \
    --model_name "qwen3-omni-30b-a3b-thinking" \
    --api_key "empty" \
    --concurrent_limit 16
```

Parameters:
| Parameter | Description |
|---|---|
| `--input_file` | Path to the benchmark JSON file |
| `--api_key` | API key for the model endpoint |
| `--api_base_url` | Base URL of the model API |
| `--model_name` | Model identifier (auto-selects Gemini vs Qwen agent) |
| `--level` | Filter by difficulty: Easy, Medium, or Hard |
| `--max_items` | Limit the number of items to process |
| `--concurrent_limit` | Maximum concurrent API calls (default: 5) |
| `--max_action_limit` | Maximum number of tool-call turns before forced final answer (default: 50) |
| `--use_asr` | Use Whisper ASR to convert audio to text (for text-only models) |
| `--enable-active-perception` | Enable `read_video` / `read_audio` / `read_image` tools (Qwen/OmniAtlas models only) |
| `--output_dir` | Directory for results (default: `./outputs`) |
| `--request_timeout` | Per-request timeout in seconds (default: 600) |
| `--forced_final_timeout` | Timeout for forced final answer after max turns (default: 300) |
| `--ffmpeg_timeout` | Timeout for ffmpeg-related media processing (default: 180) |
| `--item_timeout` | Max total processing time per item (default: 36000, i.e. 10 hours) |
| `--eval_timeout` | Timeout for LLM equivalence evaluation (default: 120) |
| `--skip_eval` | Skip LLM-based equivalence evaluation |
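The effect of `--concurrent_limit` can be sketched as bounding in-flight API calls with an `asyncio.Semaphore`. This is a generic concurrency pattern, not the repo's actual implementation:

```python
import asyncio

async def process_item(item: str) -> str:
    # Stand-in for one benchmark question handled end to end
    # (model calls, tool use, evaluation).
    await asyncio.sleep(0.01)
    return f"answered:{item}"

async def run_all(items, concurrent_limit: int = 5):
    """Process items with at most `concurrent_limit` running at once."""
    sem = asyncio.Semaphore(concurrent_limit)

    async def bounded(item):
        async with sem:
            return await process_item(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))

results = asyncio.run(run_all([f"q{i}" for i in range(8)], concurrent_limit=3))
print(results[:2])  # ['answered:q0', 'answered:q1']
```

Raising the limit increases throughput but also API load and peak memory, which is why long media items pair it with the various timeout flags.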
OmniAtlas behavior is enabled in run_base_agent.py via --enable-active-perception (Qwen/OmniAtlas models only). This allows the model to request specific video/audio/image segments during reasoning:
```bash
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "http://localhost:8000/v1" \
    --model_name "omniatlas-qwen-30b-a3b" \
    --api_key "empty" \
    --enable-active-perception \
    --concurrent_limit 16
```

`run_base_agent.py` automatically evaluates results after generation. The evaluation includes:
- Exact Match (EM): Normalised string comparison between the predicted answer and ground truth.
- LLM Equivalence: An LLM judge (e.g. DeepSeek-V3) determines whether the predicted answer is semantically equivalent to the ground truth.
Results and metrics are saved to the outputs/ directory.
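Exact-match normalisation typically lowercases, strips punctuation and articles, and collapses whitespace before comparing. The helper below follows that common recipe and is an approximation for illustration, not the repo's exact implementation:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize_answer(pred) == normalize_answer(gold)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(exact_match("Paris", "London"))                    # False
```

EM is strict by design; the LLM-equivalence judge catches paraphrases and unit variations that string normalisation cannot.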
To re-run evaluation on previously generated results (e.g. with a different evaluation model):
```bash
python src/evaluate/eval_results.py \
    --input_file ./outputs/base_agent_omniatlas-30b/run_20260101_120000_em0.2500_llmeq0.4000.json \
    --test_file_path ./data/test_metadata.json \
    --concurrent_limit 64
```

Parameters:
| Parameter | Description |
|---|---|
| `--input_file` | Path to the results JSON from a previous run |
| `--test_file_path` | (Optional) Original test JSON to recover missing category labels |
| `--concurrent_limit` | Maximum concurrent evaluation API calls (default: 64) |
Each run produces two files:
- `run_<timestamp>_em<score>_llmeq<score>.json` – Per-item results with predictions, messages, and scores.
- `run_<timestamp>_em<score>_llmeq<score>_metrics.json` – Aggregated metrics (overall, by difficulty level, and by category).
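Since the EM and LLM-equivalence scores are embedded in the filenames, they can be recovered with a small regex when comparing runs. This is a convenience snippet, not part of the repo:

```python
import re

def parse_run_filename(name: str):
    """Extract (timestamp, em, llm_eq) from a results filename, or None."""
    m = re.match(
        r"run_(?P<ts>\d{8}_\d{6})_em(?P<em>\d+\.\d+)_llmeq(?P<eq>\d+\.\d+)",
        name,
    )
    if m is None:
        return None
    return m["ts"], float(m["em"]), float(m["eq"])

print(parse_run_filename("run_20260101_120000_em0.2500_llmeq0.4000.json"))
# ('20260101_120000', 0.25, 0.4)
```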
Example metrics output:
```
==================================================
Total Items: 360
Average EM Score: 0.2500
Average LLM Equal Score: 0.4000
Average Tool Calls: 6.50
Non-Empty Answer Ratio: 0.9800
--------------------
Easy   (n=122): EM=0.3500, LLM_Eq=0.5200
Medium (n=160): EM=0.2300, LLM_Eq=0.3800
Hard   (n=78 ): EM=0.1400, LLM_Eq=0.2600
--------------------
Geo.   (n=69 ): EM=0.2800, LLM_Eq=0.4200
Tech.  (n=49 ): EM=0.2600, LLM_Eq=0.4100
...
==================================================
```
OmniGAIA agents are equipped with the following external tools:
| Tool | Description | Key Dependencies |
|---|---|---|
| Web Search | Google search via Serper API with result caching | aiohttp, Serper API |
| Page Browser | Fetch and extract webpage content via Jina Reader API | aiohttp, beautifulsoup4, Jina API |
| Code Executor | Sandboxed Python execution with common scientific libraries | Built-in (exec/eval) |
| Active Perception (OmniAtlas only) | `read_video`, `read_audio`, `read_image` – request specific media segments during reasoning | opencv-python, pydub, ffmpeg |
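The exec/eval approach behind the Code Executor can be sketched as follows. This is a simplified illustration of the pattern only; real sandboxing requires process isolation and resource limits, and the repo's implementation details may differ:

```python
import contextlib
import io

def run_code(code: str) -> str:
    """Execute Python source and capture anything it prints."""
    buf = io.StringIO()
    namespace = {}  # fresh namespace per call
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # NOTE: exec alone is not a real sandbox
    return buf.getvalue()

out = run_code("import math\nprint(round(math.pi, 2))")
print(out)  # 3.14
```

The agent sees the captured stdout as the tool result, so computations must print their final value.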
If you find this work helpful, please kindly cite our paper:
```bibtex
@misc{li2026omnigaia,
      title={OmniGAIA: Towards Native Omni-Modal AI Agents},
      author={Xiaoxi Li and Wenxiang Jiao and Jiarui Jin and Shijian Wang and Guanting Dong and Jiajie Jin and Hao Wang and Yinuo Wang and Ji-Rong Wen and Yuan Lu and Zhicheng Dou},
      year={2026},
      eprint={2602.22897},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.22897},
}
```
This project is released under the MIT License.
For any questions or feedback, please reach out to us at xiaoxi_li@ruc.edu.cn.



