🤗 Omnimodal-Agent-SFT-2K | 🤗 OmniAtlas-3B | 🤗 OmniAtlas-7B | 🤗 OmniAtlas-30B-A3B
- [Feb 27, 2026]: Our paper is now available on arXiv and Hugging Face.
- [Feb 27, 2026]: Our OmniGAIA benchmark and OmniAtlas models are now available on Hugging Face.
- [Feb 27, 2026]: Full codebase released. You can now deploy omni-modal AI agents for images, audio, and video, along with research tools.
Demo videos: `demo_image_audio.mp4`, `demo_video.mp4`
OmniGAIA is a comprehensive benchmark designed to evaluate the capabilities of omni-modal general AI assistants. Unlike existing benchmarks that focus on a single modality, OmniGAIA requires agents to jointly reason over video, audio, and image inputs while leveraging external tools such as web search and code execution.
We also introduce OmniAtlas, an agentic reasoning system that extends a base LLM with active perception tools, enabling the model to request and examine additional media segments during multi-step reasoning.
The OmniGAIA construction pipeline consists of four stages:
- Data Collection – Curating video (with audio) and image+audio sources from FineVideo, LongVideoBench, LongVideo-Reason, COCO 2017, and HuggingFace, covering 100+ diverse domains.
- Valuable Information Discovery – Using Gemini-3-Flash to extract events, environmental analysis, audio analysis (ASR, speaker ID), and image understanding (OCR, objects, faces).
- Agentic Omni-Modal Event Graph Construction – DeepSeek-V3.2 iteratively expands an initial event graph by planning next steps, acquiring new information via tools, and verifying factual correctness with LLM self-reflection and human review.
- QA Generation & Quality Review – Generating difficult, multi-hop QA pairs through event fuzzification, followed by LLM and human verification of correctness, task difficulty, and answer uniqueness.
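For illustration, a single node in the omni-modal event graph produced by stage three might be represented as below. The field names and helper are our own assumptions for exposition, not the pipeline's actual schema:

```python
# Hypothetical event-graph node linking evidence across modalities.
# Field names are illustrative assumptions, not the pipeline's real schema.

def make_event_node(event_id, description, modality, source, timestamp=None):
    """Build one node of an omni-modal event graph."""
    return {
        "id": event_id,
        "description": description,  # natural-language event summary
        "modality": modality,        # "video" | "audio" | "image" | "web"
        "source": source,            # media file or URL the event came from
        "timestamp": timestamp,      # seconds into the media, if applicable
        "links": [],                 # ids of related events (graph edges)
    }

# Two events discovered in different modalities, then linked by an edge:
seen = make_event_node("e1", "A rocket lifts off from a coastal pad", "video",
                       "videos/launch.mp4", timestamp=42.0)
heard = make_event_node("e2", "Announcer names the launch site", "audio",
                        "videos/launch.mp4", timestamp=40.5)
seen["links"].append(heard["id"])
```

Multi-hop QA pairs are then grounded in paths through such cross-modal links.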
Core statistics:
- 360 QA pairs across 9 domains (Geography, History, Technology, Sports, Arts, Movies, Science, Finance, Food)
- 3 difficulty levels – Easy (33.9%), Medium (44.4%), Hard (21.7%)
- Median video duration: 242.2s | Median audio duration: 197.0s
- 98.6% require web search; 74.4% require code / computation
OmniAtlas is trained in two stages:
- Trajectory Synthesis & Supervised Learning – Gemini-3 provides step supervision while DeepSeek-V3.2 performs tool-augmented reasoning. Successful trajectories are used for SFT.
- OmniDPO: Fine-Grained Error Correction – Gemini-3 identifies and corrects errors in failed trajectories across perception, reasoning, and tool-use dimensions, producing preference pairs for DPO training.
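A preference pair built from one corrected trajectory step could look like the minimal sketch below. The record layout is assumed for illustration and is not taken from the released training code:

```python
# Hypothetical layout of one OmniDPO preference record: the failed step is
# the rejected response, the correction is the chosen one.
def make_preference_pair(context, rejected_step, corrected_step, error_type):
    """Assemble a DPO training record from a localized error correction."""
    assert error_type in {"perception", "reasoning", "tool-use"}
    return {
        "prompt": context,          # conversation up to the failing step
        "chosen": corrected_step,   # corrected action/response
        "rejected": rejected_step,  # original erroneous action/response
        "error_type": error_type,   # fine-grained error dimension
    }

pair = make_preference_pair(
    context="User: What city is shown at 0:42? ...",
    rejected_step="The skyline belongs to Sydney.",
    corrected_step="<tool_call>read_video(start=40, end=45)</tool_call>",
    error_type="perception",
)
```

Training on such pairs steers the policy toward re-examining the media rather than guessing.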
```bash
# Create conda environment
conda create -n omnigaia python=3.10
conda activate omnigaia

# Clone the repository
git clone https://github.com/RUC-NLPIR/OmniGAIA.git
cd OmniGAIA

# Install dependencies
pip install -r requirements.txt
```

ffmpeg is required for video/audio processing in OmniAtlas:

```bash
# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows (via Chocolatey)
choco install ffmpeg
```
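A missing ffmpeg binary tends to surface only later as media-processing timeouts, so a quick sanity check can save a wasted run. This is a generic snippet, not a script shipped with the repo:

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is on PATH and actually runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    # `ffmpeg -version` exits with status 0 when the install is functional.
    return subprocess.run([path, "-version"],
                          capture_output=True).returncode == 0

print("ffmpeg OK" if ffmpeg_available() else "ffmpeg missing")
```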
All runtime configuration is managed via `config/config.json`, where you need to set:

- Main agent endpoint (`agent.api_base_url`, `agent.api_key`, `agent.model_name`)
- Evaluation LLM endpoint (`evaluation.base_url`, `evaluation.api_key`, `evaluation.model`)
- Web search API key (`web_tools.serper_api_key`)
- Jina API key (`web_tools.jina_api_key`)
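Putting the four settings together, a filled-in `config/config.json` would look roughly like this; the endpoint URLs and model names below are placeholders for illustration, not defaults shipped with the repo:

```json
{
  "agent": {
    "api_base_url": "http://localhost:8000/v1",
    "api_key": "empty",
    "model_name": "omniatlas-30b"
  },
  "evaluation": {
    "base_url": "https://your-eval-endpoint/v1",
    "api_key": "YOUR_EVAL_API_KEY",
    "model": "deepseek-chat"
  },
  "web_tools": {
    "serper_api_key": "YOUR_SERPER_API_KEY",
    "jina_api_key": "YOUR_JINA_API_KEY"
  }
}
```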
Before running agents, ensure your LLM and auxiliary models are served via an OpenAI-compatible API (e.g. using vLLM, SGLang, or a cloud API):
```bash
# Example: serve OmniAtlas-Qwen3-30B-A3B with vLLM
vllm serve /path/to/your/OmniAtlas-Qwen3-30B-A3B \
    --served-model-name omniatlas-30b \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --uvicorn-log-level debug \
    --max-model-len 65536
```

Place the benchmark JSON and media files under the `data/` directory:
```
data/
├── test_metadata.json   # Benchmark questions
├── videos/              # Video files referenced in questions
├── audios/              # Audio files referenced in questions
└── images/              # Image files referenced in questions
```

`test_metadata.json` can be downloaded from: https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA/blob/main/raw/test_metadata.json

`videos/`, `audios/`, and `images/` can be downloaded from: https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA/tree/main/data_media_test
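Before launching a long run, it can be worth verifying this layout locally. The helper below is a generic sketch, not a script shipped with the repo:

```python
import os

def check_data_dir(root: str) -> list:
    """Return the expected entries missing under `root`."""
    expected = ["test_metadata.json", "videos", "audios", "images"]
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]

missing = check_data_dir("./data")
if missing:
    print("Missing under ./data:", ", ".join(missing))
else:
    print("data/ layout looks complete")
```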
The baseline agent supports both Gemini and Qwen model families. The model family is auto-detected from the --model_name argument.
```bash
# ── Run with Gemini ──────────────────────────────────────────────
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "https://your-gemini-endpoint/v1" \
    --model_name "gemini-3-flash" \
    --api_key "YOUR_API_KEY" \
    --concurrent_limit 16

# ── Run with Qwen (OpenAI-compatible endpoint) ──────────────────
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "http://localhost:8000/v1" \
    --model_name "qwen3-omni-30b-a3b-thinking" \
    --api_key "empty" \
    --concurrent_limit 16
```

Parameters:
| Parameter | Description |
|---|---|
| `--input_file` | Path to the benchmark JSON file |
| `--api_key` | API key for the model endpoint |
| `--api_base_url` | Base URL of the model API |
| `--model_name` | Model identifier (auto-selects Gemini vs Qwen agent) |
| `--level` | Filter by difficulty: Easy, Medium, or Hard |
| `--max_items` | Limit the number of items to process |
| `--concurrent_limit` | Maximum concurrent API calls (default: 5) |
| `--max_action_limit` | Maximum number of tool-call turns before forced final answer (default: 50) |
| `--use_asr` | Use Whisper ASR to convert audio to text (for text-only models) |
| `--enable-active-perception` | Enable `read_video` / `read_audio` / `read_image` tools (Qwen/OmniAtlas models only) |
| `--output_dir` | Directory for results (default: `./outputs`) |
| `--request_timeout` | Per-request timeout in seconds (default: 600) |
| `--forced_final_timeout` | Timeout for forced final answer after max turns (default: 300) |
| `--ffmpeg_timeout` | Timeout for ffmpeg-related media processing (default: 180) |
| `--item_timeout` | Max total processing time per item (default: 36000, i.e. 10 hours) |
| `--eval_timeout` | Timeout for LLM equivalence evaluation (default: 120) |
| `--skip_eval` | Skip LLM-based equivalence evaluation |
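The effect of `--concurrent_limit` can be sketched as bounding in-flight API calls with an `asyncio.Semaphore`. This is a generic concurrency pattern, not the repo's actual implementation:

```python
import asyncio

async def process_item(item: str) -> str:
    # Stand-in for one benchmark question handled end to end
    # (model calls, tool use, evaluation).
    await asyncio.sleep(0.01)
    return f"answered:{item}"

async def run_all(items, concurrent_limit: int = 5):
    """Process items with at most `concurrent_limit` running at once."""
    sem = asyncio.Semaphore(concurrent_limit)

    async def bounded(item):
        async with sem:
            return await process_item(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))

results = asyncio.run(run_all([f"q{i}" for i in range(8)], concurrent_limit=3))
print(results[:2])  # ['answered:q0', 'answered:q1']
```

Raising the limit increases throughput but also API load and peak memory, which is why long media items pair it with the various timeout flags.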
OmniAtlas behavior is enabled in run_base_agent.py via --enable-active-perception (Qwen/OmniAtlas models only). This allows the model to request specific video/audio/image segments during reasoning:
```bash
python src/run_base_agent.py \
    --input_file ./data/test_metadata.json \
    --api_base_url "http://localhost:8000/v1" \
    --model_name "omniatlas-qwen-30b-a3b" \
    --api_key "empty" \
    --enable-active-perception \
    --concurrent_limit 16
```

`run_base_agent.py` automatically evaluates results after generation. The evaluation includes:
- Exact Match (EM): Normalised string comparison between the predicted answer and ground truth.
- LLM Equivalence: An LLM judge (e.g. DeepSeek-V3) determines whether the predicted answer is semantically equivalent to the ground truth.
Results and metrics are saved to the outputs/ directory.
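Exact-match normalisation typically lowercases, strips punctuation and articles, and collapses whitespace before comparing. The helper below follows that common recipe and is an approximation for illustration, not the repo's exact implementation:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize_answer(pred) == normalize_answer(gold)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(exact_match("Paris", "London"))                    # False
```

EM is strict by design; the LLM-equivalence judge catches paraphrases and unit variations that string normalisation cannot.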
To re-run evaluation on previously generated results (e.g. with a different evaluation model):
```bash
python src/evaluate/eval_results.py \
    --input_file ./outputs/base_agent_omniatlas-30b/run_20260101_120000_em0.2500_llmeq0.4000.json \
    --test_file_path ./data/test_metadata.json \
    --concurrent_limit 64
```

Parameters:
| Parameter | Description |
|---|---|
| `--input_file` | Path to the results JSON from a previous run |
| `--test_file_path` | (Optional) Original test JSON to recover missing category labels |
| `--concurrent_limit` | Maximum concurrent evaluation API calls (default: 64) |
Each run produces two files:
- `run_<timestamp>_em<score>_llmeq<score>.json` – Per-item results with predictions, messages, and scores.
- `run_<timestamp>_em<score>_llmeq<score>_metrics.json` – Aggregated metrics (overall, by difficulty level, and by category).
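Since the EM and LLM-equivalence scores are embedded in the filenames, they can be recovered with a small regex when comparing runs. This is a convenience snippet, not part of the repo:

```python
import re

def parse_run_filename(name: str):
    """Extract (timestamp, em, llm_eq) from a results filename, or None."""
    m = re.match(
        r"run_(?P<ts>\d{8}_\d{6})_em(?P<em>\d+\.\d+)_llmeq(?P<eq>\d+\.\d+)",
        name,
    )
    if m is None:
        return None
    return m["ts"], float(m["em"]), float(m["eq"])

print(parse_run_filename("run_20260101_120000_em0.2500_llmeq0.4000.json"))
# ('20260101_120000', 0.25, 0.4)
```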
Example metrics output:
```
==================================================
Total Items: 360
Average EM Score: 0.2500
Average LLM Equal Score: 0.4000
Average Tool Calls: 6.50
Non-Empty Answer Ratio: 0.9800
--------------------
Easy   (n=122): EM=0.3500, LLM_Eq=0.5200
Medium (n=160): EM=0.2300, LLM_Eq=0.3800
Hard   (n=78 ): EM=0.1400, LLM_Eq=0.2600
--------------------
Geo.   (n=69 ): EM=0.2800, LLM_Eq=0.4200
Tech.  (n=49 ): EM=0.2600, LLM_Eq=0.4100
...
==================================================
```
OmniGAIA agents are equipped with the following external tools:
| Tool | Description | Key Dependencies |
|---|---|---|
| Web Search | Google search via Serper API with result caching | aiohttp, Serper API |
| Page Browser | Fetch and extract webpage content via Jina Reader API | aiohttp, beautifulsoup4, Jina API |
| Code Executor | Sandboxed Python execution with common scientific libraries | Built-in (exec/eval) |
| Active Perception (OmniAtlas only) | `read_video`, `read_audio`, `read_image` – request specific media segments during reasoning | opencv-python, pydub, ffmpeg |
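The exec/eval approach behind the Code Executor can be sketched as follows. This is a simplified illustration of the pattern only; real sandboxing requires process isolation and resource limits, and the repo's implementation details may differ:

```python
import contextlib
import io

def run_code(code: str) -> str:
    """Execute Python source and capture anything it prints."""
    buf = io.StringIO()
    namespace = {}  # fresh namespace per call
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # NOTE: exec alone is not a real sandbox
    return buf.getvalue()

out = run_code("import math\nprint(round(math.pi, 2))")
print(out)  # 3.14
```

The agent sees the captured stdout as the tool result, so computations must print their final value.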
If you find this work helpful, please kindly cite our paper:
```bibtex
@misc{li2026omnigaia,
      title={OmniGAIA: Towards Native Omni-Modal AI Agents},
      author={Xiaoxi Li and Wenxiang Jiao and Jiarui Jin and Shijian Wang and Guanting Dong and Jiajie Jin and Hao Wang and Yinuo Wang and Ji-Rong Wen and Yuan Lu and Zhicheng Dou},
      year={2026},
      eprint={2602.22897},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.22897},
}
```
This project is released under the MIT License.
For any questions or feedback, please reach out to us at xiaoxi_li@ruc.edu.cn.



