Try our Demo!
MiroThinker is the official implementation of the MiroMind Research Agent Project. It is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities, enabling complex real-world research workflows across diverse challenges.
The project currently comprises four key components:
- MiroThinker: An open-source research agent model that natively supports tool-assisted reasoning, achieving state-of-the-art performance across multiple benchmarks (e.g., HLE, HLE-Text-2158, HLE-Text-500, BrowseComp, BrowseComp-ZH, GAIA, xBench-DeepSearch, FutureX, and Frames). See Quick Start.
- MiroFlow: An open-source research agent framework that offers reproducible state-of-the-art performance across multiple benchmarks. See MiroFlow for details.
- MiroVerse: A premium open-source training dataset with 147k samples supporting research agent training. See MiroVerse on HuggingFace.
- MiroTrain / MiroRL: Training infrastructure that supports stable and efficient training of research agent models. See MiroTrain and MiroRL for details.
- News & Updates
- Introduction
- Key Features
- Performance on Benchmarks
- Quick Start
- Trace Collection
- FAQ & Troubleshooting
- License
- Acknowledgments
- [2025-11-13] MiroThinker-v1.0 is now released! Introducing interactive scaling as a third dimension of performance improvement, MiroThinker v1.0 supports a 256K context window and up to 600 tool calls per task. Available at 8B, 30B, and 72B parameter scales, it achieves 37.7%, 47.1%, 55.6%, and 81.9% on HLE-Text, BrowseComp, BrowseComp-ZH, and GAIA-Text-103, respectively. See the Technical Report for more details.
- [2025-09-11] MiroThinker-72B-Preview ranked 4th in this week's FutureX benchmark. See FutureX.
- [2025-09-08] MiroThinker-v0.2 is now released, achieving open-source SOTA performance across multiple benchmarks, including HLE (17.8%), HLE-Text-Only (19.1%), BrowseComp-EN (17.2%), BrowseComp-ZH (29.4%), xBench-DeepSearch (56.0%), and Frames (74.8%).
- [2025-09-07] We added support for more benchmarks, including BrowseComp-ZH, XBench-DeepSearch, and FutureX. We plan to add more benchmarks in the future.
- [2025-08-22] Streamlined deployment options for MiroThinker models with optimized resource usage and faster startup times. Experience the interactive demo: Try Gradio Demo
- [2025-08-08] MiroThinker-v0.1 released. Models, framework, and data are now fully open-sourced!
Unlike previous agents that scale only model size or context length, MiroThinker v1.0 introduces interactive scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories.
- 256K Context Window: Supports long-horizon reasoning and deep multi-step analysis
- 600 Tool Calls: Handles up to 600 tool calls per task, a substantial improvement over previous open-source research agents
- Multiple Scales: Released at 8B, 30B, and 72B parameter scales, accompanied by a comprehensive suite of tools and workflows to flexibly support diverse research settings and compute budgets
| Model Name | Base Model | Max Length | Max Tool Calls | HF Link |
|---|---|---|---|---|
| MiroThinker-v1.0-8B | Qwen3-8B | 256K | 600 | link |
| MiroThinker-v1.0-30B | Qwen3-30B-A3B-Thinking-2507 | 256K | 600 | link |
| MiroThinker-v1.0-72B | Qwen2.5-72B-Instruct | 256K | 600 | link |
MiroThinker v1.0 demonstrates strong general-research performance across a broad range of benchmarks, achieving 37.7%, 47.1%, 55.6%, and 81.9% on HLE-Text, BrowseComp, BrowseComp-ZH, and GAIA-Text-103, respectively. These results surpass previous open-source agents and narrow the gap with commercial counterparts such as GPT-5-high.
Click to expand MiroThinker-v0.2 details
In this new version, we introduced three key improvements:
- Richer training data from both English and Chinese sources, yielding significant gains in benchmark performance and generalization
- Unified DPO training with a single preference dataset across all models
- Extended context length from 40K to 64K for more challenging multi-turn tool-use tasks

Compared to v0.1, MiroThinker v0.2 delivers consistent gains across benchmarks. For example, scores improved from 57.3 to 64.1 on GAIA-Text-103 and from 17.0 to 29.4 on BrowseComp-ZH, reflecting substantial advancements in the model's general research agent capabilities.
| Model Name | Base Model | Max Length | HF Link |
|---|---|---|---|
| MiroThinker-4B-SFT-v0.2 | Qwen3-4B | 64K | link |
| MiroThinker-4B-DPO-v0.2 | Qwen3-4B | 64K | link |
| MiroThinker-8B-SFT-v0.2 | Qwen3-8B | 64K | link |
| MiroThinker-8B-DPO-v0.2 | Qwen3-8B | 64K | link |
| MiroThinker-14B-SFT-v0.2 | Qwen3-14B | 64K | link |
| MiroThinker-14B-DPO-v0.2 | Qwen3-14B | 64K | link |
| MiroThinker-32B-SFT-v0.2 | Qwen3-32B | 64K | link |
| MiroThinker-32B-DPO-v0.2 | Qwen3-32B | 64K | link |
Click to expand MiroThinker-v0.1 details
We have released the MiroThinker v0.1 series, including both SFT and DPO variants at parameter scales of 8B, 14B, and 32B. Notably, MiroThinker v0.1 achieves state-of-the-art performance among open-source models on the GAIA benchmark, a rigorous evaluation suite for advanced agentic capabilities, demonstrating its strength in long-context, decision-intensive, and real-world task scenarios.
| Model Name | Base Model | Max Length | HF Link |
|---|---|---|---|
| MiroThinker-8B-SFT-v0.1 | Qwen3-8B | 40K | link |
| MiroThinker-8B-DPO-v0.1 | Qwen3-8B | 40K | link |
| MiroThinker-14B-SFT-v0.1 | Qwen3-14B | 40K | link |
| MiroThinker-14B-DPO-v0.1 | Qwen3-14B | 40K | link |
| MiroThinker-32B-SFT-v0.1 | Qwen3-32B | 40K | link |
| MiroThinker-32B-DPO-v0.1 | Qwen3-32B | 40K | link |
- Fully Open-Source Agent Framework: Complete transparency with an open framework and open models
- Tool Integration: Seamless integration with external tools and APIs
- Trace Collection: Comprehensive logging and analysis of agent interactions, with elapsed time and estimated completion time displayed in minutes; collected traces are ready for SFT and DPO
- Benchmark Evaluation: Extensive testing across multiple benchmark datasets

Click to expand benchmark list
- GAIA Validation: A benchmark for General AI Assistants. (paper)
- GAIA-Text-103: A subset of GAIA Validation for text-only tasks. (paper)
- HLE: Humanity's Last Exam. (paper)
- HLE-Text-2158: A subset of HLE for text-only tasks. (paper)
- HLE-Text-500: A subset of HLE for text-only tasks, created by WebThinker. (paper)
- BrowseComp-EN: Web browsing and comprehension tasks. (paper)
- BrowseComp-ZH: A Chinese version of BrowseComp. (paper)
- WebWalkerQA: Web navigation and question answering. (paper)
- Frames: Factuality, Retrieval, And reasoning MEasurement Set. (paper)
- XBench-DeepSearch: A benchmark for deep research agents. (website)
- FutureX: A live benchmark designed for predicting unknown future events. (website)
- SEAL-0: A benchmark for evaluating LLMs on conflicting-evidence web questions. (paper)
- AIME2025: American Invitational Mathematics Examination 2025. (website)
Click to expand MiroThinker-v0.1 details
| Method | Text-103 Best Pass@1 | Text-103 Pass@1 (Avg@8) | Val-165 Best Pass@1 | Val-165 Pass@1 (Avg@8) |
|---|---|---|---|---|
| **7B/8B Models** | | | | |
| Search-o1-7B | 17.5 | - | - | - |
| R1-Searcher-7B | 20.4 | - | - | - |
| WebDancer-7B | 31.0 | - | - | - |
| WebSailor-7B | 37.9 | - | - | - |
| CK-Pro-8B | 40.3 | - | 32.7 | - |
| MiroThinker-8B-SFT-v0.1 | 44.7 | 40.1 | 34.6 | 31.8 |
| + Commercial Tools | 46.6 | 42.1 | 37.6 | 33.9 |
| MiroThinker-8B-DPO-v0.1 | 46.6 | 44.8 | 37.0 | 35.4 |
| + Commercial Tools | 50.5 | 46.7 | 38.2 | 35.9 |
| **14B Models** | | | | |
| MiroThinker-14B-SFT-v0.1 | 47.6 | 44.4 | 37.0 | 34.4 |
| + Commercial Tools | 49.5 | 47.5 | 41.8 | 39.8 |
| MiroThinker-14B-DPO-v0.1 | 48.5 | 46.6 | 42.4 | 39.2 |
| + Commercial Tools | 52.4 | 48.5 | 45.5 | 42.0 |
| **32B Models** | | | | |
| Qwen3-32B | 31.1 | 26.7 | 29.7 | 26.4 |
| Search-o1-32B | 28.2 | - | - | - |
| WebThinker-32B-RL | 48.5 | - | - | - |
| WebDancer-QwQ-32B | 51.5 | - | - | - |
| WebSailor-32B | 53.2 | - | - | - |
| WebShaper-QwQ-32B | 53.3 | - | - | - |
| MiroThinker-32B-SFT-v0.1 | 55.3 | 51.3 | 44.9 | 42.7 |
| + Commercial Tools | 58.3 | 54.2 | 48.5 | 45.8 |
| MiroThinker-32B-DPO-v0.1 | 57.3 | 54.1 | 48.5 | 45.9 |
| + Commercial Tools | 60.2 | 57.9 | 50.9 | 48.9 |
- Following the practices of WebThinker, WebAgents, and CognitiveKernel, we report the Best Pass@1, the highest score across three runs, which often reflects stronger performance, though it may exhibit some variability. To provide a more stable measure, we additionally report Pass@1 (Avg@8), which offers greater consistency at the cost of slightly lower scores.
- For consistency with prior open-source works, we evaluate GAIA-Text-103 using the WebAgents LLM-as-judge template, and report results on GAIA-Val-165 using the official GAIA scorer script.
- By default, we use open-source tools wherever possible, except for the code tool E2B and the Google search tool Serper. We use Whisper, Qwen2.5-VL-72B-Instruct, and Qwen3-235B-A22B-Thinking-2507 in our implementation. The framework can be easily extended to other open-source tools of your choice.
- Replacing these open-source tools with commercial alternatives can yield performance gains. Commercial tools were mainly used for multimodal capabilities and certain complex reasoning subtasks. The majority of tasks, including planning, browsing, refinement, navigation, and more, were handled by our models.
| Method | HLE Pass@1 | Frames Pass@1 | BrowseComp Pass@1 | BrowseComp-ZH Pass@1 | WebWalkerQA Pass@1 |
|---|---|---|---|---|---|
| OpenAI Deep Research | 26.6 | - | 51.5 | 42.9 | - |
| Gemini Deep Research | 26.9 | - | - | - | - |
| Kimi-Researcher | 26.9 | 78.8 | - | - | - |
| WebDancer-7B | - | - | - | - | 36.0 |
| WebSailor-7B | - | - | 6.7 | 14.2 | - |
| MiroThinker-8B-SFT-v0.1 | - | 58.0 | 5.5 | 9.3 | 41.3 |
| MiroThinker-8B-DPO-v0.1 | - | 64.4 | 8.7 | 13.6 | 45.7 |
| WebThinker-32B-RL | - | - | - | - | 46.5 |
| WebDancer-QwQ-32B | - | - | 3.8 | 18.0 | 47.9 |
| WebSailor-32B | - | - | 10.5 | 25.5 | - |
| WebShaper-32B | - | - | - | - | 51.4 |
| MiroThinker-32B-SFT-v0.1 | 10.2 | 70.4 | 10.6 | 13.8 | 45.7 |
| MiroThinker-32B-DPO-v0.1 | 11.8 | 71.7 | 13.0 | 17.0 | 49.3 |
- MiroThinker's performance was tested with this repository and open-source tools; other models' results are taken from their papers and official sites.
- As MiroVerse-v0.1 mainly contains English data, the model's Chinese capability is limited. We plan to add more Chinese data to improve performance in the next version.
For the fastest setup with minimal configuration:
# 1. Clone and setup
git clone https://github.com/MiroMindAI/MiroThinker
cd MiroThinker/apps/miroflow-agent
uv sync
# 2. Configure minimal environment (MiroThinker v1.0)
cp .env.example .env
# Edit .env with these required keys:
# - SERPER_API_KEY (for Google search)
# - JINA_API_KEY (for web scraping)
# - E2B_API_KEY (for code execution)
# - SUMMARY_LLM_BASE_URL, SUMMARY_LLM_MODEL_NAME, SUMMARY_LLM_API_KEY (for LLM summarization)
# - OPENAI_API_KEY (required for benchmark evaluation, used for LLM-as-a-Judge)
# 3. Serve your model (or use existing API)
# See "Serve the MiroThinker Model" section below
# 4. Run evaluation
uv run main.py llm=qwen-3 agent=single_agent_keep5 llm.base_url=https://your_base_url/v1

**Minimal Configuration:** MiroThinker v1.0 uses only 3 MCP servers: `search_and_scrape_webpage`, `jina_scrape_llm_summary`, and `tool-python`. This is the simplest setup. See Tool Configuration for details.
- Python 3.10+
- uv package manager (Installation guide; a quick version check is sketched below)
- Required API keys (see the configuration section below)
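A quick way to confirm these prerequisites before cloning; the `uv` one-liner mirrors Astral's documented install script, but prefer the linked Installation guide if in doubt:

```bash
# Quick prerequisite check (assumes a Unix-like shell).
python3 --version   # should report Python 3.10 or newer
uv --version        # if this fails, install uv, e.g.:
# curl -LsSf https://astral.sh/uv/install.sh | sh
```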
git clone https://github.com/MiroMindAI/MiroThinker
cd MiroThinker

wget https://huggingface.co/datasets/miromind-ai/MiroFlow-Benchmarks/resolve/main/data_20251115_password_protected.zip
unzip data_20251115_password_protected.zip
# The unzip passcode is: pf4*
rm data_20251115_password_protected.zip

**Password:** The unzip passcode is `pf4*`.
# Shift working dir
cd apps/miroflow-agent
# Install environment
uv sync
# Create .env file with your API keys
cp .env.example .env
# Edit .env with your actual API keys based on your chosen configuration

**Environment Variables:** The `.env.example` file contains all available environment variables. Configure the variables according to the tools used in your chosen agent configuration (see the Tool Configuration section).
| Server | Description | Tools Provided | Required Environment Variables |
|---|---|---|---|
| `tool-python` | Execution environment and file management (E2B sandbox) | `create_sandbox`, `run_command`, `run_python_code`, `upload_file_from_local_to_sandbox`, `download_file_from_sandbox_to_local`, `download_file_from_internet_to_sandbox` | `E2B_API_KEY` |
| `search_and_scrape_webpage` | Google search via Serper API | `google_search` | `SERPER_API_KEY`, `SERPER_BASE_URL` |
| `jina_scrape_llm_summary` | Web scraping with LLM-based information extraction | `scrape_and_extract_info` | `JINA_API_KEY`, `JINA_BASE_URL`, `SUMMARY_LLM_BASE_URL`, `SUMMARY_LLM_MODEL_NAME`, `SUMMARY_LLM_API_KEY` |
Minimal `.env` configuration example:
# Required for MiroThinker v1.0 (minimal setup)
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL="https://google.serper.dev"
JINA_API_KEY=your_jina_key
JINA_BASE_URL="https://r.jina.ai"
E2B_API_KEY=your_e2b_key
# Required for jina_scrape_llm_summary
SUMMARY_LLM_BASE_URL=your_llm_base_url
SUMMARY_LLM_MODEL_NAME=your_llm_model_name
SUMMARY_LLM_API_KEY=your_llm_api_key # Optional, depends on LLM provider
# Required for benchmark evaluation (LLM-as-a-Judge)
OPENAI_API_KEY=your_openai_key # Required for running benchmark evaluations

**Why this is minimal:** These 3 MCP servers cover the core capabilities needed for research tasks: web search, content extraction, and code execution. Each server provides multiple tools. All other servers are optional enhancements.
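As a convenience, a small shell sketch like the one below (not part of the repository; variable names taken from the example above) can confirm the minimal keys are populated before you launch a run:

```bash
# Hypothetical helper: run from apps/miroflow-agent after filling in .env.
set -a; source .env; set +a
for var in SERPER_API_KEY JINA_API_KEY E2B_API_KEY \
           SUMMARY_LLM_BASE_URL SUMMARY_LLM_MODEL_NAME; do
  if [ -z "${!var}" ]; then
    echo "MISSING: $var"   # key still empty in .env
  else
    echo "OK:      $var"
  fi
done
```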
**For Benchmark Evaluation:** If you plan to run benchmark evaluations, you also need `OPENAI_API_KEY` for the LLM-as-a-Judge functionality used in the evaluation scripts.

**For more details:** See the MiroFlow Tools README for complete documentation of all available tools.

Click to expand additional available tools
The following optional tools are available but were not used in MiroThinker v1.0 evaluation:
| Server Name | Type | Description |
|---|---|---|
| `tool-vqa` | Commercial | Vision processing using Claude |
| `tool-vqa-os` | Open-Source | Vision processing (open-source alternative) |
| `tool-transcribe` | Commercial | Audio transcription using OpenAI |
| `tool-transcribe-os` | Open-Source | Audio transcription using Whisper |
| `tool-reasoning` | Commercial | Reasoning engine using Claude |
| `tool-reasoning-os` | Open-Source | Reasoning engine (open-source alternative) |
| `tool-reading` | Open-Source | Document reading using MarkItDown |
| `tool-google-search` | Commercial | Web search using Google + scraping |
| `tool-sougou-search` | Commercial | Web search using Sougou (Chinese) |
**Local Deployment:** For instructions on deploying the open-source tools (`tool-vqa-os`, `tool-transcribe-os`, `tool-reasoning-os`) locally, see the Local Tool Deployment Guide.
See the MiroFlow Tools README for complete documentation of all available tools.
Click to expand pre-configured agent settings table

The `apps/miroflow-agent/conf/agent/` directory contains several pre-configured agent settings. Each configuration uses different tools and requires the corresponding environment variables in your `.env` file.

**Recommended:** For MiroThinker v1.0, use `single_agent` or `single_agent_keep5` (minimal configuration with only 3 MCP servers).
| Configuration File | Description | Max Turns | Context Retention | Required Environment Variables | Recommended For |
|---|---|---|---|---|---|
| `single_agent.yaml` | Single-agent configuration used in MiroThinker v1.0 (minimal setup) | 600 | Keep all results | `SERPER_API_KEY`, `SERPER_BASE_URL`, `JINA_API_KEY`, `JINA_BASE_URL`, `E2B_API_KEY`, `SUMMARY_LLM_BASE_URL`, `SUMMARY_LLM_MODEL_NAME`, `SUMMARY_LLM_API_KEY` | v1.0 (default) |
| `single_agent_keep5.yaml` | Single-agent with recency-based context retention (minimal setup) | 600 | Keep 5 most recent | Same as `single_agent.yaml` | v1.0 (recommended) |
| `multi_agent.yaml` | Multi-agent with commercial tools (v0.1/v0.2) | 50 | Keep all results | `E2B_API_KEY`, `ANTHROPIC_API_KEY`, `ANTHROPIC_BASE_URL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL`, `SERPER_API_KEY`, `SERPER_BASE_URL`, `JINA_API_KEY`, `JINA_BASE_URL` | v0.1/v0.2 |
| `multi_agent_os.yaml` | Multi-agent with open-source tools (v0.1/v0.2) | 50 | Keep all results | `E2B_API_KEY`, `VISION_API_KEY`, `VISION_BASE_URL`, `VISION_MODEL_NAME`, `WHISPER_API_KEY`, `WHISPER_BASE_URL`, `WHISPER_MODEL_NAME`, `REASONING_API_KEY`, `REASONING_BASE_URL`, `REASONING_MODEL_NAME`, `SERPER_API_KEY`, `SERPER_BASE_URL`, `JINA_API_KEY`, `JINA_BASE_URL` | v0.1/v0.2 |
**Note:** All environment variables are listed in `apps/miroflow-agent/.env.example`. Copy it to `.env` and fill in the values for the tools you plan to use.

Click to expand custom tool configuration guide
You can create your own YAML configuration file to freely combine MCP servers. Here's how:
1. Create a new YAML file in `apps/miroflow-agent/conf/agent/`:
# conf/agent/my_custom_config.yaml
defaults:
- default
- _self_
main_agent:
tools:
- tool-python # Execution environment
- search_and_scrape_webpage # Google search
- jina_scrape_llm_summary # Web scraping with LLM
- tool-vqa # Vision processing (optional)
- tool-transcribe # Audio processing (optional)
- tool-reasoning # Reasoning engine (optional)
- tool-reading # Document reading (optional)
max_turns: 400 # Maximum number of turns
sub_agents:
agent-browsing: # Optional sub-agent
tools:
- tool-google-search
- tool-vqa
- tool-reading
- tool-python
max_turns: 50
keep_tool_result: -1 # Context retention budget: -1 keeps all tool results, or specify K to keep only the K most recent tool responses

**Context Retention Strategy:** The `keep_tool_result` parameter implements a recency-based context retention strategy. In the standard ReAct paradigm, all tool outputs are retained in the message history, which can lead to inefficient context utilization. Empirically, we observe that the model's subsequent actions depend primarily on recent observations rather than distant ones. This strategy retains only the most recent K tool responses (where K is the `keep_tool_result` value) while preserving the complete sequence of thoughts and actions.

Benefits:
- Preserves the reasoning and action trace
- Focuses the model's attention on the most contextually relevant observations
- Frees additional context space for extended reasoning and deeper tool-use trajectories
- Does not lead to performance degradation while allowing more context space for interactive scaling
**Usage:** Set `keep_tool_result: -1` to keep all tool results, or specify a positive integer K (e.g., `keep_tool_result: 5`) to keep only the K most recent tool responses.
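If the key sits at the top level of your agent config as in the YAML above, a Hydra-style command-line override along these lines may also work; the exact key path is an assumption, so verify it against your configuration file:

```bash
# Hypothetical override: assumes keep_tool_result is a top-level key of the agent config.
uv run main.py llm=qwen-3 agent=my_custom_config \
  agent.keep_tool_result=5 \
  llm.base_url=https://your_base_url/v1
```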
2. Use your custom configuration when running evaluations:
cd apps/miroflow-agent
uv run main.py llm=qwen-3 agent=my_custom_config llm.base_url=https://your_base_url/v1

3. Configure environment variables in `.env` based on the tools you use.

All available environment variables are listed in `apps/miroflow-agent/.env.example`. Copy it to `.env` and configure the variables according to your chosen configuration:

cd apps/miroflow-agent
cp .env.example .env
# Edit .env with your actual API keys

For MiroThinker v1.0 (`single_agent.yaml` or `single_agent_keep5.yaml`), see the Minimal Configuration section above for the complete configuration example. For other configurations, refer to the Pre-configured Agent Settings table above to see which environment variables are required.
Click to expand optional API keys
# API for LLM-as-Judge (for benchmark testing, required for benchmark evaluation)
OPENAI_API_KEY=your_openai_key
# API for Open-Source Audio Transcription Tool (for benchmark testing, optional)
WHISPER_MODEL_NAME="openai/whisper-large-v3-turbo"
WHISPER_API_KEY=your_whisper_key
WHISPER_BASE_URL="https://your_whisper_base_url/v1"
# API for Open-Source VQA Tool (for benchmark testing, optional)
VISION_MODEL_NAME="Qwen/Qwen2.5-VL-72B-Instruct"
VISION_API_KEY=your_vision_key
VISION_BASE_URL="https://your_vision_base_url/v1/chat/completions"
# API for Open-Source Reasoning Tool (for benchmark testing, optional)
REASONING_MODEL_NAME="Qwen/Qwen3-235B-A22B-Thinking-2507"
REASONING_API_KEY=your_reasoning_key
REASONING_BASE_URL="https://your_reasoning_base_url/v1/chat/completions"
# API for Claude Sonnet 3.7 as Commercial Tools (optional)
ANTHROPIC_API_KEY=your_anthropic_key
# API for Sougou Search (optional)
TENCENTCLOUD_SECRET_ID=your_tencent_cloud_secret_id
TENCENTCLOUD_SECRET_KEY=your_tencent_cloud_secret_key
# API for Summary LLM (optional)
SUMMARY_LLM_BASE_URL=your_summary_llm_base_url
SUMMARY_LLM_MODEL_NAME=your_summary_llm_model_name
SUMMARY_LLM_API_KEY=your_summary_llm_api_key

Use SGLang to serve MiroThinker models at port 61002:
NUM_GPUS=4
PORT=61002
# Downloading model from HF
MODEL_PATH=miromind-ai/MiroThinker-v1.0-30B
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--tp $NUM_GPUS \
--dp 1 \
--host 0.0.0.0 \
--port $PORT \
--trust-remote-code

**Server URL:** This will start a server at `http://0.0.0.0:$PORT`. Use this as your server base URL (e.g., `http://0.0.0.0:61002/v1`).
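Before pointing the agent at the server, you can optionally check that it is reachable. SGLang exposes an OpenAI-compatible API, so a simple request such as the sketch below should list the served model:

```bash
# Optional reachability check against the OpenAI-compatible endpoint.
curl -s http://0.0.0.0:61002/v1/models
```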
We also provide comprehensive guidance for serving MiroThinker models using CPU-optimized and GPU-accelerated quantization techniques, along with detailed analysis and guidelines for deployment with llama.cpp, Ollama, SGLang, and other inference frameworks.
**Complete Guide:** See the Deployment Documentation for detailed deployment instructions.
cd apps/miroflow-agent
uv run main.py llm=qwen-3 agent=single_agent llm.base_url=https://your_base_url/v1

**Tip:** For MiroThinker v1.0, use `agent=single_agent` or `agent=single_agent_keep5`. Replace `https://your_base_url/v1` with your actual model server URL.

**Note:** For MiroThinker v1.0, use the `single_agent` or `single_agent_keep5` configurations. The `multi_agent` and `multi_agent_os` configurations are for v0.1/v0.2.
Available Parameters:
You can customize the evaluation by setting the following environment variables before running the script:
| Parameter | Default | Description |
|---|---|---|
| `LLM_MODEL` | `"MiroThinker-Models"` | Model name identifier |
| `BASE_URL` | `"https://your-api.com/v1"` | Base URL of your model server |
| `NUM_RUNS` | `8` (varies by benchmark) | Number of evaluation runs |
| `LLM_PROVIDER` | `"qwen"` | LLM provider (e.g., qwen, openai, anthropic) |
| `AGENT_SET` | `"single_agent_keep5"` | Agent configuration (e.g., single_agent, single_agent_keep5, multi_agent, multi_agent_os) |
| `MAX_CONTEXT_LENGTH` | `262144` | Maximum context length (256K) |
| `MAX_CONCURRENT` | `10` | Maximum concurrent tasks |
| `PASS_AT_K` | `1` | Pass@K evaluation metric |
| `TEMPERATURE` | `1.0` | Sampling temperature |
| `API_KEY` | `"xxx"` | API key for the model server |
Example Usage:
# Navigate to the miroflow-agent directory first
cd apps/miroflow-agent
# Basic usage with required parameters
LLM_MODEL="MiroThinker-v1.0-32B" BASE_URL="https://your-api.com/v1" bash scripts/run_evaluate_multiple_runs_gaia-validation.sh
# Customize number of runs and agent configuration
LLM_MODEL="MiroThinker-v1.0-32B" \
BASE_URL="https://your-api.com/v1" \
NUM_RUNS=3 \
AGENT_SET="single_agent" \
bash scripts/run_evaluate_multiple_runs_gaia-validation.sh

Click to expand all benchmark commands
# Navigate to the miroflow-agent directory first
cd apps/miroflow-agent
# GAIA-Text-103
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_gaia-validation-text-103.sh
# WebWalkerQA
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_webwalkerqa.sh
# HLE
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_hle.sh
# HLE-Text-2158
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_hle-text-2158.sh
# HLE-Text-500
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_hle-text-500.sh
# FRAMES
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_frames.sh
# BrowseComp-EN
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_browsecomp.sh
# BrowseComp-ZH
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_browsecomp_zh.sh
# XBench-DeepSearch
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_xbench_deepsearch.sh
# FutureX
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_futurex.sh
# SEAL-0
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_seal-0.sh
# AIME2025
LLM_MODEL="xxx" BASE_URL="xxx" bash scripts/run_evaluate_multiple_runs_aime2025.shπ Click to expand progress monitoring commands
# Navigate to the miroflow-agent directory first
cd apps/miroflow-agent
# For GAIA-Validation
python benchmarks/check_progress/check_progress_gaia-validation.py /path/to/evaluation/logs
# For GAIA-Text-103
python benchmarks/check_progress/check_progress_gaia-validation-text-103.py /path/to/evaluation/logs
# For HLE
python benchmarks/check_progress/check_progress_hle.py /path/to/evaluation/logs
# For HLE-Text-2158
python benchmarks/check_progress/check_progress_hle-text-2158.py /path/to/evaluation/logs
# For HLE-Text-500
python benchmarks/check_progress/check_progress_hle-text-500.py /path/to/evaluation/logs
# For BrowseComp-EN
python benchmarks/check_progress/check_progress_browsecomp.py /path/to/evaluation/logs
# For BrowseComp-ZH
python benchmarks/check_progress/check_progress_browsecomp_zh.py /path/to/evaluation/logs
# For WebWalkerQA
python benchmarks/check_progress/check_progress_webwalkerqa.py /path/to/evaluation/logs
# For Frames
python benchmarks/check_progress/check_progress_frames.py /path/to/evaluation/logs
# For XBench-DeepSearch
python benchmarks/check_progress/check_progress_xbench_deepsearch.py /path/to/evaluation/logs
# For SEAL-0
python benchmarks/check_progress/check_progress_seal-0.py /path/to/evaluation/logs
# For AIME2025
python benchmarks/check_progress/check_progress_aime2025.py /path/to/evaluation/logs

Click to expand trace collection commands
cd apps/collect-trace
# Collect Traces for SFT
uv run bash scripts/collect_trace_claude37.sh
uv run bash scripts/collect_trace_gpt5.sh
# Collect Traces for DPO
uv run bash scripts/collect_trace_qwen3.sh

Click to expand troubleshooting guide
A: For most users, we recommend MiroThinker v1.0 with the minimal configuration:
- v1.0: Latest version with 256K context, 600 tool calls, and the best performance. Use the `single_agent` or `single_agent_keep5` config.
- v0.2: Good performance with 64K context and 50 tool calls. Use the `multi_agent` or `multi_agent_os` config.
- v0.1: Legacy version with 40K context. Use the `multi_agent` or `multi_agent_os` config.
| Version | Context | Max Tool Calls | Recommended Config | Use Case |
|---|---|---|---|---|
| v1.0 | 256K | 600 | `single_agent_keep5` | Latest, best performance, long-horizon tasks |
| v0.2 | 64K | 50 | `multi_agent_os` | Good balance, multi-agent workflows |
| v0.1 | 40K | 50 | `multi_agent_os` | Legacy support |
A: You need these keys for minimal setup:
- SERPER_API_KEY: Get from Serper.dev (Google search API)
- JINA_API_KEY: Get from Jina.ai (Web scraping)
- E2B_API_KEY: Get from E2B.dev (Code execution sandbox)
- SUMMARY_LLM_*: Your LLM API credentials (for content summarization)
- OPENAI_API_KEY: Get from OpenAI (Required for benchmark evaluation, used for LLM-as-a-Judge)
A: Common issues:
- Check base URL format: Should end with `/v1` (e.g., `https://your-api.com/v1`)
- Verify API key: Ensure `API_KEY` is set correctly in the environment or script
- Check server status: Make sure your model server is running and accessible (a quick check is sketched after this list)
- Network issues: Verify firewall/network settings allow connections
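A minimal reachability check against an OpenAI-compatible server might look like the following sketch (URL and key are placeholders):

```bash
# Placeholder URL/key; lists the models the server exposes if the connection works.
curl -s https://your-api.com/v1/models \
  -H "Authorization: Bearer $API_KEY"
```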
A: Troubleshooting steps:
- Check working directory: Make sure you're in the `apps/miroflow-agent` directory
- Verify environment: Run `uv sync` to ensure dependencies are installed
- Check .env file: Ensure all required environment variables are set
- Review logs: Check the `logs/` directory for detailed error messages
- Verify data path: Ensure the benchmark data is downloaded and in the correct location
A: Solutions:
- Reduce context length: Set `MAX_CONTEXT_LENGTH` to a smaller value (e.g., 131072 for 128K); see the example after this list
- Use context retention: Use `single_agent_keep5` instead of `single_agent` to reduce memory usage
- Reduce concurrent tasks: Set `MAX_CONCURRENT` to a smaller number (e.g., 5)
- Use a smaller model: Try the 8B or 30B models instead of 72B
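For example, a lighter-weight run might combine several of these settings (values are illustrative; the parameters come from the Available Parameters table above):

```bash
# Illustrative low-memory settings for a benchmark run.
cd apps/miroflow-agent
LLM_MODEL="MiroThinker-v1.0-8B" \
BASE_URL="https://your-api.com/v1" \
AGENT_SET="single_agent_keep5" \
MAX_CONTEXT_LENGTH=131072 \
MAX_CONCURRENT=5 \
bash scripts/run_evaluate_multiple_runs_gaia-validation.sh
```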
A: Common fixes:
- E2B errors: Verify `E2B_API_KEY` is valid and your account has credits
- Serper errors: Check `SERPER_API_KEY` and rate limits (a quick connectivity check is sketched after this list)
- Jina errors: Verify `JINA_API_KEY` and `JINA_BASE_URL` are correct
- LLM summarization errors: Check the `SUMMARY_LLM_*` variables and model availability
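As one example, Serper connectivity can be smoke-tested directly against its public API (endpoint per Serper's documentation; confirm it matches your plan):

```bash
# Quick Serper check; a JSON response indicates the key and rate limits are fine.
curl -s https://google.serper.dev/search \
  -H "X-API-KEY: $SERPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "MiroThinker"}'
```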
A: Use the progress monitoring scripts:
cd apps/miroflow-agent
python benchmarks/check_progress/check_progress_<benchmark_name>.py /path/to/logs

The scripts show completion status, elapsed time, and estimated remaining time.
A: Yes! You can replace open-source tools with commercial alternatives:
- Replace `tool-vqa-os` with `tool-vqa` (Claude)
- Replace `tool-transcribe-os` with `tool-transcribe` (OpenAI)
- Replace `tool-reasoning-os` with `tool-reasoning` (Claude)
This typically improves performance but requires additional API keys. See Pre-configured Agent Settings for details.
- Documentation: Check the MiroFlow Tools README for tool details
- Discord: Join our Discord community
- Issues: Report bugs on GitHub Issues
- Contact: Visit our website for more information
This project is licensed under the MIT License - see the LICENSE file for details.
We extend our sincere gratitude to:
- Benchmark Contributors for the comprehensive evaluation datasets
- Open Source Community for the tools and libraries that make this possible
- All Contributors who have helped make MiroThinker better
Join our community and help us build the future of AI agents!
If you find this project useful in your research, please consider citing:
@article{miromind2025mirothinker,
title={MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling},
author={MiroMind Team and Bai, Song and Bing, Lidong and Chen, Carson and Chen, Guanzheng and Chen, Yuntao and Chen, Zhe and Chen, Ziyi and Dai, Jifeng and Dong, Xuan and others},
journal={arXiv preprint arXiv:2511.11793},
year={2025}
}






