Web-CogReasoner introduces a paradigm shift from simply enhancing web agents to systematically building their cognitive abilities from the ground up. Inspired by Bloom's Taxonomy, we decompose agent capabilities into knowledge content learning (Factual, Conceptual) and cognitive processes (Procedural), enabling interpretable and goal-directed behavior. Built upon large multimodal models, Web-CogReasoner performs knowledge-driven Chain-of-Thought (CoT) reasoning across complex web tasks, where each reasoning step is transparently grounded in a specific knowledge type, ensuring both interpretability and robust generalization.
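As an illustration of what "grounded in a specific knowledge type" means in practice, here is a minimal, hypothetical sketch of a knowledge-tagged reasoning step; the class, field names, and example values are our own assumptions, not the released Web-CogReasoner interface.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Illustrative only: one knowledge-grounded CoT step.
# Names and fields are assumptions, not the released Web-CogReasoner API.
KnowledgeType = Literal["factual", "conceptual", "procedural"]

@dataclass
class ReasoningStep:
    knowledge_type: KnowledgeType       # which knowledge level grounds this step
    evidence: str                       # what the agent observed on the page
    thought: str                        # the natural-language inference drawn from it
    action: Optional[str] = None        # a concrete web action, emitted on procedural steps

# Example trace for a task like "add the cheapest laptop to the cart":
trace = [
    ReasoningStep("factual", "A 'Sort by price' control is visible above the results",
                  "The results page supports sorting by price."),
    ReasoningStep("conceptual", "Results are now sorted in ascending order",
                  "The first item in an ascending price sort is the cheapest laptop."),
    ReasoningStep("procedural", "The first result card exposes an 'Add to cart' button",
                  "Click 'Add to cart' on the first result.",
                  action="CLICK(element_id=17)"),
]
```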
To support this, we introduce:
- Web-CogKnowledge Framework: A Bloom's Taxonomy-inspired two-stage training paradigm (Knowledge Content Learning → Cognitive Reasoning) for enhancing web agents' cognitive abilities (a minimal training sketch follows this list).
- Web-CogReasoner: A knowledge-driven multimodal agent trained via imitation learning on our Web-CogDataset.
- Web-CogDataset: A curriculum-style dataset with 12 fine-grained tasks across 3 knowledge levels (Factual, Conceptual, Procedural), enabling stepwise skill acquisition.
- Web-CogBench: A dedicated benchmark for evaluating whether a web agent possesses the requisite prior knowledge and cognitive capabilities for effective web navigation.
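To give a concrete picture of the two-stage paradigm above, the following is a minimal, hypothetical sketch of a curriculum-style supervised fine-tuning loop; the stage names, dataset splits, and the `load_split`/`sft_update` helpers are illustrative assumptions rather than our released training code.

```python
# Hypothetical curriculum sketch of the two-stage paradigm:
# Stage 1 learns knowledge content (factual, conceptual); Stage 2 learns
# cognitive/procedural reasoning via imitation on demonstrated trajectories.
# `load_split` and `sft_update` are placeholder callables, not a released API.

CURRICULUM = [
    ("knowledge_content", ["factual", "conceptual"]),   # Stage 1
    ("cognitive_reasoning", ["procedural"]),            # Stage 2
]

def train(model, load_split, sft_update, epochs_per_stage=1):
    for stage_name, knowledge_levels in CURRICULUM:
        examples = [ex for level in knowledge_levels for ex in load_split(level)]
        for _ in range(epochs_per_stage):
            for ex in examples:
                # Standard supervised (next-token) imitation loss on the
                # demonstrated reasoning trace and action, given the observation.
                sft_update(model, prompt=ex["observation"], target=ex["trace"])
        print(f"finished {stage_name}: {len(examples)} examples")
```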
Last Updated: 2025-08-05 13:08 UTC+8
- Paper: Release the full research paper on arXiv.
- Code: Open-source the complete code for training and inference.
- Model: Publish the official Web-CogReasoner model weights.
- Dataset: Make the Web-CogDataset publicly available for community research.
- Benchmark: Launch a public online evaluation server for Web-CogBench to ensure fair comparisons.
[2025-08-05] Released the full research paper on arXiv.
Cognitive & Visual Benchmarks
This comparison highlights our model's strength in reasoning, a crucial capability that visual-centric models may lack.
| Model | Web-CogBench (Cognition) | VisualWebBench (Vision) |
|---|---|---|
| Proprietary Models | ||
| Claude Sonnet 4 | 76.8% | 85.9% |
| Gemini 2.5 Pro | 80.2% | 86.6% |
| Open-Source Models | ||
| Qwen2.5-VL-7B | 69.8% | 76.0% |
| UI-TARS-7B-SFT | 46.4% | 86.0% |
| Web-CogReasoner (Ours) | 82.9% | 86.3% |
Key Insight: While some models like UI-TARS excel at visual tasks (VisualWebBench: 86.0%), they struggle with reasoning-intensive tasks (Web-CogBench: 46.4%). This highlights that strong visual perception does not guarantee advanced cognitive capabilities, a gap our work aims to fill.
Online Web Task
This section evaluates the models' ability to perform complex, multi-step tasks in live web environments.
| Model | WebVoyager (Generalization) | Mind2Web (Cross-task) | Mind2Web (Cross-web) |
|---|---|---|---|
| Proprietary Models | |||
| Claude Sonnet 4 | 47.7% | 40.2% | 21.7% |
| Gemini 2.5 Pro | 54.9% | 37.5% | 25.5% |
| Open-Source Models | |||
| Qwen2.5-VL-7B | 2.2% | 1.0% | 1.0% |
| OpenWebVoyagerIL | 18.1% | 6.3% | 6.6% |
| Web-CogReasoner (Ours) | 30.2% | 17.0% | 10.1% |
Coming soon
@article{guo2025web,
title={Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents},
author={Guo, Yuhan and Guo, Cong and Sun, Aiwen and He, Hongliang and Yang, Xinyu and Lu, Yue and Zhang, Yingji and Guo, Xuntao and Zhang, Dong and Liu, Jianzhuang and others},
journal={arXiv preprint arXiv:2508.01858},
year={2025}
}

