
Figure: A visual taxonomy of AI agent evolution and optimisation techniques, categorised into three major directions: single-agent optimisation, multi-agent optimisation, and domain-specific optimisation. The tree structure illustrates the development of these approaches from 2023 to 2025, including representative methods within each branch.
- (Arxiv'25) EvoAgentX: An Automated Framework for Evolving Agentic Workflows [📄 Paper] [💻 Code]
- (ICLR'24) ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [📄 Paper] [💻 Code]
- (NeurIPS'22) STaR: Bootstrapping Reasoning with Reasoning [📄 Paper] [💻 Code]
- (Arxiv'24) NExT: Teaching Large Language Models to Reason about Code Execution [📄 Paper]
- (EMNLP'24) MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning [📄 Paper]
- (ICML'24) Self-Rewarding Language Models [📄 Paper] [💻 Code]
- (Arxiv'24) Tulu 3: Pushing Frontiers in Open Language Model Post-Training [📄 Paper] [💻 Code]
- (EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📄 Paper] [💻 Code]
- (Arxiv'24) Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [📄 Paper]
- (Arxiv'24) DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [📄 Paper]
- (ICML'25) Diving into Self-Evolving Training for Multimodal Reasoning [📄 Paper] [💻 Code]
- (Arxiv'25) Absolute Zero: Reinforced Self-play Reasoning with Zero Data [📄 Paper]
- (Arxiv'25) R-Zero: Self-Evolving Reasoning LLM from Zero Data [📄 Paper] [💻 Code]
- (Arxiv'25) SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [📄 Paper] [💻 Code]
- (ICLR'23) CodeT: Code Generation with Generated Tests [📄 Paper] [💻 Code]
- (ICML'23) LEVER: Learning to Verify Language-to-Code Generation with Execution [📄 Paper] [💻 Code]
- (ESEC/FSE'23) Baldur: Whole-Proof Generation and Repair with Large Language Models [📄 Paper]
- (ACL'24) Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [📄 Paper]
- (EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📄 Paper] [💻 Code]
- (Arxiv'24) Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs [📄 Paper]
- (ICLR'25) Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [📄 Paper]
- (Arxiv'25) Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [📄 Paper] [💻 Code]
- (ICLR'23) Self-Consistency Improves Chain of Thought Reasoning in Language Models [📄 Paper]
- (ACL'23) Solving Math Word Problems via Cooperative Reasoning Induced Language Models [📄 Paper] [💻 Code]
- (NeurIPS'23) Tree of Thoughts: Deliberate Problem Solving with Large Language Models [📄 Paper] [💻 Code]
- (NeurIPS'24) Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models [📄 Paper] [💻 Code]
- (COLM'24) Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning [📄 Paper] [💻 Code]
- (AAAI'24) Graph of Thoughts: Solving Elaborate Problems with Large Language Models [📄 Paper] [💻 Code]
- (ICML'25) Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [📄 Paper] [💻 Code]
- (EMNLP'25) START: Self-taught Reasoner with Tools [📄 Paper]
- (Arxiv'25) CoRT: Code-integrated Reasoning within Thinking [📄 Paper] [💻 Code]
- (EMNLP'22) GPS: Genetic Prompt Search for Efficient Few-shot Learning [📄 Paper] [💻 Code]
- (EACL'23) GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models [📄 Paper] [💻 Code]
- (ICLR'23) TEMPERA: Test-Time Prompting via Reinforcement Learning [📄 Paper] [💻 Code]
- (ACL'24) Plum: Prompt Learning using Metaheuristic [📄 Paper] [💻 Code]
- (ICLR'24) EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers [📄 Paper] [💻 Code]
- (ICML'24) Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution [📄 Paper]
- (ICLR'23) Large Language Models Are Human-Level Prompt Engineers [📄 Paper] [💻 Code]
- (ICLR'24) PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization [📄 Paper] [💻 Code]
- (ICLR'24) Large Language Models as Optimizers [📄 Paper] [💻 Code]
- (ICLR'24) Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization [📄 Paper] [💻 Code]
- (EMNLP'24) Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs [📄 Paper] [💻 Code]
- (Arxiv'24) Prompt Optimization with Human Feedback [📄 Paper] [💻 Code]
- (Arxiv'24) StraGo: Harnessing Strategic Guidance for Prompt Optimization [📄 Paper]
- (Arxiv'25) Self-Supervised Prompt Optimization [📄 Paper]
- (EMNLP'23) Automatic Prompt Optimization with "Gradient Descent" and Beam Search [📄 Paper] [💻 Code]
- (Arxiv'24) TextGrad: Automatic "Differentiation" via Text [📄 Paper] [💻 Code]
- (Arxiv'24) How to Correctly do Semantic Backpropagation on Language-based Agentic Systems [📄 Paper] [💻 Code]
- (Arxiv'24) GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering [📄 Paper]
- (AAAI'25) Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [📄 Paper] [💻 Code]
- (ICML'25) REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization [📄 Paper] [💻 Code]
- (ICML'24) A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts [📄 Paper]
- (ICML'24) Agent Workflow Memory [📄 Paper]
- (AAAI'24) MemoryBank: Enhancing Large Language Models with Long-Term Memory [📄 Paper]
- (EMNLP'24) GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models [📄 Paper]
- (Arxiv'24) "My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents [📄 Paper]
- (ICLR'25) Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [📄 Paper]
- (ICLR'25) Boosting Knowledge Intensive Reasoning of LLMs via Inference-Time Hybrid Information [📄 Paper] [💻 Code]
- (ACL'25) Improving Factuality with Explicit Working Memory [📄 Paper]
- (Arxiv'25) A-MEM: Agentic Memory for LLM Agents [📄 Paper]
- (Arxiv'25) Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory [📄 Paper]
- (Arxiv'25) Memento: Fine-tuning LLM Agents without Fine-tuning LLMs [📄 Paper] [💻 Code]
- (Arxiv'25) Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning [📄 Paper]
- (Arxiv'25) Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [📄 Paper] [💻 Code]
- (NeurIPS'23) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [📄 Paper] [💻 Code]
- (ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📄 Paper] [💻 Code]
- (ACL'24) LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [📄 Paper] [💻 Code]
- (AAAI'24) Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum [📄 Paper] [💻 Code]
- (ICLR'25) Learning Evolving Tools for Large Language Models [📄 Paper] [💻 Code]
- (ICLR'25) Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning [📄 Paper] [💻 Code]
- (ICLR'25) Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [📄 Paper] [💻 Code]
- (Arxiv'25) Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [📄 Paper]
- (Arxiv'25) ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [📄 Paper] [💻 Code]
- (Arxiv'25) ToolRL: Reward is All Tool Learning Needs [📄 Paper] [💻 Code]
- (Arxiv'25) Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [📄 Paper] [💻 Code]
- (Arxiv'25) Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use [📄 Paper]
- (Arxiv'25) Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [📄 Paper] [💻 Code]
- (Arxiv'25) Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [📄 Paper] [💻 Code]
- (Arxiv'25) Agentic Reinforced Policy Optimization [📄 Paper] [💻 Code]
- (NAACL'25) EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [📄 Paper] [💻 Code]
- (ICLR'25) From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [📄 Paper] [💻 Code]
- (ACL'25) Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play [📄 Paper] [💻 Code]
- (ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📄 Paper] [💻 Code]
- (ICLR'24) ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [📄 Paper]
- (ICLR'25) Tool-Planner: Task Planning with Clusters across Multiple Tools [📄 Paper] [💻 Code]
- (Arxiv'25) MCP-Zero: Active Tool Discovery for Autonomous LLM Agents [📄 Paper] [💻 Code]
- (EMNLP'23) CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [📄 Paper] [💻 Code]
- (ICML'24) Offline Training of Language Model Agents with Functions as Learnable Weights [📄 Paper]
- (CVPR'24) CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [📄 Paper] [💻 Code]
- (Arxiv'25) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution [📄 Paper] [💻 Code]
- (Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📄 Paper] [💻 Code]
- (ICML'25) Multi-Agent Architecture Search via Agentic Supernet [📄 Paper] [💻 Code]
- (ICML'25) MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning Enhances Formal Theorem Proving [📄 Paper]
- (ICLR'25) AFlow: Automating Agentic Workflow Generation [📄 Paper] [💻 Code]
- (ICLR'25) WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [📄 Paper]
- (ICLR'25) Flow: Modularized Agentic Workflow Automation [📄 Paper]
- (ICLR'25) Automated Design of Agentic Systems [📄 Paper] [💻 Code]
- (Arxiv'25) FlowReasoner: Reinforcing Query-Level Meta-Agents [📄 Paper]
- (Arxiv'25) AgentNet: Decentralized Evolutionary Coordination for LLM-Based Multi-Agent Systems [📄 Paper]
- (Arxiv'25) MAS-GPT: Training LLMs to Build LLM-Based Multi-Agent Systems [📄 Paper]
- (Arxiv'25) FlowAgent: Achieving Compliance and Flexibility for Workflow Agents [📄 Paper]
- (Arxiv'25) ScoreFlow: Mastering LLM Agent Workflows via Score-Based Preference Optimization [📄 Paper] [💻 Code]
- (Arxiv'25) Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies [📄 Paper]
- (Arxiv'25) MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [📄 Paper]
- (Arxiv'25) MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming [📄 Paper]
- (ICML'24) GPTSwarm: Language Agents as Optimizable Graphs [📄 Paper] [💻 Code]
- (ICLR'24) DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines [📄 Paper] [💻 Code]
- (ICLR'24) AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [📄 Paper] [💻 Code]
- (ICLR'24) MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework [📄 Paper] [💻 Code]
- (COLM'24) A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration [📄 Paper]
- (COLM'24) AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations [📄 Paper] [💻 Code]
- (Arxiv'24) G-Designer: Architecting Multi-Agent Communication Topologies via Graph Neural Networks [📄 Paper]
- (Arxiv'24) AutoFlow: Automated Workflow Generation for Large Language Model Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Symbolic Learning Enables Self-Evolving Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Adaptive In-Conversation Team Building for Language Model Agents [📄 Paper]
- (Arxiv'25) Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL [📄 Paper] [💻 Code]
- (Arxiv'25) Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving [📄 Paper] [💻 Code]
- (EMNLP'24) MMedAgent: Learning to Use Medical Tools with Multi-modal Agent [📄 Paper] [💻 Code]
- (NeurIPS'24) MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making [📄 Paper] [💻 Code]
- (Arxiv'25) HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research [📄 Paper] [💻 Code]
- (Arxiv'25) STELLA: Self-Evolving LLM Agent for Biomedical Research [📄 Paper] [💻 Code]
- (MICCAI'25) MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions [📄 Paper] [💻 Code]
- (Arxiv'25) PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology [📄 Paper]
- (Arxiv'25) MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation [📄 Paper] [💻 Code]
- (Arxiv'25) MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow [📄 Paper] [💻 Code]
- (ICLR'24) CACTUS: Chemistry Agent Connecting Tool-Usage to Science [📄 Paper] [💻 Code]
- (NMI'24) ChemCrow: Augmenting Large Language Models with Chemistry Tools [📄 Paper] [💻 Code]
- (ICLR'25) ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning [📄 Paper] [💻 Code]
- (ICLR'25) OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents [📄 Paper]
- (Arxiv'25) DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration [📄 Paper]
- (Arxiv'25) LIDDIA: Language-based Intelligent Drug Discovery Agent [📄 Paper]
- (Arxiv'23) AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation [📄 Paper] [💻 Code]
- (Arxiv'23) Self-Refine: Iterative Refinement with Self-Feedback [📄 Paper] [💻 Code]
- (EMNLP'24) CodeAgent: Autonomous Communicative Agents for Code Review [📄 Paper] [💻 Code]
- (ICLR'25) OpenHands: An Open Platform for AI Software Developers as Generalist Agents [📄 Paper] [💻 Code]
- (Arxiv'25) CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation [📄 Paper]
- (Arxiv'25) AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery [📄 Paper] [💻 Code]
- (Arxiv'25) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [📄 Paper] [💻 Code]
- (ACL'23) Self-Edit: Fault-Aware Code Editor for Code Generation [📄 Paper]
- (ICLR'24) Teaching Large Language Models to Self-Debug [📄 Paper]
- (ICA'24) RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance [📄 Paper]
- (Arxiv'25) Large Language Model Guided Self-Debugging Code Generation [📄 Paper]
- (Arxiv'25) PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration [📄 Paper] [💻 Code]
- (AAAI'24) FinRobot: An Open-Source AI Agent Platform for Financial Applications Using Large Language Models [📄 Paper] [💻 Code]
- (Arxiv'24) PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods [📄 Paper] [💻 Code]
- (NeurIPS'25) FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making [📄 Paper] [💻 Code]
- (Arxiv'24) LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation [📄 Paper]
- (ICIC'24) LegalGPT: Legal Chain of Thought for the Legal Large Language Model Multi-Agent Framework [📄 Paper] [💻 Code]
- (Arxiv'24) LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model [📄 Paper] [💻 Code]
- (ACL Findings'25) AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents [📄 Paper] [💻 Code]
- (Arxiv'25) Agents of Change: Self-Evolving LLM Agents for Strategic Planning [📄 Paper]
- (Arxiv'25) EarthLink: A Self-Evolving AI Agent for Climate Science [📄 Paper] [🖥️ System]
- (Arxiv'25) SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [📄 Paper] [💻 Code]
- (NeurIPS'23) OpenAGI: When LLM Meets Domain Experts [📄 Paper] [💻 Code]
- (Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📄 Paper]
- (Arxiv'25) MLGym: A New Framework and Benchmark for Advancing AI Research Agents [📄 Paper] [💻 Code]
- (Arxiv'23) On the Tool Manipulation Capability of Open-source Large Language Models [📄 Paper] [💻 Code]
- (EMNLP'23) API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [📄 Paper] [💻 Code]
- (NeurIPS'23) ToolQA: A Dataset for LLM Question Answering with External Tools [📄 Paper] [💻 Code]
- (ICLR'24) MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [📄 Paper] [💻 Code]
- (ICLR'24) WebArena: A Realistic Web Environment for Building Autonomous Agents [📄 Paper] [💻 Code]
- (Arxiv'25) BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [📄 Paper] [💻 Code]
- (ACL'25) WebWalker: Benchmarking LLMs in Web Traversal [📄 Paper] [💻 Code]
- (ICLR'24) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [📄 Paper] [💻 Code]
- (Arxiv'25) DataSciBench: An LLM Agent Benchmark for Data Science [📄 Paper] [💻 Code]
- (ICLR'23) GAIA: A Benchmark for General AI Assistants [📄 Paper] [💻 Code]
- (ICLR'24) AgentBench: Evaluating LLMs as Agents [📄 Paper] [💻 Code]
- (Arxiv'25) MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents [📄 Paper] [💻 Code]
- (Arxiv'25) Benchmarking LLMs' Swarm Intelligence [📄 Paper] [💻 Code]
- (ACL'24) Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [📄 Paper] [💻 Code]
- (NeurIPS'24) OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [📄 Paper] [💻 Code]
- (ICLR'25) AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Towards Better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [📄 Paper]
- (Arxiv'24) LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [📄 Paper]
- (Arxiv'25) LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [📄 Paper] [💻 Code]
- (ACL'25) Auto-Arena: Automating LLM Evaluations with Agent Peer Debate and Committee Voting [📄 Paper] [💻 Code]
- (Arxiv'25) MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [📄 Paper]
- (Arxiv'24) Agent-as-a-Judge: Evaluate Agents with Agents [📄 Paper] [💻 Code]
- (Arxiv'24) AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [📄 Paper]
- (NeurIPS'24 Datasets & Benchmarks) RedCode: Risky Code Execution and Generation [📄 Paper]
- (Arxiv'24) MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [📄 Paper] [💻 Code]
- (Arxiv'23) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [📄 Paper]
- (Arxiv'24) R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [📄 Paper] [💻 Code]
- (ACL'25) SafeLawBench: Towards Safe Alignment of Large Language Models [📄 Paper]
If you find this survey useful in your research and applications, please cite using this BibTeX:
@article{fang2025comprehensive,
title={A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems},
author={Fang, Jinyuan and Peng, Yanwen and Zhang, Xi and Wang, Yingxu and Yi, Xinhao and Zhang, Guibin and Xu, Yi and Wu, Bin and Liu, Siwei and Li, Zihao and others},
journal={arXiv preprint arXiv:2508.07407},
year={2025}
}
We would like to thank Shuyu Guo for his valuable contributions to the early-stage exploration and literature review on agent optimisation.
If you have any questions or suggestions, please feel free to contact us via:
Email: j.fang.2@research.gla.ac.uk and Zaiqiao.Meng@glasgow.ac.uk