
Figure: A visual taxonomy of AI agent evolution and optimisation techniques, categorised into three major directions: single-agent optimisation, multi-agent optimisation, and domain-specific optimisation. The tree structure illustrates the development of these approaches from 2023 to 2025, including representative methods within each branch.
- (Arxiv'25) EvoAgentX: An Automated Framework for Evolving Agentic Workflows [📄 Paper] [💻 Code]
- (ICLR'24) ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [📄 Paper] [💻 Code]
- (NeurIPS'22) STaR: Bootstrapping Reasoning with Reasoning [📄 Paper] [💻 Code]
- (Arxiv'24) NExT: Teaching Large Language Models to Reason about Code Execution [📄 Paper]
- (EMNLP'24) MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning [📄 Paper]
- (ICML'24) Self-Rewarding Language Models [📄 Paper] [💻 Code]
- (Arxiv'24) Tulu 3: Pushing Frontiers in Open Language Model Post-Training [📄 Paper] [💻 Code]
- (EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📄 Paper] [💻 Code]
- (Arxiv'24) Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [📄 Paper]
- (Arxiv'24) DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [📄 Paper]
- (ICML'25) Diving into Self-Evolving Training for Multimodal Reasoning [📄 Paper] [💻 Code]
- (Arxiv'25) Absolute Zero: Reinforced Self-play Reasoning with Zero Data [📄 Paper]
- (Arxiv'25) R-Zero: Self-Evolving Reasoning LLM from Zero Data [📄 Paper] [💻 Code]
- (Arxiv'25) SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [📄 Paper] [💻 Code]
- (ICLR'23) CodeT: Code Generation with Generated Tests [📄 Paper] [💻 Code]
- (ICML'23) LEVER: Learning to Verify Language-to-Code Generation with Execution [📄 Paper] [💻 Code]
- (ESEC/FSE'23) Baldur: Whole-Proof Generation and Repair with Large Language Models [📄 Paper]
- (ACL'24) Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [📄 Paper]
- (EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📄 Paper] [💻 Code]
- (Arxiv'24) Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs [📄 Paper]
- (ICLR'25) Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [📄 Paper]
- (Arxiv'25) Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [📄 Paper] [💻 Code]
- (ICLR'23) Self-Consistency Improves Chain of Thought Reasoning in Language Models [📄 Paper]
- (ACL'23) Solving Math Word Problems via Cooperative Reasoning Induced Language Models [📄 Paper] [💻 Code]
- (NeurIPS'23) Tree of Thoughts: Deliberate Problem Solving with Large Language Models [📄 Paper] [💻 Code]
- (NeurIPS'24) Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models [📄 Paper] [💻 Code]
- (COLM'24) Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning [📄 Paper] [💻 Code]
- (AAAI'24) Graph of Thoughts: Solving Elaborate Problems with Large Language Models [📄 Paper] [💻 Code]
- (ICML'25) Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [📄 Paper] [💻 Code]
- (EMNLP'25) START: Self-taught Reasoner with Tools [📄 Paper]
- (Arxiv'25) CoRT: Code-integrated Reasoning within Thinking [📄 Paper] [💻 Code]
- (EMNLP'22) GPS: Genetic Prompt Search for Efficient Few-shot Learning [📄 Paper] [💻 Code]
- (EACL'23) GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models [📄 Paper] [💻 Code]
- (ICLR'23) TEMPERA: Test-Time Prompting via Reinforcement Learning [📄 Paper] [💻 Code]
- (ACL'24) Plum: Prompt Learning using Metaheuristic [📄 Paper] [💻 Code]
- (ICLR'24) EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers [📄 Paper] [💻 Code]
- (ICML'24) Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution [📄 Paper]
- (ICLR'23) Large Language Models Are Human-Level Prompt Engineers [📄 Paper] [💻 Code]
- (ICLR'24) PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization [📄 Paper] [💻 Code]
- (ICLR'24) Large Language Models as Optimizers [📄 Paper] [💻 Code]
- (ICLR'24) Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization [📄 Paper] [💻 Code]
- (EMNLP'24) Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs [📄 Paper] [💻 Code]
- (Arxiv'24) Prompt Optimization with Human Feedback [📄 Paper] [💻 Code]
- (Arxiv'24) StraGo: Harnessing Strategic Guidance for Prompt Optimization [📄 Paper]
- (Arxiv'25) Self-Supervised Prompt Optimization [📄 Paper]
- (EMNLP'23) Automatic Prompt Optimization with "Gradient Descent" and Beam Search [📄 Paper] [💻 Code]
- (Arxiv'24) TextGrad: Automatic "Differentiation" via Text [📄 Paper] [💻 Code]
- (Arxiv'24) How to Correctly do Semantic Backpropagation on Language-based Agentic Systems [📄 Paper] [💻 Code]
- (Arxiv'24) GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering [📄 Paper]
- (AAAI'25) Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [📄 Paper] [💻 Code]
- (ICML'25) REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization [📄 Paper] [💻 Code]
- (ICML'24) A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts [📄 Paper]
- (ICML'24) Agent Workflow Memory [📄 Paper]
- (AAAI'24) MemoryBank: Enhancing Large Language Models with Long-Term Memory [📄 Paper]
- (EMNLP'24) GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models [📄 Paper]
- (Arxiv'24) "My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents [📄 Paper]
- (ICLR'25) Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [📄 Paper]
- (ICLR'25) Boosting Knowledge Intensive Reasoning of LLMs via Inference-Time Hybrid Information [📄 Paper] [💻 Code]
- (ACL'25) Improving Factuality with Explicit Working Memory [📄 Paper]
- (Arxiv'25) A-MEM: Agentic Memory for LLM Agents [📄 Paper]
- (Arxiv'25) Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory [📄 Paper]
- (Arxiv'25) Memento: Fine-tuning LLM Agents without Fine-tuning LLMs [📄 Paper] [💻 Code]
- (Arxiv'25) Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning [📄 Paper]
- (Arxiv'25) Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [📄 Paper] [💻 Code]
- (NeurIPS'23) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [📄 Paper] [💻 Code]
- (ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📄 Paper] [💻 Code]
- (ACL'24) LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [📄 Paper] [💻 Code]
- (AAAI'24) Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum [📄 Paper] [💻 Code]
- (ICLR'25) Learning Evolving Tools for Large Language Models [📄 Paper] [💻 Code]
- (ICLR'25) Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning [📄 Paper] [💻 Code]
- (ICLR'25) Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [📄 Paper] [💻 Code]
- (Arxiv'25) Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [📄 Paper]
- (Arxiv'25) ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [📄 Paper] [💻 Code]
- (Arxiv'25) ToolRL: Reward is All Tool Learning Needs [📄 Paper] [💻 Code]
- (Arxiv'25) Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [📄 Paper] [💻 Code]
- (Arxiv'25) Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use [📄 Paper]
- (Arxiv'25) Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [📄 Paper] [💻 Code]
- (Arxiv'25) Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [📄 Paper] [💻 Code]
- (Arxiv'25) Agentic Reinforced Policy Optimization [📄 Paper] [💻 Code]
- (NAACL'25) EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [📄 Paper] [💻 Code]
- (ICLR'25) From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [📄 Paper] [💻 Code]
- (ACL'25) Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play [📄 Paper] [💻 Code]
- (ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📄 Paper] [💻 Code]
- (ICLR'24) ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [📄 Paper]
- (ICLR'25) Tool-Planner: Task Planning with Clusters across Multiple Tools [📄 Paper] [💻 Code]
- (Arxiv'25) MCP-Zero: Active Tool Discovery for Autonomous LLM Agents [📄 Paper] [💻 Code]
- (EMNLP'23) CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [📄 Paper] [💻 Code]
- (ICML'24) Offline Training of Language Model Agents with Functions as Learnable Weights [📄 Paper]
- (CVPR'24) CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [📄 Paper] [💻 Code]
- (Arxiv'25) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution [📄 Paper] [💻 Code]
- (Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📄 Paper] [💻 Code]
- (ICML'25) Multi-Agent Architecture Search via Agentic Supernet [📄 Paper] [💻 Code]
- (ICML'25) MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning Enhances Formal Theorem Proving [📄 Paper]
- (ICLR'25) AFlow: Automating Agentic Workflow Generation [📄 Paper] [💻 Code]
- (ICLR'25) WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [📄 Paper]
- (ICLR'25) Flow: Modularized Agentic Workflow Automation [📄 Paper]
- (ICLR'25) Automated Design of Agentic Systems [📄 Paper] [💻 Code]
- (Arxiv'25) FlowReasoner: Reinforcing Query-Level Meta-Agents [📄 Paper]
- (Arxiv'25) AgentNet: Decentralized Evolutionary Coordination for LLM-Based Multi-Agent Systems [📄 Paper]
- (Arxiv'25) MAS-GPT: Training LLMs to Build LLM-Based Multi-Agent Systems [📄 Paper]
- (Arxiv'25) FlowAgent: Achieving Compliance and Flexibility for Workflow Agents [📄 Paper]
- (Arxiv'25) ScoreFlow: Mastering LLM Agent Workflows via Score-Based Preference Optimization [📄 Paper] [💻 Code]
- (Arxiv'25) Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies [📄 Paper]
- (Arxiv'25) MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [📄 Paper]
- (Arxiv'25) MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming [📄 Paper]
- (ICML'24) GPTSwarm: Language Agents as Optimizable Graphs [📄 Paper] [💻 Code]
- (ICLR'24) DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines [📄 Paper] [💻 Code]
- (ICLR'24) AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [📄 Paper] [💻 Code]
- (ICLR'24) MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework [📄 Paper] [💻 Code]
- (COLM'24) A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration [📄 Paper]
- (COLM'24) AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations [📄 Paper] [💻 Code]
- (Arxiv'24) G-Designer: Architecting Multi-Agent Communication Topologies via Graph Neural Networks [📄 Paper]
- (Arxiv'24) AutoFlow: Automated Workflow Generation for Large Language Model Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Symbolic Learning Enables Self-Evolving Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Adaptive In-Conversation Team Building for Language Model Agents [📄 Paper]
- (Arxiv'25) Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL [📄 Paper] [💻 Code]
- (Arxiv'25) Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving [📄 Paper] [💻 Code]
- (EMNLP'24) MMedAgent: Learning to Use Medical Tools with Multi-modal Agent [📄 Paper] [💻 Code]
- (NeurIPS'24) MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making [📄 Paper] [💻 Code]
- (Arxiv'25) HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research [📄 Paper] [💻 Code]
- (Arxiv'25) STELLA: Self-Evolving LLM Agent for Biomedical Research [📄 Paper] [💻 Code]
- (MICCAI'25) MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions [📄 Paper] [💻 Code]
- (Arxiv'25) PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology [📄 Paper]
- (Arxiv'25) MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation [📄 Paper] [💻 Code]
- (Arxiv'25) MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow [📄 Paper] [💻 Code]
- (ICLR'24) CACTUS: Chemistry Agent Connecting Tool-Usage to Science [📄 Paper] [💻 Code]
- (NMI'24) ChemCrow: Augmenting Large Language Models with Chemistry Tools [📄 Paper] [💻 Code]
- (ICLR'25) ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning [📄 Paper] [💻 Code]
- (ICLR'25) OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents [📄 Paper]
- (Arxiv'25) DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration [📄 Paper]
- (Arxiv'25) LIDDIA: Language-based Intelligent Drug Discovery Agent [📄 Paper]
- (Arxiv'23) AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation [📄 Paper] [💻 Code]
- (Arxiv'23) Self-Refine: Iterative Refinement with Self-Feedback [📄 Paper] [💻 Code]
- (EMNLP'24) CodeAgent: Autonomous Communicative Agents for Code Review [📄 Paper] [💻 Code]
- (ICLR'25) OpenHands: An Open Platform for AI Software Developers as Generalist Agents [📄 Paper] [💻 Code]
- (Arxiv'25) CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation [📄 Paper]
- (Arxiv'25) AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery [📄 Paper] [💻 Code]
- (Arxiv'25) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [📄 Paper] [💻 Code]
- (ACL'23) Self-Edit: Fault-Aware Code Editor for Code Generation [📄 Paper]
- (ICLR'24) Teaching Large Language Models to Self-Debug [📄 Paper]
- (ICA'24) RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance [📄 Paper]
- (Arxiv'25) Large Language Model Guided Self-Debugging Code Generation [📄 Paper]
- (Arxiv'25) PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration [📄 Paper] [💻 Code]
- (AAAI'24) FinRobot: An Open-Source AI Agent Platform for Financial Applications Using Large Language Models [📄 Paper] [💻 Code]
- (Arxiv'24) PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods [📄 Paper] [💻 Code]
- (NeurIPS'25) FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making [📄 Paper] [💻 Code]
- (Arxiv'24) LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation [📄 Paper]
- (ICIC'24) LegalGPT: Legal Chain of Thought for the Legal Large Language Model Multi-Agent Framework [📄 Paper] [💻 Code]
- (Arxiv'24) LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model [📄 Paper] [💻 Code]
- (ACL Findings'25) AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents [📄 Paper] [💻 Code]
- (Arxiv'25) Agents of Change: Self-Evolving LLM Agents for Strategic Planning [📄 Paper]
- (Arxiv'25) EarthLink: A Self-Evolving AI Agent for Climate Science [📄 Paper] [🖥️ System]
- (Arxiv'25) SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [📄 Paper] [💻 Code]
- (NeurIPS'23) OpenAGI: When LLM Meets Domain Experts [📄 Paper] [💻 Code]
- (Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📄 Paper]
- (Arxiv'25) MLGym: A New Framework and Benchmark for Advancing AI Research Agents [📄 Paper] [💻 Code]
- (Arxiv'23) On the Tool Manipulation Capability of Open-source Large Language Models [📄 Paper] [💻 Code]
- (EMNLP'23) API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [📄 Paper] [💻 Code]
- (NeurIPS'23) ToolQA: A Dataset for LLM Question Answering with External Tools [📄 Paper] [💻 Code]
- (ICLR'24) MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [📄 Paper] [💻 Code]
- (ICLR'24) WebArena: A Realistic Web Environment for Building Autonomous Agents [📄 Paper] [💻 Code]
- (Arxiv'25) BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [📄 Paper] [💻 Code]
- (ACL'25) WebWalker: Benchmarking LLMs in Web Traversal [📄 Paper] [💻 Code]
- (ICLR'24) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [📄 Paper] [💻 Code]
- (Arxiv'25) DataSciBench: An LLM Agent Benchmark for Data Science [📄 Paper] [💻 Code]
- (ICLR'23) GAIA: A Benchmark for General AI Assistants [📄 Paper] [💻 Code]
- (ICLR'24) AgentBench: Evaluating LLMs as Agents [📄 Paper] [💻 Code]
- (Arxiv'25) MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents [📄 Paper] [💻 Code]
- (Arxiv'25) Benchmarking LLMs' Swarm Intelligence [📄 Paper] [💻 Code]
- (ACL'24) Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [📄 Paper] [💻 Code]
- (NeurIPS'24) OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [📄 Paper] [💻 Code]
- (ICLR'25) AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [📄 Paper] [💻 Code]
- (Arxiv'24) Towards Better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [📄 Paper]
- (Arxiv'24) LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [📄 Paper]
- (Arxiv'25) LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [📄 Paper] [💻 Code]
- (ACL'25) Auto-Arena: Automating LLM Evaluations with Agent Peer Debate and Committee Voting [📄 Paper] [💻 Code]
- (Arxiv'25) MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [📄 Paper]
- (Arxiv'24) Agent-as-a-Judge: Evaluate Agents with Agents [📄 Paper] [💻 Code]
- (Arxiv'24) AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [📄 Paper]
- (NeurIPS'24 Datasets & Benchmarks) RedCode: Risky Code Execution and Generation [📄 Paper]
- (Arxiv'24) MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [📄 Paper] [💻 Code]
- (Arxiv'23) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [📄 Paper]
- (Arxiv'24) R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [📄 Paper] [💻 Code]
- (ACL'25) SafeLawBench: Towards Safe Alignment of Large Language Models [📄 Paper]
If you find this survey useful in your research and applications, please cite using this BibTeX:
@article{fang2025comprehensive,
title={A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems},
author={Fang, Jinyuan and Peng, Yanwen and Zhang, Xi and Wang, Yingxu and Yi, Xinhao and Zhang, Guibin and Xu, Yi and Wu, Bin and Liu, Siwei and Li, Zihao and others},
journal={arXiv preprint arXiv:2508.07407},
year={2025}
}
We would like to thank Shuyu Guo for his valuable contributions to the early-stage exploration and literature review on agent optimisation.
If you have any questions or suggestions, please feel free to contact us via:
Email: j.fang.2@research.gla.ac.uk and Zaiqiao.Meng@glasgow.ac.uk