Why AI Agents Loop (And How to Stop Them) #58
bmdhodl announced in Announcements
Every team building with AI agents hits the same wall. Your agent works fine in testing, you deploy it, and somewhere around 2 AM it burns through your API budget calling the same tool 200 times in a row.
This isn't a rare bug. It's the default failure mode of any agentic system without runtime guardrails.
The Problem
AI agents enter infinite tool-call loops. Not occasionally — routinely.
Real examples from the wild:
- `search()` called 200+ times with the same query because the results never quite satisfied the model's criteria

These aren't edge cases from toy demos. They're production incidents from teams running agents with real users and real money on the line.
Why It Happens
Three root causes explain nearly every agent loop:
1. Models ignore prompt-level stop instructions
You can write "NEVER call search more than 3 times" in your system prompt. The model will comply most of the time. But LLMs are probabilistic — they don't execute instructions, they predict tokens. Under the right (wrong) conditions, the model will confidently ignore your instruction and keep going.
Prompt-level guardrails are suggestions, not constraints.
2. Unsatisfying tool results trigger infinite retries
When a tool returns results that don't match what the model expects, many agents will retry with the same or nearly identical arguments. The model "thinks" it needs to try again, but the tool output is deterministic — same input, same output, forever.
This is especially common with search and retrieval tools where the model has a specific answer in mind and the corpus doesn't contain it.
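A toy sketch makes the failure concrete (the function names and corpus here are illustrative, not from the repo): the tool is deterministic, so an identical retry can never produce a different result, and the only thing that can break the cycle is a check for the repeat itself.

```python
# Illustrative sketch: a deterministic retrieval tool plus a repeat check.
# Same input always yields the same output, so "try again" with identical
# arguments can never succeed.

def search(query: str) -> list[str]:
    # Deterministic stand-in for a retrieval tool; the corpus simply
    # doesn't contain what the model is looking for.
    corpus = {"python": ["docs.python.org"]}
    return corpus.get(query, [])

seen: set[tuple[str, str]] = set()

def call_tool(name: str, arg: str) -> list[str]:
    # Short-circuit an exact repeat instead of re-executing the tool.
    key = (name, arg)
    if key in seen:
        raise RuntimeError(f"repeat call detected: {name}({arg!r})")
    seen.add(key)
    return search(arg)

print(call_tool("search", "rust async"))   # [] : an unsatisfying result
try:
    call_tool("search", "rust async")      # the model retries identically
except RuntimeError as e:
    print(e)                               # the repeat check breaks the cycle
```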
3. Multi-agent systems cascade failures
Agent A asks Agent B for data. Agent B fails and returns an error. Agent A retries. Agent B fails again. Now multiply this across a graph of 5-10 agents, each with their own retry logic, and you get exponential failure cascading.
One stuck agent can drag the entire system into a loop.
Why `max_iterations` Isn't Enough
LangChain's `max_iterations` parameter is the most common "fix" people reach for. Set it to 25 and the agent stops after 25 steps. Problem solved?
No. `max_iterations` is a blunt instrument. It caps total steps regardless of whether they're productive. You need to detect the pattern, not just count steps. An agent that calls `search("python async")`, then `read_file("main.py")`, then `write_file("main.py")` over 50 steps is working. An agent that calls `search("python async")` three times in a row is stuck. `max_iterations` can't tell the difference.
Worse, setting it too low kills legitimate long-running workflows. Setting it too high means you're still burning tokens on loops before the cap kicks in. There's no good number because the right limit depends on what the agent is doing, not how many steps it's taken.
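The "pattern, not count" point can be shown in a few lines (the traces are illustrative): two traces of identical length look the same to a step counter, while counting repeated call signatures separates them immediately.

```python
# Sketch: a step cap like max_iterations only sees trace length;
# counting repeated (tool, args) signatures sees the actual pattern.
from collections import Counter

productive = [("search", "python async"),
              ("read_file", "main.py"),
              ("write_file", "main.py")]
stuck = [("search", "python async")] * 3

def max_repeats(trace):
    # How often is the single most common call signature repeated?
    return max(Counter(trace).values())

print(len(productive), len(stuck))                  # 3 3
print(max_repeats(productive), max_repeats(stuck))  # 1 3
```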
Runtime Guards: The Right Approach
The fix is to check before each tool execution and raise an exception to forcibly break the loop when a bad pattern is detected. Not after the fact in logs. Not via prompt instructions the model might ignore. At runtime, in code, with real enforcement.
Three guard types cover the majority of failure modes:
LoopGuard — Detects identical or near-identical tool calls within a sliding window. If the same function is called with the same arguments N times in the last M calls, something is wrong.
BudgetGuard — Enforces hard limits on token consumption, API call count, or dollar cost. When the budget is spent, the agent stops. No exceptions.
TimeoutGuard — Wall-clock time limits. If an agent run exceeds N seconds, it's terminated. Catches slow-burn loops that stay under call-count limits by spacing out requests.
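The repo's actual implementations live in `guards.py`; as a rough sketch of the sliding-window idea behind LoopGuard (class and threshold names mirror the description above, but the real code may differ), the check reduces to normalizing each call into a hashable signature and counting its occurrences in a bounded deque:

```python
from collections import deque

class LoopDetected(Exception):
    """Raised when the same tool call repeats too often in the window."""

class LoopGuard:
    def __init__(self, max_repeats: int = 3, window: int = 6):
        self.max_repeats = max_repeats
        # Sliding window of recent call signatures; old entries fall off.
        self.recent = deque(maxlen=window)

    def check(self, tool: str, args: dict) -> None:
        # Normalize the call into a hashable signature.
        sig = (tool, tuple(sorted(args.items())))
        self.recent.append(sig)
        if self.recent.count(sig) >= self.max_repeats:
            raise LoopDetected(
                f"{tool} called {self.max_repeats}x "
                f"in last {len(self.recent)} tool calls"
            )

guard = LoopGuard()
guard.check("search", {"q": "python async"})   # ok
guard.check("read_file", {"path": "main.py"})  # ok
guard.check("search", {"q": "python async"})   # ok, second repeat
try:
    guard.check("search", {"q": "python async"})  # third identical call
except LoopDetected as e:
    print(e)
```

Because the window is a `deque(maxlen=...)`, a long productive run never trips the guard: distinct calls push old signatures out before the repeat count is reached.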
- `LoopGuard.check()` raises `LoopDetected` if it sees 3 identical `search` calls in the last 6 tool invocations.
- `BudgetGuard.consume()` raises `BudgetExceeded` when cumulative cost crosses $5.00.
- `TimeoutGuard.check()` raises `TimeoutExceeded` after 120 seconds.

These are real Python exceptions. They propagate up the call stack and stop execution immediately. The model doesn't get a chance to "decide" whether to keep going — the runtime decides for it.
This is the key insight: guardrails must operate at the runtime level, not the prompt level. You can't ask a stuck model to unstick itself. You have to forcibly intervene.
LangChain Integration
If you're using LangChain, you don't need to wire guards into every tool call manually. A single callback handler does it:
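The real handler is in `integrations/langchain.py`; a minimal sketch of the shape looks like this. Note the assumptions: in actual LangChain code the handler subclasses `langchain_core.callbacks.BaseCallbackHandler` (a stub base class is used here so the sketch runs standalone), and `GuardedCallbackHandler` and `AlwaysTrips` are hypothetical names, not necessarily the repo's.

```python
# Sketch of a guard-running callback handler. In real LangChain code,
# replace the stub with:
#   from langchain_core.callbacks import BaseCallbackHandler
class BaseCallbackHandler:  # stand-in so the sketch runs without LangChain
    pass

class LoopDetected(Exception):
    pass

class GuardedCallbackHandler(BaseCallbackHandler):
    def __init__(self, guards):
        self.guards = guards

    def on_tool_start(self, serialized, input_str, **kwargs):
        # LangChain fires this hook before every tool execution; running
        # each guard here means a bad pattern raises before the call happens.
        tool_name = serialized.get("name", "unknown")
        for guard in self.guards:
            guard.check(tool_name, input_str)

# A trivial guard, just to show the exception propagating out of the hook.
class AlwaysTrips:
    def check(self, tool_name, input_str):
        raise LoopDetected(f"loop detected before {tool_name}")

handler = GuardedCallbackHandler(guards=[AlwaysTrips()])
try:
    handler.on_tool_start({"name": "search"}, "python async")
except LoopDetected as e:
    print(e)
```

With the real base class, you would pass the handler to the agent via its `callbacks` list at invocation time, leaving the agent code itself untouched.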
The handler hooks into `on_tool_start` and runs guard checks before every tool execution. If a guard trips, the exception propagates and the agent run terminates cleanly. Your existing agent code doesn't change — you just add the callback.
Try It
Zero dependencies. Python 3.9+. MIT licensed.
The full source is in this repo under `sdk/agentguard/`. The guards are in `guards.py`, the LangChain integration is in `integrations/langchain.py`, and there's a working demo in `examples/demo_agent.py`.

If you've dealt with agent loops in production, I'd like to hear about it — what patterns you saw, what worked, what didn't. Drop a comment below.