Stop your AI agents from failing silently. Four battle-tested reliability patterns (circuit breaker, partial success, human-in-the-loop, graceful degradation) for Trigger.dev v4 — with tests, docs, and production upgrade paths.
Most AI agent tutorials show the happy path: LLM responds, task succeeds, everyone's happy. Real production systems need to handle:
- ❌ Cascading failures when your LLM provider is down
- ❌ Partial batch failures (95 items succeed, 5 fail—now what?)
- ❌ Edge cases where AI can't decide and needs human judgment
- ❌ Rate limits forcing you to fall back to cheaper models
This project codifies 4 patterns to handle these scenarios, implemented on Trigger.dev v4 with:
- ✅ Standalone CLI tests (no server needed, runs in ~3ms)
- ✅ Production upgrade paths (Redis, Postgres, real LLMs, Slack, Sentry)
- ✅ Comprehensive docs (testing, deployment, monitoring, cost analysis)
- ✅ Copy-paste ready for your own agent workflows
| Pattern | Problem | Solution | Use Case |
|---|---|---|---|
| 🔴 Circuit Breaker | Upstream service failing repeatedly | Stop trying after N failures, fail fast during cooldown | Prevent wasting $$$ on 1000 failed OpenAI calls |
| 🟡 Partial Success | Batch operations where some items fail | Process individually, retry only failures, track per-item results | 100 documents: 95 succeed, 5 fail with reasons |
| 🟠 Human Escalation | AI hits edge case it can't resolve | Pause workflow, notify human, resume with token | LLM can't parse ambiguous form → human clarifies |
| 🟢 Graceful Degradation | Primary service down or rate-limited | Fall back: GPT-4 → Claude → template response | Maintain 100% uptime, reduce costs during spikes |
```sh
# Clone and test
git clone https://github.com/tanayshah11/ai-agent-error-patterns.git
cd ai-agent-error-patterns
pnpm install

# Run all 4 patterns in ~3ms
pnpm test
```

Expected output:

```
✅ Passed: 4/4
❌ Failed: 0/4
⏱️ Duration: 3ms
```
```sh
cp .env.example .env
# Add your TRIGGER_API_KEY
pnpm dev
# Visit http://localhost:3030
```

Trigger tasks via the UI:

- `agents.circuit-breaker`
- `agents.partial-success`
- `agents.human-escalation`
- `agents.graceful-degradation`
- `agents.test-all-patterns` (runs all 4)
Prevents cascade failures when upstream services are down.
```ts
// src/trigger/circuitBreaker.ts (55 LOC)
// Tracks failures, opens circuit after 5 consecutive fails
// Fails fast (503) during cooldown, auto-closes after recovery
```

Test:

```sh
pnpm test:circuit
```

Sample Output:

```json
{
  "ok": false,
  "attempts": 20,
  "okCount": 17,
  "failCount": 3,
  "consecutivelyTripped": false,
  "openUntil": null,
  "durationMs": 2
}
```

Production Upgrade: Use Redis to persist circuit state across instances (see PRODUCTION.md)
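The core idea fits in a few lines. Below is a minimal in-memory sketch of the pattern, not the project's actual `circuitBreaker.ts` — the threshold and cooldown values are assumptions:

```typescript
// Minimal circuit-breaker sketch. THRESHOLD and COOLDOWN_MS are
// illustrative; the real src/trigger/circuitBreaker.ts may differ.
type BreakerState = { consecutiveFails: number; openUntil: number | null };

const THRESHOLD = 5;        // open after 5 consecutive failures
const COOLDOWN_MS = 30_000; // fail fast for 30s, then allow a retry

const state: BreakerState = { consecutiveFails: 0, openUntil: null };

function callWithBreaker<T>(fn: () => T): T {
  const now = Date.now();
  if (state.openUntil !== null && now < state.openUntil) {
    // Circuit is open: fail fast instead of hitting the upstream service.
    throw new Error("503: circuit open, failing fast");
  }
  try {
    const result = fn();
    // Success closes the circuit and resets the failure counter.
    state.consecutiveFails = 0;
    state.openUntil = null;
    return result;
  } catch (err) {
    state.consecutiveFails += 1;
    if (state.consecutiveFails >= THRESHOLD) {
      state.openUntil = now + COOLDOWN_MS; // trip the breaker
    }
    throw err;
  }
}
```

Because the state lives in module scope, it only protects a single process — which is exactly why the production upgrade moves it into Redis.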
Process batches where some items fail but others succeed.
```ts
// src/trigger/partialSuccess.ts (40 LOC)
// Processes items individually with retry logic
// Distinguishes fatal (TOKEN_LIMIT) vs retryable (RATE_LIMIT) errors
// Returns per-item results with attempt counts
```

Test:

```sh
pnpm test:partial
```

Sample Output:

```json
{
  "ok": true,
  "okCount": 8,
  "failedCount": 0,
  "durationMs": 0,
  "results": {
    "item:a": { "ok": true, "attempts": 1 },
    "item:b": { "ok": true, "attempts": 1 },
    ...
  }
}
```

Production Upgrade: Add database persistence for idempotency keys (see PRODUCTION.md)
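The shape of the pattern — per-item try/catch, retry only on retryable codes, report everything — can be sketched like this. This is an assumed sketch, not the project's `partialSuccess.ts`; only the `RATE_LIMIT`/`TOKEN_LIMIT` codes come from the README:

```typescript
// Per-item batch processing with retryable vs fatal errors.
// MAX_ATTEMPTS and the handler signature are assumptions.
type ItemResult = { ok: boolean; attempts: number; reason?: string };

const MAX_ATTEMPTS = 3;
const RETRYABLE = new Set(["RATE_LIMIT"]); // fatal codes (e.g. TOKEN_LIMIT) are not retried

function processBatch(
  items: string[],
  handler: (item: string) => void
): { okCount: number; failedCount: number; results: Record<string, ItemResult> } {
  const results: Record<string, ItemResult> = {};
  for (const item of items) {
    let outcome: ItemResult = { ok: false, attempts: 0 };
    for (let attempts = 1; attempts <= MAX_ATTEMPTS; attempts++) {
      try {
        handler(item);
        outcome = { ok: true, attempts };
        break;
      } catch (err) {
        const code = (err as Error).message;
        outcome = { ok: false, attempts, reason: code };
        if (!RETRYABLE.has(code)) break; // fatal: stop retrying this item
      }
    }
    results[item] = outcome; // one failure never aborts the whole batch
  }
  const okCount = Object.values(results).filter((r) => r.ok).length;
  return { okCount, failedCount: items.length - okCount, results };
}
```

The key design choice is that the loop body never rethrows: every item produces a result object, so 95/100 successes are preserved instead of being discarded with the batch.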
Pause workflows when AI hits edge cases requiring human judgment.
```ts
// src/trigger/humanEscalation.ts (26 LOC)
// First run: escalates, returns resumeToken
// Second run: validates token, resumes execution
```

Test:

```sh
pnpm test:escalation
```

Sample Output (Escalation):

```json
{
  "ok": false,
  "escalated": true,
  "message": "Escalated for review. Re-run with resumeToken to continue.",
  "resumeToken": "RESUME-123",
  "durationMs": 0
}
```

Sample Output (Resume):

```json
{
  "ok": true,
  "resumed": true,
  "durationMs": 0
}
```

Production Upgrade: Integrate Slack webhooks, secure token storage with expiry (see PRODUCTION.md)
Maintain 100% uptime by falling back to cheaper/faster models.
```ts
// src/trigger/gracefulDegradation.ts (38 LOC)
// Try primary (GPT-4) → secondary (Claude) → template
// Always returns a response, tracks degraded state
```

Test:

```sh
pnpm test:degradation
```

Sample Output:

```json
{
  "ok": true,
  "model": "secondary",
  "degraded": true,
  "output": "fallback(Explain the )",
  "durationMs": 0
}
```

Production Upgrade: Add real LLM APIs, cost tracking, response caching (see PRODUCTION.md)
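A minimal sketch of the fallback chain, assuming a list of tiers tried in order with a template as last resort (again illustrative, not the actual `gracefulDegradation.ts`):

```typescript
// Multi-tier fallback: try each tier in order; any tier after the
// first marks the response as degraded; the template never fails.
type Tier = { model: string; call: (prompt: string) => string };

function respondWithFallback(
  prompt: string,
  tiers: Tier[],
  template: (prompt: string) => string
): { ok: true; model: string; degraded: boolean; output: string } {
  for (let i = 0; i < tiers.length; i++) {
    try {
      const output = tiers[i].call(prompt);
      return { ok: true, model: tiers[i].model, degraded: i > 0, output };
    } catch {
      // This tier is down or rate-limited; fall through to the next one.
    }
  }
  // Last resort: a canned template response so callers always get output.
  return { ok: true, model: "template", degraded: true, output: template(prompt) };
}
```

The `degraded` flag is what makes this observable: dashboards can alert on degraded-response rates even though every request still "succeeds".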
```
├── src/trigger/
│   ├── circuitBreaker.ts        # Circuit breaker (55 LOC)
│   ├── partialSuccess.ts        # Batch retry (40 LOC)
│   ├── humanEscalation.ts       # Pause/resume (26 LOC)
│   ├── gracefulDegradation.ts   # Multi-tier fallback (38 LOC)
│   └── testRunner.ts            # Dashboard test suite
├── test-standalone.ts           # CLI test runner (no server needed)
├── CLI-TESTING.md               # Command-line testing guide
├── TESTING.md                   # Comprehensive test scenarios
├── PRODUCTION.md                # Full deployment guide
│   ├── Prisma schema (5 models)
│   ├── Redis integration
│   ├── Real LLM APIs (OpenAI, Anthropic)
│   ├── Slack webhooks
│   ├── Sentry monitoring
│   ├── Cost optimization
│   └── Security checklist
├── .env.example                 # Environment template
└── trigger.config.ts            # Trigger.dev v4 config
```
```sh
# All patterns
pnpm test

# Individual patterns
pnpm test:circuit      # Circuit breaker
pnpm test:partial      # Partial success
pnpm test:escalation   # Human escalation
pnpm test:degradation  # Graceful degradation
```

Why CLI tests?
- ✅ Runs in ~3ms (perfect for CI/CD)
- ✅ No Trigger.dev server needed
- ✅ No external APIs required (mock mode)
- ✅ Exit code 0 on success, 1 on failure
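The exit-code contract above is what makes the runner CI-friendly. A minimal sketch of that idea — not the actual `test-standalone.ts`, whose structure may differ:

```typescript
// Standalone test-runner sketch: run each pattern check, print a
// per-test line, and return an exit code CI can act on.
type PatternTest = { name: string; run: () => boolean };

function runAll(tests: PatternTest[]): number {
  const start = Date.now();
  let passed = 0;
  for (const t of tests) {
    let ok: boolean;
    try {
      ok = t.run();
    } catch {
      ok = false; // a throwing test counts as a failure, not a crash
    }
    console.log(`${ok ? "✅" : "❌"} ${t.name}`);
    if (ok) passed++;
  }
  console.log(`Passed: ${passed}/${tests.length} in ${Date.now() - start}ms`);
  return passed === tests.length ? 0 : 1; // exit code for CI
}
```

Wiring it up is one line, e.g. `process.exit(runAll(tests))`, which is what lets `pnpm test` gate a CI job or a pre-commit hook.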
```sh
pnpm dev
# Visit http://localhost:3030
# Trigger any task: agents.circuit-breaker, agents.partial-success, etc.
```

See CLI-TESTING.md for CI/CD integration examples.
Ready to deploy? PRODUCTION.md includes:
- ✅ Upstash Redis (circuit breaker state)
- ✅ Neon/Supabase PostgreSQL (idempotency, tokens, audit logs)
- ✅ Prisma schema (5 models: CircuitBreakerState, EscalationToken, BatchItem, AuditLog, LLMUsage)
- ✅ OpenAI + Anthropic APIs (real LLM calls)
- ✅ Slack webhooks (escalation notifications)
- ✅ Sentry (error tracking)
- ✅ OpenTelemetry (tracing)
- ✅ Deployment steps
- ✅ Alerts & runbooks
- ✅ Performance tuning
- ✅ Cost optimization (LLM caching, model tiers)
- ✅ Security checklist
Estimated cost: $40-70/month (low volume) | $150-500/month (high volume)
| Doc | Purpose |
|---|---|
| README.md | This file—quick start and overview |
| CLI-TESTING.md | Command-line testing, CI/CD integration |
| TESTING.md | Detailed test scenarios, edge cases, observability |
| PRODUCTION.md | Full deployment guide with infrastructure |
| Pattern | Scenario | Without | With |
|---|---|---|---|
| Circuit Breaker | OpenAI API is down | 1000 failed requests, wasted $$$ | Circuit opens after 5 failures, fails fast |
| Partial Success | Process 100 documents, 5 invalid | Entire batch fails | 95 succeed, 5 fail with detailed reasons |
| Human Escalation | LLM can't parse form | Stuck in retry loop | Pauses, notifies human, resumes after fix |
| Graceful Degradation | GPT-4 rate limit hit | All requests fail | Falls back to Claude → template |
- Trigger.dev v4 - Background job orchestration
- TypeScript 5.5 - Type-safe development
- tsx - Fast TypeScript execution
- Mock mode - No external APIs required for testing
- Production-ready - Prisma, Redis, LLMs, Slack, Sentry integrations
```yaml
name: Test AI Agent Patterns
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: pnpm/action-setup@v2
      - uses: actions/setup-node@v3
        with:
          node-version: "20"
          cache: "pnpm"
      - run: pnpm install
      - run: pnpm test
```

Or as a pre-commit hook:

```sh
#!/bin/sh
pnpm test || exit 1
```

- Clone & Test: `git clone` → `pnpm install` → `pnpm test`
- Read Docs: Start with CLI-TESTING.md
- Run Dashboard: `pnpm dev` → http://localhost:3030
- Go Production: Follow PRODUCTION.md for deployment
Found a bug? Have a pattern to add? PRs welcome!
- Fork the repo
- Create a branch: `git checkout -b feature/new-pattern`
- Make changes and test: `pnpm test`
- Submit a PR
MIT — use freely in your own projects.
Built with Trigger.dev v4 to demonstrate production-grade error handling for AI agent workflows.
Questions? Open an issue or reach out on Twitter | LinkedIn
If this helped you build more reliable AI agents, consider giving it a ⭐!