The moment your data pipeline needs judgment, the economics change.
We asked Claude to collect financial data on 1,000 companies. It started inventing earnings numbers.
Not maliciously - it saw tedious, repetitive work and took shortcuts. This is a documented anti-pattern:
"LLMs are 'lazy learners' that tend to exploit shortcuts in prompts for downstream tasks."
— arXiv:2305.17256

"Larger models are MORE likely to utilize shortcuts during inference."
— Same paper. Counterintuitive but documented.

"An LLM tends to behave like humans: it often goes for the easiest answer rather than the best one."
— Towards Data Science
Even with RAG and best practices, hallucination rates remain 5-20% on complex tasks (2026 benchmarks). When LLMs face bulk tedious work, they fabricate to "complete" rather than admit "I can't fetch this."
The solution: separate what LLMs are BAD at (tedious collection) from what they're GOOD at (pattern recognition).
| Task Type | LLM Behavior | Who Should Do It |
|---|---|---|
| Tedious data gathering | Takes shortcuts, hallucinates | Donkeys (mechanical scripts) |
| Pattern recognition | Actually excellent | Claude (expensive AI) |
| Validation (yes/no questions) | Good and cheap | Kong (local LLM) |
This is "Kong in the Loop" architecture.
If your validation can be done with regex, use a `for` loop with `time.sleep()`.
If your validation requires reasoning, you need an LLM.
If you need an LLM at 10,000+ entities, you can't afford cloud APIs.
That's why DonkeyKong exists.
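The regex-vs-reasoning split above can be sketched in a few lines (illustrative only, not DonkeyKong API; the quarter-label format and prompt are hypothetical):

```python
import re

# Structural validation: a regex is enough -- no LLM, no API cost.
def looks_like_quarter(label: str) -> bool:
    return re.fullmatch(r"Q[1-4] 20\d{2}", label) is not None

# Semantic validation: answering this requires reading comprehension,
# which is where a cheap local LLM earns its keep.
SEMANTIC_CHECK = "Does this filing actually report quarterly revenue? Answer yes or no."
```

`looks_like_quarter("Q3 2024")` passes mechanically; a question like `SEMANTIC_CHECK` cannot be answered by any pattern match, so it needs a model.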
"Expensive intelligence does the work once. Cheap intelligence challenges it many times. Only failures go back to expensive intelligence."
This is how humans review work:
- Senior analyst does the analysis
- Junior analyst checks it, asks questions
- Senior only re-reviews what junior flagged
DonkeyKong implements this with AI:
- Claude/GPT-4 (expensive) does deep analysis
- Kong (local Ollama, free) validates and challenges
- Only low-confidence items get reanalyzed
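The senior/junior loop above, as a minimal sketch; `analyze`, `kong_check`, and `reanalyze` are hypothetical stand-ins for Claude, Kong, and the targeted rerun:

```python
# Expensive model runs once per entity; cheap model screens everything;
# only flagged entities go back to the expensive model.
def review_pipeline(entities, analyze, kong_check, reanalyze):
    results = {e: analyze(e) for e in entities}                     # expensive pass, once each
    flagged = [e for e in entities if not kong_check(results[e])]   # cheap pass, every entity
    for e in flagged:                                               # expensive pass, failures only
        results[e] = reanalyze(e)
    return results, flagged
```

If Kong flags 15% of items, the second expensive pass touches 15% of the entities instead of all of them.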
```
┌───────────────────────────────────────────────────────────────────┐
│                  "Kong in the Loop" Architecture                  │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  PHASE 1: MECHANICAL COLLECTION (Donkeys - no LLM)                │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐                           │
│   │Worker 1 │  │Worker 2 │  │Worker N │  → Raw Data               │
│   │(scripts)│  │(scripts)│  │(scripts)│    (real, not invented)   │
│   └────┬────┘  └────┬────┘  └────┬────┘                           │
│        └────────────┼────────────┘                                │
│                     ▼                                             │
│   ┌───────────────────────────────────────────┐                   │
│   │ 🦍 Kong: DATA VALIDATION (free)           │  ← LLM HERE       │
│   │  • "Is this response complete?"           │                   │
│   │  • "Did we get all 12 quarters?"          │                   │
│   │  • Catches collection failures            │                   │
│   └────────────────┬──────────────────────────┘                   │
│                    ▼ (verified REAL data)                         │
│                                                                   │
│  PHASE 2: INTELLIGENT ANALYSIS (Claude - expensive)               │
│   ┌───────────────────────────────────────────┐                   │
│   │ Pattern recognition on VERIFIED data      │                   │
│   │  • Cannot invent inputs (they're real)    │                   │
│   │  • Does what LLMs are good at             │                   │
│   │  → Scores, patterns, conclusions          │                   │
│   └────────────────┬──────────────────────────┘                   │
│                    ▼                                              │
│   ┌───────────────────────────────────────────┐                   │
│   │ 🦍 Kong: ADVERSARIAL VALIDATION (free)    │  ← LLM HERE       │
│   │  • "Did you USE all the data I gave you?" │                   │
│   │  • "Your score doesn't match evidence"    │                   │
│   │  • "What would change your conclusion?"   │                   │
│   │  • Catches bullshit analysis              │                   │
│   └────────────────┬──────────────────────────┘                   │
│                    ▼                                              │
│  PHASE 3: TARGETED RERUN (only ~15% failures)                     │
│   ┌───────────────────────────────────────────┐                   │
│   │ Only low-confidence items reanalyzed      │                   │
│   │  + Missing data added                     │                   │
│   │  + Adversarial questions addressed        │                   │
│   │                                           │                   │
│   │ Cost: 85% less than rerunning all         │                   │
│   └───────────────────────────────────────────┘                   │
└───────────────────────────────────────────────────────────────────┘
```
The key insight: validation is easier than generation.
| Task | Difficulty | Model Needed |
|---|---|---|
| Generate 12 quarters of earnings data | HARD (will hallucinate) | None - use scripts |
| "Is this JSON complete?" | EASY | Cheap local LLM |
| Analyze patterns in verified data | MEDIUM | Expensive cloud LLM |
| "Did you cite all 6 sources?" | EASY | Cheap local LLM |
Kong can run unlimited passes at $0 cost because validation is:
- Answering yes/no questions about data that EXISTS
- Checking if conclusions match evidence
- Asking adversarial questions
Claude only does the middle part - the actual intelligence work.
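An "is this JSON complete?" check from the table above, as a sketch; the `quarters`/`eps` schema and the expected count of 12 mirror the earnings example and are hypothetical:

```python
# Checking data that EXISTS is cheap: count the quarters, confirm each
# one carries an EPS figure. No generation, no hallucination risk.
def all_quarters_present(data: dict, expected: int = 12) -> bool:
    quarters = data.get("quarters", [])
    return len(quarters) == expected and all("eps" in q for q in quarters)
```

For purely structural checks like this one, even the local LLM is optional; Kong's rule-based layer covers it.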
Donkeys collect → Kong validates quality → Retry failures

```python
from donkeykong import Pipeline, OllamaValidator

pipeline = Pipeline(entities=urls, kong=OllamaValidator())
pipeline.run()  # Kong validates each collected item
```

Claude analyzes → Kong challenges → Rerun low-confidence only
```python
from donkeykong.kong import AdversarialValidator

validator = AdversarialValidator()
for entity, analysis, raw_data in results:
    result = validator.validate(entity, analysis, raw_data)
    if result.should_rerun:
        reanalyze(entity, questions=result.adversarial_questions)
```

- Donkey = Load-bearing Docker workers hauling data (pack animals doing the heavy lifting)
- Kong = Local LLM sitting on top, managing and QC'ing the output (the king overseeing the donkeys)
| Approach | Collection | Validation | Cost at 10K entities |
|---|---|---|---|
| Python script + sleep | Sequential | Regex/schema only | $0 but dumb |
| Python script + cloud LLM | Sequential | Intelligent | $100-500 |
| DonkeyKong | Parallel | Intelligent + local | ~$0 |
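The back-of-envelope behind the $100-500 cell, assuming (illustratively, these are not quoted prices) $0.01 to $0.05 per cloud validation call:

```python
# 10,000 entities, one cloud validation call each, at assumed per-call prices.
n = 10_000
low, high = 0.01 * n, 0.05 * n   # -> 100.0, 500.0, i.e. the $100-500 row
```

Local validation via Ollama removes the per-call price entirely, which is why the DonkeyKong row rounds to ~$0.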
- 🐴 Distributed Workers: Docker containers with range-based task assignment
- 🦍 Local LLM QC: Ollama integration for intelligent validation (Llama, Mistral, Phi)
- 📊 Real-time Monitoring: Redis pub/sub for progress tracking
- 🔁 Fault Tolerance: Automatic retry with configurable strategies
- 💾 Checkpointing: Resume from failures without losing progress
- 🔌 Three Interfaces: CLI, Python API, and MCP Server
```shell
pip install donkeykong

# Collect URLs with quality validation
dk collect urls.txt --workers 10 --validator quality_check

# Monitor progress
dk status

# Retry failures with different strategy
dk retry --strategy aggressive
```

```python
from donkeykong import Pipeline, OllamaValidator

# Define your collector
class MyCollector(Pipeline):
    def collect(self, entity):
        # Your collection logic
        return {"data": fetch_data(entity)}

    def validate(self, entity, data):
        # Kong validates with local LLM
        return self.kong.validate(
            data,
            prompt="Is this data complete and accurate?",
        )

# Run distributed collection
pipeline = MyCollector(
    entities=my_entity_list,
    workers=10,
    kong=OllamaValidator(model="llama3.2"),
)
pipeline.run()
```

Add to your Claude Desktop config:
```json
{
  "mcpServers": {
    "donkeykong": {
      "command": "dk",
      "args": ["mcp-server"]
    }
  }
}
```

Then talk to Claude:
"Start collecting these 1000 URLs and validate each page has pricing information"
"How's the collection going?"
"These 12 failed - retry them with a different user agent"
```shell
# Core package
pip install donkeykong

# With Ollama support (recommended)
pip install donkeykong[ollama]

# Full installation with MCP
pip install donkeykong[full]
```

- Docker & Docker Compose
- Redis (included in docker-compose)
- Ollama (optional, for Kong LLM validation)

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2
```

A complete working example that collects Wikipedia articles and uses a local LLM to assess content quality:

```shell
cd examples/wikipedia_quality
docker-compose up
```

See examples/wikipedia_quality/README.md for details.
- Isolation: Each worker runs in its own container
- Scalability: `docker-compose up --scale worker=100`
- Reproducibility: Same environment everywhere
- Coordination: Workers claim tasks atomically
- Real-time: Pub/sub for instant progress updates
- Fault tolerance: Workers can restart without losing progress
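Atomic task claiming is the core of the coordination point above. A sketch, assuming a duck-typed Redis client (with redis-py this would be `redis.Redis().lpop(...)`; the queue name `dk:tasks` is hypothetical, not DonkeyKong's actual key):

```python
# LPOP is a single atomic Redis command: even with 100 workers racing,
# each task is handed to exactly one worker, and None signals "queue drained".
def claim_next_task(client, queue: str = "dk:tasks"):
    return client.lpop(queue)
```

Because the claim is atomic server-side, workers need no locks or leader election; a crashed worker simply stops popping and the rest drain the queue.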
- Cost: $0 per validation vs $0.01+ per API call
- Speed: No rate limits, no network latency
- Privacy: Data never leaves your infrastructure
- Unlimited retries: Validate as many times as needed
Kong works without Ollama installed, but with reduced capability:
| Feature | With Ollama | Without Ollama |
|---|---|---|
| Rule-based validation | ✅ Full | ✅ Full |
| Completeness checking | ✅ Full | ✅ Full |
| Consistency checking | ✅ Full | ✅ Full |
| Logic checking | ✅ Full | ✅ Full |
| Adversarial questions | ✅ LLM-generated + rules | ⚠️ Rules only |
| Deep semantic analysis | ✅ Yes | ❌ No |
Without Ollama, Kong still catches:
- Missing data sources
- High confidence with low data quality
- Extreme scores without evidence
- Recommendations without findings
With Ollama, Kong additionally:
- Generates deeper adversarial questions
- Performs semantic analysis of findings
- Catches subtle logical inconsistencies
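Two of the rule-based checks above, sketched as code; the field names and thresholds are hypothetical, not DonkeyKong's actual schema:

```python
# Pure-rule checks need no model at all: they compare fields the analysis
# already contains against each other.
def rule_based_issues(analysis: dict) -> list[str]:
    issues = []
    # High confidence on thin data is a classic sign of a fabricated analysis.
    if analysis.get("confidence", 0.0) > 0.8 and analysis.get("data_quality", 1.0) < 0.5:
        issues.append("high confidence with low data quality")
    # Recommendations must be backed by findings.
    if analysis.get("recommendations") and not analysis.get("findings"):
        issues.append("recommendations without findings")
    return issues
```

These checks are free and deterministic, which is why they run whether or not Ollama is installed.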
```python
# Check if Ollama enhances validation
from donkeykong.kong import AdversarialValidator, OllamaAdversarialValidator

# Rule-based only (always works)
validator = AdversarialValidator()

# LLM-enhanced (requires Ollama running)
try:
    validator = OllamaAdversarialValidator(model="llama3.2")
except ImportError:
    print("Ollama not installed, using rule-based validation")
    validator = AdversarialValidator()
```

```yaml
# donkeykong.yml
workers: 10
redis_url: redis://localhost:6379

kong:
  provider: ollama
  model: llama3.2
  validation_prompt: |
    Evaluate this data for completeness and accuracy.
    Return JSON: {"valid": bool, "issues": [...], "retry": bool}

collection:
  rate_limit: 2.0           # seconds between requests per worker
  retry_attempts: 3
  checkpoint_interval: 100  # save progress every N entities
```

When running as an MCP server, DonkeyKong exposes these tools to Claude:
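The `validation_prompt` above asks Kong for a JSON verdict; a defensive parser for it could look like this (a sketch: the field names follow the prompt's schema, and treating malformed output as a failed validation is an assumed policy, so garbage triggers a retry instead of silently passing):

```python
import json

def parse_verdict(raw: str) -> dict:
    """Parse Kong's JSON verdict; unparseable output counts as invalid."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "issues": ["unparseable validator output"], "retry": True}
    valid = bool(verdict.get("valid", False))
    return {
        "valid": valid,
        "issues": list(verdict.get("issues", [])),
        "retry": bool(verdict.get("retry", not valid)),  # default: retry anything invalid
    }
```

Local models occasionally wrap JSON in prose, so failing closed here is what keeps bad validator output from being mistaken for a pass.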
| Tool | Description |
|---|---|
| `donkeykong_start` | Start a new collection job |
| `donkeykong_status` | Get current progress and stats |
| `donkeykong_failures` | List failed entities with reasons |
| `donkeykong_retry` | Retry failed entities with new strategy |
| `donkeykong_validate` | Manually validate a sample |
| `donkeykong_stop` | Gracefully stop collection |
DonkeyKong is ideal for any data pipeline that needs intelligent validation:
- Web scraping with content quality checks
- Document processing pipelines
- Training data curation for ML
- Knowledge graph construction
- Research data gathering
- API harvesting with response validation
- ETL pipelines where "is this data good?" requires reasoning
Contributions welcome! See CONTRIBUTING.md.
```shell
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=donkeykong --cov-report=term-missing

# Run the reproducible benchmark
cd examples/wikipedia_quality
python benchmark.py --articles 50
```

The Wikipedia benchmark provides verifiable metrics:
| Metric | Expected | Notes |
|---|---|---|
| Collection success | 95%+ | Wikipedia API is reliable |
| Validation pass rate | 70-85% | Kong catches intentional flaws |
| Flagged for review | 15-30% | Adversarial questioning works |
MIT License - see LICENSE.
DonkeyKong: Because sometimes the best solution is to throw more barrels at the problem 🦍🛢️
Built with Docker, Redis, Ollama, and a healthy respect for distributed systems