ekras-doloop/donkeykong

🦍 DonkeyKong

Distributed Collection, Local Intelligence

The moment your data pipeline needs judgment, the economics change.


The Problem We Actually Solved

We asked Claude to collect financial data on 1,000 companies. It started inventing earnings numbers.

Not maliciously - it saw tedious, repetitive work and took shortcuts. This is a documented anti-pattern:

"LLMs are 'lazy learners' that tend to exploit shortcuts in prompts for downstream tasks."
β€” arXiv:2305.17256

"Larger models are MORE likely to utilize shortcuts during inference."
β€” Same paper. Counterintuitive but documented.

"An LLM tends to behave like humans: it often goes for the easiest answer rather than the best one."
β€” Towards Data Science

Even with RAG and best practices, hallucination rates remain 5-20% on complex tasks (2026 benchmarks). When LLMs face bulk tedious work, they fabricate to "complete" rather than admit "I can't fetch this."

The solution: separate what LLMs are BAD at (tedious collection) from what they're GOOD at (pattern recognition).

| Task Type | LLM Behavior | Who Should Do It |
| --- | --- | --- |
| Tedious data gathering | Takes shortcuts, hallucinates | Donkeys (mechanical scripts) |
| Pattern recognition | Actually excellent | Claude (expensive AI) |
| Validation (yes/no questions) | Good and cheap | Kong (local LLM) |

This is "Kong in the Loop" architecture.


Why DonkeyKong?

If your validation can be done with regex, use a for-loop with time.sleep().

If your validation requires reasoning, you need an LLM.

If you need an LLM at 10,000+ entities, you can't afford cloud APIs.

That's why DonkeyKong exists.
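For calibration, the regex-only baseline from the first rung of that ladder really is just a few lines of plain Python. A sketch, where `looks_like_price_page` and the in-memory `pages` dict are illustrative stand-ins for your fetch logic:

```python
import re
import time

def looks_like_price_page(html: str) -> bool:
    """Regex-level 'validation': does the page mention a dollar amount?"""
    return re.search(r"\$\d[\d,]*(\.\d{2})?", html) is not None

def collect_serially(pages: dict, delay: float = 0.0) -> list:
    """Sequential loop with polite sleeps - no LLM, no workers, no Redis."""
    valid = []
    for url, html in pages.items():
        if looks_like_price_page(html):
            valid.append(url)
        time.sleep(delay)  # rate limiting between requests
    return valid
```

If a loop like this answers your validation question, stop there; DonkeyKong is for the cases where it can't.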

The Core Pattern

"Expensive intelligence does the work once. Cheap intelligence challenges it many times. Only failures go back to expensive intelligence."

This is how humans review work:

  • Senior analyst does the analysis
  • Junior analyst checks it, asks questions
  • Senior only re-reviews what junior flagged

DonkeyKong implements this with AI:

  • Claude/GPT-4 (expensive) does deep analysis
  • Kong (local Ollama, free) validates and challenges
  • Only low-confidence items get reanalyzed

┌──────────────────────────────────────────────────────────────────┐
│              "Kong in the Loop" Architecture                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  PHASE 1: MECHANICAL COLLECTION (Donkeys - no LLM)               │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                           │
│  │Worker 1 │  │Worker 2 │  │Worker N │  → Raw Data               │
│  │(scripts)│  │(scripts)│  │(scripts)│    (real, not invented)   │
│  └────┬────┘  └────┬────┘  └────┬────┘                           │
│       └────────────┼────────────┘                                │
│                    ▼                                             │
│  ┌─────────────────────────────────────────┐                     │
│  │ 🦍 Kong: DATA VALIDATION (free)         │ ← LLM HERE          │
│  │ • "Is this response complete?"          │                     │
│  │ • "Did we get all 12 quarters?"         │                     │
│  │ • Catches collection failures           │                     │
│  └────────────────┬────────────────────────┘                     │
│                   ▼ (verified REAL data)                         │
│                                                                  │
│  PHASE 2: INTELLIGENT ANALYSIS (Claude - expensive)              │
│  ┌─────────────────────────────────────────┐                     │
│  │ Pattern recognition on VERIFIED data    │                     │
│  │ • Cannot invent inputs (they're real)   │                     │
│  │ • Does what LLMs are good at            │                     │
│  │ → Scores, patterns, conclusions         │                     │
│  └────────────────┬────────────────────────┘                     │
│                   ▼                                              │
│  ┌─────────────────────────────────────────┐                     │
│  │ 🦍 Kong: ADVERSARIAL VALIDATION (free)  │ ← LLM HERE          │
│  │ • "Did you USE all the data I gave you?"│                     │
│  │ • "Your score doesn't match evidence"   │                     │
│  │ • "What would change your conclusion?"  │                     │
│  │ • Catches bullshit analysis             │                     │
│  └────────────────┬────────────────────────┘                     │
│                   ▼                                              │
│  PHASE 3: TARGETED RERUN (only ~15% failures)                    │
│  ┌─────────────────────────────────────────┐                     │
│  │ Only low-confidence items reanalyzed    │                     │
│  │ + Missing data added                    │                     │
│  │ + Adversarial questions addressed       │                     │
│  │                                         │                     │
│  │ Cost: 85% less than rerunning all       │                     │
│  └─────────────────────────────────────────┘                     │
└──────────────────────────────────────────────────────────────────┘

Why This Prevents Hallucination

The key insight: validation is easier than generation.

| Task | Difficulty | Model Needed |
| --- | --- | --- |
| Generate 12 quarters of earnings data | HARD (will hallucinate) | None - use scripts |
| "Is this JSON complete?" | EASY | Cheap local LLM |
| Analyze patterns in verified data | MEDIUM | Expensive cloud LLM |
| "Did you cite all 6 sources?" | EASY | Cheap local LLM |

Kong can run unlimited passes at $0 cost because validation is:

  • Answering yes/no questions about data that EXISTS
  • Checking if conclusions match evidence
  • Asking adversarial questions

Claude only does the middle part - the actual intelligence work.
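Stripped of the specific models, the whole pattern reduces to a short loop. A minimal sketch, where `analyze` stands in for the expensive model and `validate` for Kong (both hypothetical callables, not library API):

```python
from typing import Callable

def kong_in_the_loop(entities: list, analyze: Callable, validate: Callable,
                     max_rounds: int = 2) -> dict:
    """Expensive model runs once per entity; cheap validation picks what reruns."""
    results = {e: analyze(e, questions=None) for e in entities}
    for _ in range(max_rounds):
        verdicts = {e: validate(r) for e, r in results.items()}
        retry = [e for e, v in verdicts.items() if not v["ok"]]
        if not retry:
            break
        for e in retry:  # only flagged items go back to the expensive model
            results[e] = analyze(e, questions=verdicts[e]["questions"])
    return results
```

The cost asymmetry lives in the `retry` list: `validate` runs over everything every round, `analyze` only over what failed.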

Two Modes of Operation

Mode 1: Kong as Data Validator

Donkeys collect → Kong validates quality → Retry failures

from donkeykong import Pipeline, OllamaValidator

pipeline = Pipeline(entities=urls, kong=OllamaValidator())
pipeline.run()  # Kong validates each collected item

Mode 2: Kong as Adversarial Reviewer

Claude analyzes → Kong challenges → Rerun low-confidence only

from donkeykong.kong import AdversarialValidator

validator = AdversarialValidator()
for entity, analysis, raw_data in results:
    result = validator.validate(entity, analysis, raw_data)
    if result.should_rerun:
        reanalyze(entity, questions=result.adversarial_questions)
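`reanalyze` here is your own code. One plausible shape is to fold Kong's objections into the rerun prompt; this helper is hypothetical, not part of the library:

```python
def build_rerun_prompt(entity: str, prior_analysis: str, questions: list) -> str:
    """Turn Kong's adversarial questions into explicit rerun instructions."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        f"Re-analyze {entity}. Your previous analysis was:\n{prior_analysis}\n\n"
        "A reviewer raised these objections; address each one explicitly:\n"
        f"{numbered}"
    )
```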

The Name

  • Donkey = Load-bearing Docker workers hauling data (pack animals doing the heavy lifting)
  • Kong = Local LLM sitting on top, managing and QC'ing the output (the king overseeing the donkeys)

The Economics

| Approach | Collection | Validation | Cost at 10K entities |
| --- | --- | --- | --- |
| Python script + sleep | Sequential | Regex/schema only | $0 but dumb |
| Python script + cloud LLM | Sequential | Intelligent | $100-500 |
| DonkeyKong | Parallel | Intelligent + local | ~$0 |

Features

  • 🐴 Distributed Workers: Docker containers with range-based task assignment
  • 🦍 Local LLM QC: Ollama integration for intelligent validation (Llama, Mistral, Phi)
  • πŸ“Š Real-time Monitoring: Redis pub/sub for progress tracking
  • πŸ”„ Fault Tolerance: Automatic retry with configurable strategies
  • πŸ’Ύ Checkpointing: Resume from failures without losing progress
  • πŸ”Œ Three Interfaces: CLI, Python API, and MCP Server

Quick Start

Option 1: CLI

pip install donkeykong

# Collect URLs with quality validation
dk collect urls.txt --workers 10 --validator quality_check

# Monitor progress
dk status

# Retry failures with different strategy
dk retry --strategy aggressive

Option 2: Python API

from donkeykong import Pipeline, OllamaValidator

# Define your collector
class MyCollector(Pipeline):
    def collect(self, entity):
        # Your collection logic
        return {"data": fetch_data(entity)}
    
    def validate(self, entity, data):
        # Kong validates with local LLM
        return self.kong.validate(data, 
            prompt="Is this data complete and accurate?")

# Run distributed collection
pipeline = MyCollector(
    entities=my_entity_list,
    workers=10,
    kong=OllamaValidator(model="llama3.2")
)
pipeline.run()

Option 3: MCP Server (Claude Integration)

Add to your Claude Desktop config:

{
  "mcpServers": {
    "donkeykong": {
      "command": "dk",
      "args": ["mcp-server"]
    }
  }
}

Then talk to Claude:

"Start collecting these 1000 URLs and validate each page has pricing information"

"How's the collection going?"

"These 12 failed - retry them with a different user agent"

Installation

# Core package
pip install donkeykong

# With Ollama support (recommended)
pip install donkeykong[ollama]

# Full installation with MCP
pip install donkeykong[full]

Prerequisites

  • Docker & Docker Compose
  • Redis (included in docker-compose)
  • Ollama (optional, for Kong LLM validation)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

Example: Wikipedia Quality Collection

A complete working example that collects Wikipedia articles and uses a local LLM to assess content quality:

cd examples/wikipedia_quality
docker-compose up

See examples/wikipedia_quality/README.md for details.

Architecture Deep Dive

Why Docker?

  • Isolation: Each worker runs in its own container
  • Scalability: docker-compose up --scale worker=100
  • Reproducibility: Same environment everywhere

Why Redis?

  • Coordination: Workers claim tasks atomically
  • Real-time: Pub/sub for instant progress updates
  • Fault tolerance: Workers can restart without losing progress
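The atomic-claim idea can be sketched independently of DonkeyKong's internals: Redis `LPOP` is atomic on the server, so two workers can never pop the same task. A sketch, where `r` is any object with an `lpop` method (e.g. a redis-py `redis.Redis()` client) and the `dk:tasks` queue name is an assumption:

```python
def claim_task(r, queue: str = "dk:tasks"):
    """Atomically claim the next task; LPOP guarantees no double-claims."""
    task = r.lpop(queue)  # returns None when the queue is empty
    return task.decode() if isinstance(task, bytes) else task

def worker_loop(r, handle) -> None:
    """Each worker simply drains the shared queue until it is empty."""
    while (task := claim_task(r)) is not None:
        handle(task)
```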

Why Local LLM?

  • Cost: $0 per validation vs $0.01+ per API call
  • Speed: No rate limits, no network latency
  • Privacy: Data never leaves your infrastructure
  • Unlimited retries: Validate as many times as needed
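For a sense of what a $0 validation call looks like, here is a sketch against Ollama's local HTTP API (`POST /api/generate` with `"stream": false`); the prompt wording and response handling are assumptions, not Kong's actual implementation:

```python
import json
import urllib.request

def build_validation_prompt(data: dict, question: str) -> str:
    """Frame validation as a yes/no question about data that already exists."""
    return (
        f"Data:\n{json.dumps(data, indent=2)}\n\n"
        f"Question: {question}\n"
        "Answer YES or NO, then give one short reason."
    )

def kong_validate(data: dict, question: str, model: str = "llama3.2",
                  host: str = "http://localhost:11434") -> str:
    """One local validation pass: no API bill, no rate limit, data stays on-box."""
    body = json.dumps({
        "model": model,
        "prompt": build_validation_prompt(data, question),
        "stream": False,  # single JSON response instead of a token stream
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```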

Degraded Mode (Without Ollama)

Kong works without Ollama installed, but with reduced capability:

| Feature | With Ollama | Without Ollama |
| --- | --- | --- |
| Rule-based validation | ✅ Full | ✅ Full |
| Completeness checking | ✅ Full | ✅ Full |
| Consistency checking | ✅ Full | ✅ Full |
| Logic checking | ✅ Full | ✅ Full |
| Adversarial questions | ✅ LLM-generated + rules | ⚠️ Rules only |
| Deep semantic analysis | ✅ Yes | ❌ No |

Without Ollama, Kong still catches:

  • Missing data sources
  • High confidence with low data quality
  • Extreme scores without evidence
  • Recommendations without findings
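Those rule-only checks might look like this sketch (the field names are illustrative, not Kong's actual schema):

```python
def rule_based_flags(analysis: dict) -> list:
    """LLM-free adversarial checks: internal consistency of an analysis dict."""
    flags = []
    if analysis.get("confidence", 0) > 0.8 and analysis.get("data_quality", 1.0) < 0.5:
        flags.append("high confidence despite low data quality")
    if not analysis.get("sources"):
        flags.append("no data sources cited")
    score = analysis.get("score")
    if score is not None and (score <= 1 or score >= 9) and not analysis.get("evidence"):
        flags.append("extreme score without supporting evidence")
    if analysis.get("recommendation") and not analysis.get("findings"):
        flags.append("recommendation without findings")
    return flags
```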

With Ollama, Kong additionally:

  • Generates deeper adversarial questions
  • Performs semantic analysis of findings
  • Catches subtle logical inconsistencies
# Check if Ollama enhances validation
from donkeykong.kong import AdversarialValidator, OllamaAdversarialValidator

# Rule-based only (always works)
validator = AdversarialValidator()

# LLM-enhanced (requires Ollama running)
try:
    validator = OllamaAdversarialValidator(model="llama3.2")
except ImportError:
    print("Ollama not installed, using rule-based validation")
    validator = AdversarialValidator()

Configuration

# donkeykong.yml
workers: 10
redis_url: redis://localhost:6379

kong:
  provider: ollama
  model: llama3.2
  validation_prompt: |
    Evaluate this data for completeness and accuracy.
    Return JSON: {"valid": bool, "issues": [...], "retry": bool}

collection:
  rate_limit: 2.0  # seconds between requests per worker
  retry_attempts: 3
  checkpoint_interval: 100  # save progress every N entities

MCP Server Tools

When running as an MCP server, DonkeyKong exposes these tools to Claude:

| Tool | Description |
| --- | --- |
| donkeykong_start | Start a new collection job |
| donkeykong_status | Get current progress and stats |
| donkeykong_failures | List failed entities with reasons |
| donkeykong_retry | Retry failed entities with new strategy |
| donkeykong_validate | Manually validate a sample |
| donkeykong_stop | Gracefully stop collection |

Use Cases

DonkeyKong is ideal for any data pipeline that needs intelligent validation:

  • Web scraping with content quality checks
  • Document processing pipelines
  • Training data curation for ML
  • Knowledge graph construction
  • Research data gathering
  • API harvesting with response validation
  • ETL pipelines where "is this data good?" requires reasoning

Contributing

Contributions welcome! See CONTRIBUTING.md.

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=donkeykong --cov-report=term-missing

# Run the reproducible benchmark
cd examples/wikipedia_quality
python benchmark.py --articles 50

Benchmark Results

The Wikipedia benchmark provides verifiable metrics:

| Metric | Expected | Notes |
| --- | --- | --- |
| Collection success | 95%+ | Wikipedia API is reliable |
| Validation pass rate | 70-85% | Kong catches intentional flaws |
| Flagged for review | 15-30% | Adversarial questioning works |

License

MIT License - see LICENSE.


DonkeyKong: Because sometimes the best solution is to throw more barrels at the problem 🦍🛢️

Built with Docker, Redis, Ollama, and a healthy respect for distributed systems
