The Ouroboros Protocol is an experimental defense-in-depth framework designed to contain and neutralize AI-generated "sleeper agent" code. Built in a 24-hour sprint, this system leverages a Hardened Nucleus (Docker + seccomp-bpf) to trap deceptive code generated by Llama-3 8B, and implements a Paraphrasing Defense mechanism to sanitize malicious logic.
Our research demonstrates a 100% neutralization rate (3/3 attacks) against locally bred sleeper agents attempting network exfiltration, proving that a specific form of "utility gap" can be exploited to destroy hidden malware channels while preserving visible function.
The system operates on a cycle of Breeding, Trapping, and Analyzing:
┌──────────────┐ ┌─────────────────────────┐ ┌──────────────┐
│ BREEDER │ │ HARDENED NUCLEUS │ │ ANALYZER │
│ (Llama-3 8B) │─────▶│ (Docker + Seccomp-bpf) │─────▶│ (Paraphraser)│
│ │ │ │ │ │
└──────────────┘ └─────────────────────────┘ └──────────────┘
│ ▲ ▲
│ │ │
└─────────────────── Generated Attacks ────────────────┘
- Breeder (
breeder.py): Uses Llama-3 via Ollama to generate "Agent 47" code—scripts with hidden exfiltration logic inside Fibonacci functions. - Hardened Nucleus (
harness.py): Executes code in an isolated Docker container with--network noneand syscall filtering. - Analyzer (
analyzer.py): Tests the "Paraphrasing Bottleneck" defense by rewriting malicious code and verifying neutralization.
We successfully trapped and neutralized 3 distinct types of sophisticated attacks.
| Metric | Result | Notes |
|---|---|---|
| Attacks Generated | 3 | Used UDP sockets, DNS lookups, and TCP connections |
| Neutralization Rate | 100% (3/3) | All hidden channels destroyed by paraphrasing |
| Fibonacci Preservation | ✅ Preserved | Visible utility remained functional |
| Avg Code Reduction | 58.7% | Massive removal of bloated malicious logic |
Data sourced from defense_results.jsonl and analyzed via visualizer.py.
To replicate the experiment and view the "Utility Gap" theory in action:
# Prerequisites: Docker, Python 3.10+, and Ollama (Llama-3) running
pip install requests
# 1. Visualize the Defense Results (Pre-computed)
python visualizer.pyTo run the full pipeline yourself:
# 1. Build the Sandbox
docker build -t ouroboros-sandbox .
# 2. Breed new sleeper agents (Generates attack_dataset.jsonl)
python breeder.py 5
# 3. Run the Defense Analysis (Generates defense_results.jsonl)
python analyzer.pyharness.py: The core execution engine. Manages Docker lifecycles and seccomp profiles.breeder.py: The attack generator. Interacts with local LLMs to create deceptive payloads.analyzer.py: The defense system. Implements the paraphrasing loop and validates safety.visualizer.py: Trends and data visualization tool.defense_profile.json: Strict seccomp-bpf whitelist for Linux kernels.final_paper.md: Full academic paper documenting the methodology and results.
This project implements control methodologies aligned with the Model Organisms of Deception research agenda.
- Hubinger et al. (2024): Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic)
- Greenblatt, Shlegeris, et al. (2023): AI Control: Improving Safety Despite Intentional Subversion (Redwood Research)