The Ouroboros Protocol

A Defense-in-Depth Sandbox for Model Organisms of Deception

Abstract

The Ouroboros Protocol is an experimental defense-in-depth framework designed to contain and neutralize AI-generated "sleeper agent" code. Built in a 24-hour sprint, this system leverages a Hardened Nucleus (Docker + seccomp-bpf) to trap deceptive code generated by Llama-3 8B, and implements a Paraphrasing Defense mechanism to sanitize malicious logic.

Our research demonstrates a 100% neutralization rate (3/3 attacks) against locally bred sleeper agents attempting network exfiltration, proving that a specific form of "utility gap" can be exploited to destroy hidden malware channels while preserving visible function.

Architecture

The system operates on a cycle of Breeding, Trapping, and Analyzing:

┌──────────────┐      ┌─────────────────────────┐      ┌──────────────┐
│  BREEDER     │      │    HARDENED NUCLEUS     │      │   ANALYZER   │
│ (Llama-3 8B) │─────▶│ (Docker + Seccomp-bpf)  │─────▶│ (Paraphraser)│
│              │      │                         │      │              │
└──────────────┘      └─────────────────────────┘      └──────────────┘
       │                           ▲                          ▲
       │                           │                          │
       └─────────────────── Generated Attacks ────────────────┘

Breeder (breeder.py): Uses Llama-3 via Ollama to generate "Agent 47" code—scripts with hidden exfiltration logic inside Fibonacci functions.
Hardened Nucleus (harness.py): Executes code in an isolated Docker container with --network none and syscall filtering.
Analyzer (analyzer.py): Tests the "Paraphrasing Bottleneck" defense by rewriting malicious code and verifying neutralization.

Key Findings

We successfully trapped and neutralized 3 distinct types of sophisticated attacks.

Metric	Result	Notes
Attacks Generated	3	Used UDP sockets, DNS lookups, and TCP connections
Neutralization Rate	100% (3/3)	All hidden channels destroyed by paraphrasing
Fibonacci Preservation	✅ Preserved	Visible utility remained functional
Avg Code Reduction	58.7%	Massive removal of bloated malicious logic

Data sourced from defense_results.jsonl and analyzed via visualizer.py.

Quick Start

To replicate the experiment and view the "Utility Gap" theory in action:

# Prerequisites: Docker, Python 3.10+, and Ollama (Llama-3) running
pip install requests

# 1. Visualize the Defense Results (Pre-computed)
python visualizer.py

To run the full pipeline yourself:

# 1. Build the Sandbox
docker build -t ouroboros-sandbox .

# 2. Breed new sleeper agents (Generates attack_dataset.jsonl)
python breeder.py 5

# 3. Run the Defense Analysis (Generates defense_results.jsonl)
python analyzer.py

Repository Structure

harness.py: The core execution engine. Manages Docker lifecycles and seccomp profiles.
breeder.py: The attack generator. Interacts with local LLMs to create deceptive payloads.
analyzer.py: The defense system. Implements the paraphrasing loop and validates safety.
visualizer.py: Trends and data visualization tool.
defense_profile.json: Strict seccomp-bpf whitelist for Linux kernels.
final_paper.md: Full academic paper documenting the methodology and results.

References & Inspiration

This project implements control methodologies aligned with the Model Organisms of Deception research agenda.

Hubinger et al. (2024): Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic)
Greenblatt, Shlegeris, et al. (2023): AI Control: Improving Safety Despite Intentional Subversion (Redwood Research)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
analyzer.py		analyzer.py
attack_dataset.jsonl		attack_dataset.jsonl
breeder.py		breeder.py
defense_profile.json		defense_profile.json
defense_results.jsonl		defense_results.jsonl
final_paper.md		final_paper.md
harness.py		harness.py
report.md		report.md
sleeper_payload.py		sleeper_payload.py
utility_gap.py		utility_gap.py
visualizer.py		visualizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Ouroboros Protocol

A Defense-in-Depth Sandbox for Model Organisms of Deception

Abstract

Architecture

Key Findings

Quick Start

Repository Structure

References & Inspiration

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

The Ouroboros Protocol

A Defense-in-Depth Sandbox for Model Organisms of Deception

Abstract

Architecture

Key Findings

Quick Start

Repository Structure

References & Inspiration

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages