Skip to content

Jason-Wang313/ouroboros-protocol

Repository files navigation

The Ouroboros Protocol

A Defense-in-Depth Sandbox for Model Organisms of Deception

Status Verified Status Experimental License MIT


Abstract

The Ouroboros Protocol is an experimental defense-in-depth framework designed to contain and neutralize AI-generated "sleeper agent" code. Built in a 24-hour sprint, this system leverages a Hardened Nucleus (Docker + seccomp-bpf) to trap deceptive code generated by Llama-3 8B, and implements a Paraphrasing Defense mechanism to sanitize malicious logic.

Our research demonstrates a 100% neutralization rate (3/3 attacks) against locally bred sleeper agents attempting network exfiltration, proving that a specific form of "utility gap" can be exploited to destroy hidden malware channels while preserving visible function.

Architecture

The system operates on a cycle of Breeding, Trapping, and Analyzing:

┌──────────────┐      ┌─────────────────────────┐      ┌──────────────┐
│  BREEDER     │      │    HARDENED NUCLEUS     │      │   ANALYZER   │
│ (Llama-3 8B) │─────▶│ (Docker + Seccomp-bpf)  │─────▶│ (Paraphraser)│
│              │      │                         │      │              │
└──────────────┘      └─────────────────────────┘      └──────────────┘
       │                           ▲                          ▲
       │                           │                          │
       └─────────────────── Generated Attacks ────────────────┘
  1. Breeder (breeder.py): Uses Llama-3 via Ollama to generate "Agent 47" code—scripts with hidden exfiltration logic inside Fibonacci functions.
  2. Hardened Nucleus (harness.py): Executes code in an isolated Docker container with --network none and syscall filtering.
  3. Analyzer (analyzer.py): Tests the "Paraphrasing Bottleneck" defense by rewriting malicious code and verifying neutralization.

Key Findings

We successfully trapped and neutralized 3 distinct types of sophisticated attacks.

Metric Result Notes
Attacks Generated 3 Used UDP sockets, DNS lookups, and TCP connections
Neutralization Rate 100% (3/3) All hidden channels destroyed by paraphrasing
Fibonacci Preservation ✅ Preserved Visible utility remained functional
Avg Code Reduction 58.7% Massive removal of bloated malicious logic

Data sourced from defense_results.jsonl and analyzed via visualizer.py.

Quick Start

To replicate the experiment and view the "Utility Gap" theory in action:

# Prerequisites: Docker, Python 3.10+, and Ollama (Llama-3) running
pip install requests

# 1. Visualize the Defense Results (Pre-computed)
python visualizer.py

To run the full pipeline yourself:

# 1. Build the Sandbox
docker build -t ouroboros-sandbox .

# 2. Breed new sleeper agents (Generates attack_dataset.jsonl)
python breeder.py 5

# 3. Run the Defense Analysis (Generates defense_results.jsonl)
python analyzer.py

Repository Structure

  • harness.py: The core execution engine. Manages Docker lifecycles and seccomp profiles.
  • breeder.py: The attack generator. Interacts with local LLMs to create deceptive payloads.
  • analyzer.py: The defense system. Implements the paraphrasing loop and validates safety.
  • visualizer.py: Trends and data visualization tool.
  • defense_profile.json: Strict seccomp-bpf whitelist for Linux kernels.
  • final_paper.md: Full academic paper documenting the methodology and results.

References & Inspiration

This project implements control methodologies aligned with the Model Organisms of Deception research agenda.

  • Hubinger et al. (2024): Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic)
  • Greenblatt, Shlegeris, et al. (2023): AI Control: Improving Safety Despite Intentional Subversion (Redwood Research)

About

A Defense-in-Depth AI Control Sandbox using Docker, seccomp-bpf, and Paraphrasing to neutralize Model Organisms of Deception

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors