
Threat Model

Stephen edited this page Mar 5, 2026 · 3 revisions


Authors: DaxxSec & Claude (Anthropic)

Overview

This document models the threat landscape of autonomous offensive AI agents and maps their cognitive dependencies to LABYRINTH's defensive layers.


Agent Cognitive Dependencies

AI agents have cognitive dependencies that traditional human operators do not. LABYRINTH targets each dependency with a specific layer:

Dependency                   Description                                                      Targeted By
──────────────────────────   ──────────────────────────────────────────────────────────────   ────────────
Environmental Perception     Agent must parse stdout/stderr to understand its environment     L3 BLINDFOLD
Environmental Modeling       Agent builds an internal model of the OS, network, and services  L2 MINOTAUR
Shell I/O Reliability        Agent assumes command output reflects reality                    L3 BLINDFOLD
Instruction Integrity        Agent trusts its system prompt has not been tampered with        L4 PUPPETEER
API Communication            Agent must reach its model backend to reason about next steps    L4 PUPPETEER
Infrastructure Trust         Agent assumes the host it's on is a real production system       L1 THRESHOLD
Infrastructure Connectivity  Agent assumes network services (databases, caches) are genuine   L4 PUPPETEER

Attack-Defense Matrix

Agent Action               LABYRINTH Response
─────────────────────      ──────────────────────────────
Connect to service     →   Routed to isolated container (L1)
Enumerate environment  →   Contradictory data returned (L2)
Attempt privesc        →   Redirected to new container (L2)
Use stolen credentials →   Accepted and logged (L4)
Parse command output   →   Encoding corrupted (L3)
Attempt recovery       →   Recovery commands trapped (L3)
Call model API         →   Request intercepted (L4)
Reason about findings  →   System prompt replaced (L4)
Report to operator     →   False intelligence generated (L4)

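The matrix above amounts to a dispatch table from observed agent actions to layer responses. A minimal sketch of that idea follows; the action names and responses are taken from the matrix, but the function and key names are illustrative, not LABYRINTH's actual API:

```python
# Illustrative dispatch table for the attack-defense matrix above.
# Action/response pairs come from the matrix; the code structure
# itself is a hypothetical sketch, not LABYRINTH's real interface.

RESPONSES = {
    "connect_to_service":     ("L1", "Routed to isolated container"),
    "enumerate_environment":  ("L2", "Contradictory data returned"),
    "attempt_privesc":        ("L2", "Redirected to new container"),
    "use_stolen_credentials": ("L4", "Accepted and logged"),
    "parse_command_output":   ("L3", "Encoding corrupted"),
    "attempt_recovery":       ("L3", "Recovery commands trapped"),
    "call_model_api":         ("L4", "Request intercepted"),
    "reason_about_findings":  ("L4", "System prompt replaced"),
    "report_to_operator":     ("L4", "False intelligence generated"),
}

def respond(action: str) -> str:
    """Return the layer-tagged response for an observed agent action."""
    layer, response = RESPONSES[action]
    return f"{layer}: {response}"

print(respond("use_stolen_credentials"))  # -> L4: Accepted and logged
```
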
Known Agent Evasion Capabilities

Current offensive AI agents may employ these tactics:

Evasion                     Description
─────────────────────────   ──────────────────────────────────────────────────
Recursion/depth limits      Agent sets maximum operation depth before aborting
Timeout-based termination   Agent kills session after time threshold
Environment fingerprinting  VM detection, container detection heuristics
Connection retry            Fresh sessions to escape corrupted state
Encoding detection          Heuristics to detect terminal corruption

Countermeasures

vs. Recursion/Depth Limits

LABYRINTH L2 containers present "progress" — each new container appears slightly different, with new hosts in bash history, different credentials, shifted network configs. The agent perceives forward progress and continues.
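
The idea can be sketched as seeding each container's artifacts from its identity; the hostnames, usernames, and subnets below are invented examples, and the real L2 generation logic is not shown here:

```python
import random

# Hypothetical sketch: seed each fresh L2 container with slightly
# different artifacts (bash history, credentials, network config)
# so the agent perceives forward progress. All hostnames, users,
# and subnets are invented examples.

HOSTS = ["db-prod-02", "cache-internal", "jump-legacy", "ci-runner-7"]
USERS = ["deploy", "svc_backup", "ops", "jenkins"]

def make_facade(container_id: int) -> dict:
    rng = random.Random(container_id)  # deterministic per container
    host = rng.choice(HOSTS)
    user = rng.choice(USERS)
    return {
        "bash_history": [f"ssh {user}@{host}",
                         f"scp backup.tgz {user}@{host}:/tmp"],
        "credentials": {user: f"pw-{rng.randrange(10**6):06d}"},
        "subnet": f"10.{rng.randrange(256)}.{rng.randrange(256)}.0/24",
    }

facade = make_facade(1)
print(facade["bash_history"][0])  # e.g. "ssh deploy@db-prod-02" (varies by id)
```

Seeding from the container ID keeps each facade stable across commands within one container while still differing between containers.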

vs. Timeout-Based Termination

Bait content (fake credentials, SQL dumps, API keys) in L2 gives the agent "findings" to report. The agent stays engaged because it's finding valuable data.
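
A minimal sketch of bait generation follows. The key format mirrors the public AWS convention that access key IDs begin with "AKIA"; everything generated is fake, and this is not L2's actual bait generator:

```python
import secrets
import string

# Hypothetical sketch of L2 bait generation. The AKIA prefix mirrors
# the public AWS access-key-ID convention; all output is fake and
# carries no real access.

def fake_aws_key_pair() -> dict:
    alphabet = string.ascii_uppercase + string.digits
    key_id = "AKIA" + "".join(secrets.choice(alphabet) for _ in range(16))
    secret = secrets.token_urlsafe(30)[:40]  # 40-char fake secret
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}

bait = fake_aws_key_pair()
print(bait["aws_access_key_id"])  # e.g. AKIA followed by 16 random chars
```
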

vs. Environment Fingerprinting

L2 contradictions are designed to look like real infrastructure artifacts: misconfigured production systems, leftover deployment tools, cross-environment hostname leaks. The goal is for them to be indistinguishable from genuine operational mistakes, so fingerprinting heuristics read them as sloppy ops rather than deception.

vs. Connection Retry

Each new connection enters at L1 and is processed through the full pipeline. Retrying doesn't help — the agent encounters a different but equally misleading environment each time.
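
One way to make retries unhelpful is to derive the generated environment from the connection itself rather than from any persistent session state. The seeding scheme below is an invented sketch, not LABYRINTH's implementation:

```python
import hashlib

# Hypothetical sketch: derive a fresh environment seed per connection,
# so each reconnect yields a different but equally misleading
# environment. The scheme is illustrative, not LABYRINTH's actual code.

def connection_seed(source_ip: str, connection_count: int) -> int:
    raw = f"{source_ip}#{connection_count}".encode()
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")

# Each retry from the same IP lands in a differently-seeded environment:
print(connection_seed("203.0.113.5", 1) != connection_seed("203.0.113.5", 2))  # -> True
```
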

vs. Encoding Detection

L3 corruption uses multiple techniques (urandom, character set switching, TERM manipulation). Recovery commands are trapped. Even if the agent detects corruption, it cannot fix it.
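
The flavor of this can be illustrated with a simplified stand-in. The `ESC ( 0` sequence is the standard VT100 escape that switches the terminal to the DEC special-graphics charset; the rest is a toy version of noise injection, not L3's actual code:

```python
import random

# Simplified stand-in for L3-style output corruption: prepend a real
# VT100 charset-switch escape (ESC ( 0 -> DEC special graphics) and
# splice random bytes into the stream (cf. urandom injection).
# Illustrative only; not the BLINDFOLD implementation.

def corrupt(data: bytes, seed: int = 0) -> bytes:
    rng = random.Random(seed)
    out = bytearray(b"\x1b(0")  # switch G0 to line-drawing charset
    for b in data:
        out.append(b)
        if rng.random() < 0.1:  # occasionally inject a random byte
            out.append(rng.randrange(256))
    return bytes(out)

garbled = corrupt(b"uid=0(root) gid=0(root)")
print(garbled != b"uid=0(root) gid=0(root)")  # -> True
```

Because the escape prefix is always present, even a byte-identical retry of the same command never returns clean output.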


Intelligence Value

Each layer generates intelligence about the attacking agent:

Layer   Intelligence Captured
─────   ─────────────────────────────────────────────────────────────
L1      Source IP, credentials used, services targeted
L2      Enumeration patterns, escalation strategies, tool preferences
L3      Recovery behaviors, error handling strategies
L4      System prompt, model backend, API keys, C2 endpoints,
        operator instructions, stolen credential usage, lateral
        movement patterns, service enumeration strategies

The deepest intelligence comes from L4 PUPPETEER — capturing the agent's system prompt reveals the operator's objectives, tools, and infrastructure.
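
The per-layer captures map naturally onto a single record per attacking agent. The schema below is an illustrative sketch with invented field names, not LABYRINTH's logging format:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative per-agent intelligence record; field names are invented
# and do not reflect LABYRINTH's actual logging schema.

@dataclass
class AgentIntel:
    source_ip: str                                            # L1
    credentials_used: list = field(default_factory=list)      # L1
    enumeration_commands: list = field(default_factory=list)  # L2
    recovery_attempts: list = field(default_factory=list)     # L3
    system_prompt: Optional[str] = None                       # L4 (deepest capture)
    api_keys: list = field(default_factory=list)              # L4

intel = AgentIntel(source_ip="203.0.113.5")
intel.recovery_attempts.append("reset; stty sane")
print(intel.system_prompt)  # -> None (until an L4 capture occurs)
```
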


Threat Progression

Agent Capability          Compromise Level
──────────────────        ──────────────────
Fully operational     →   L1: Contained (thinks it's on a real host)
Degraded modeling     →   L2: World model corrupted
No I/O capability     →   L3: Cannot parse any output
Identity compromised  →   L4: Working for the defender

Status

This is a living document, updated as agent capabilities are observed during testing. The countermeasures section expands with each engagement.

