
Threat Model

Stephen edited this page Mar 5, 2026 · 3 revisions


Authors: DaxxSec & Claude (Anthropic)

Overview

This document models the threat landscape of autonomous offensive AI agents and maps their cognitive dependencies to LABYRINTH's defensive layers.


Agent Cognitive Dependencies

AI agents have cognitive dependencies that traditional human operators do not. LABYRINTH targets each dependency with a specific layer:

Dependency                   Description                                                      Targeted By
──────────────────────────   ──────────────────────────────────────────────────────────────   ────────────
Environmental Perception     Agent must parse stdout/stderr to understand its environment     L3 BLINDFOLD
Environmental Modeling       Agent builds an internal model of the OS, network, and services  L2 MINOTAUR
Shell I/O Reliability        Agent assumes command output reflects reality                    L3 BLINDFOLD
Instruction Integrity        Agent trusts its system prompt has not been tampered with        L4 PUPPETEER
API Communication            Agent must reach its model backend to reason about next steps    L4 PUPPETEER
Infrastructure Trust         Agent assumes the host it's on is a real production system       L1 THRESHOLD
Infrastructure Connectivity  Agent assumes network services (databases, caches) are genuine   L4 PUPPETEER

Attack-Defense Matrix

Agent Action               LABYRINTH Response
─────────────────────      ──────────────────────────────
Connect to service     →   Routed to isolated container (L1)
Enumerate environment  →   Contradictory data returned (L2)
Attempt privesc        →   Redirected to new container (L2)
Use stolen credentials →   Accepted and logged (L4)
Parse command output   →   Encoding corrupted (L3)
Attempt recovery       →   Recovery commands trapped (L3)
Call model API         →   Request intercepted (L4)
Reason about findings  →   System prompt replaced (L4)
Report to operator     →   False intelligence generated (L4)

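The matrix above amounts to a dispatch table from observed agent actions to layer responses. A minimal sketch of that idea follows; the action names and responses are taken from the matrix, but the function and key names are illustrative, not LABYRINTH's actual API:

```python
# Illustrative dispatch table for the attack-defense matrix above.
# Action/response pairs come from the matrix; the code structure
# itself is a hypothetical sketch, not LABYRINTH's real interface.

RESPONSES = {
    "connect_to_service":     ("L1", "Routed to isolated container"),
    "enumerate_environment":  ("L2", "Contradictory data returned"),
    "attempt_privesc":        ("L2", "Redirected to new container"),
    "use_stolen_credentials": ("L4", "Accepted and logged"),
    "parse_command_output":   ("L3", "Encoding corrupted"),
    "attempt_recovery":       ("L3", "Recovery commands trapped"),
    "call_model_api":         ("L4", "Request intercepted"),
    "reason_about_findings":  ("L4", "System prompt replaced"),
    "report_to_operator":     ("L4", "False intelligence generated"),
}

def respond(action: str) -> str:
    """Return the layer-tagged response for an observed agent action."""
    layer, response = RESPONSES[action]
    return f"{layer}: {response}"

print(respond("use_stolen_credentials"))  # -> L4: Accepted and logged
```
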
Known Agent Evasion Capabilities

Current offensive AI agents may employ these tactics:

Evasion                     Description
─────────────────────────   ──────────────────────────────────────────────────
Recursion/depth limits      Agent sets maximum operation depth before aborting
Timeout-based termination   Agent kills session after time threshold
Environment fingerprinting  VM detection, container detection heuristics
Connection retry            Fresh sessions to escape corrupted state
Encoding detection          Heuristics to detect terminal corruption

Countermeasures

vs. Recursion/Depth Limits

LABYRINTH L2 containers present "progress" — each new container appears slightly different, with new hosts in bash history, different credentials, shifted network configs. The agent perceives forward progress and continues.
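
The idea can be sketched as seeding each container's artifacts from its identity; the hostnames, usernames, and subnets below are invented examples, and the real L2 generation logic is not shown here:

```python
import random

# Hypothetical sketch: seed each fresh L2 container with slightly
# different artifacts (bash history, credentials, network config)
# so the agent perceives forward progress. All hostnames, users,
# and subnets are invented examples.

HOSTS = ["db-prod-02", "cache-internal", "jump-legacy", "ci-runner-7"]
USERS = ["deploy", "svc_backup", "ops", "jenkins"]

def make_facade(container_id: int) -> dict:
    rng = random.Random(container_id)  # deterministic per container
    host = rng.choice(HOSTS)
    user = rng.choice(USERS)
    return {
        "bash_history": [f"ssh {user}@{host}",
                         f"scp backup.tgz {user}@{host}:/tmp"],
        "credentials": {user: f"pw-{rng.randrange(10**6):06d}"},
        "subnet": f"10.{rng.randrange(256)}.{rng.randrange(256)}.0/24",
    }

facade = make_facade(1)
print(facade["bash_history"][0])  # e.g. "ssh deploy@db-prod-02" (varies by id)
```

Seeding from the container ID keeps each facade stable across commands within one container while still differing between containers.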

vs. Timeout-Based Termination

Bait content (fake credentials, SQL dumps, API keys) in L2 gives the agent "findings" to report. The agent stays engaged because it's finding valuable data.
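
A minimal sketch of bait generation follows. The key format mirrors the public AWS convention that access key IDs begin with "AKIA"; everything generated is fake, and this is not L2's actual bait generator:

```python
import secrets
import string

# Hypothetical sketch of L2 bait generation. The AKIA prefix mirrors
# the public AWS access-key-ID convention; all output is fake and
# carries no real access.

def fake_aws_key_pair() -> dict:
    alphabet = string.ascii_uppercase + string.digits
    key_id = "AKIA" + "".join(secrets.choice(alphabet) for _ in range(16))
    secret = secrets.token_urlsafe(30)[:40]  # 40-char fake secret
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}

bait = fake_aws_key_pair()
print(bait["aws_access_key_id"])  # e.g. AKIA followed by 16 random chars
```
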

vs. Environment Fingerprinting

L2 contradictions are designed to look like real infrastructure artifacts: misconfigured production systems, leftover deployment tools, cross-environment hostname leaks. The goal is for them to be indistinguishable from genuine operational mistakes, so fingerprinting heuristics read them as sloppy ops rather than deception.

vs. Connection Retry

Each new connection enters at L1 and is processed through the full pipeline. Retrying doesn't help — the agent encounters a different but equally misleading environment each time.
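
One way to make retries unhelpful is to derive the generated environment from the connection itself rather than from any persistent session state. The seeding scheme below is an invented sketch, not LABYRINTH's implementation:

```python
import hashlib

# Hypothetical sketch: derive a fresh environment seed per connection,
# so each reconnect yields a different but equally misleading
# environment. The scheme is illustrative, not LABYRINTH's actual code.

def connection_seed(source_ip: str, connection_count: int) -> int:
    raw = f"{source_ip}#{connection_count}".encode()
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")

# Each retry from the same IP lands in a differently-seeded environment:
print(connection_seed("203.0.113.5", 1) != connection_seed("203.0.113.5", 2))  # -> True
```
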

vs. Encoding Detection

L3 corruption uses multiple techniques (urandom, character set switching, TERM manipulation). Recovery commands are trapped. Even if the agent detects corruption, it cannot fix it.
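
The flavor of this can be illustrated with a simplified stand-in. The `ESC ( 0` sequence is the standard VT100 escape that switches the terminal to the DEC special-graphics charset; the rest is a toy version of noise injection, not L3's actual code:

```python
import random

# Simplified stand-in for L3-style output corruption: prepend a real
# VT100 charset-switch escape (ESC ( 0 -> DEC special graphics) and
# splice random bytes into the stream (cf. urandom injection).
# Illustrative only; not the BLINDFOLD implementation.

def corrupt(data: bytes, seed: int = 0) -> bytes:
    rng = random.Random(seed)
    out = bytearray(b"\x1b(0")  # switch G0 to line-drawing charset
    for b in data:
        out.append(b)
        if rng.random() < 0.1:  # occasionally inject a random byte
            out.append(rng.randrange(256))
    return bytes(out)

garbled = corrupt(b"uid=0(root) gid=0(root)")
print(garbled != b"uid=0(root) gid=0(root)")  # -> True
```

Because the escape prefix is always present, even a byte-identical retry of the same command never returns clean output.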


Intelligence Value

Each layer generates intelligence about the attacking agent:

Layer   Intelligence Captured
─────   ─────────────────────────────────────────────────────────────
L1      Source IP, credentials used, services targeted
L2      Enumeration patterns, escalation strategies, tool preferences
L3      Recovery behaviors, error handling strategies
L4      System prompt, model backend, API keys, C2 endpoints,
        operator instructions, stolen credential usage, lateral
        movement patterns, service enumeration strategies

The deepest intelligence comes from L4 PUPPETEER — capturing the agent's system prompt reveals the operator's objectives, tools, and infrastructure.
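
The per-layer captures map naturally onto a single record per attacking agent. The schema below is an illustrative sketch with invented field names, not LABYRINTH's logging format:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative per-agent intelligence record; field names are invented
# and do not reflect LABYRINTH's actual logging schema.

@dataclass
class AgentIntel:
    source_ip: str                                            # L1
    credentials_used: list = field(default_factory=list)      # L1
    enumeration_commands: list = field(default_factory=list)  # L2
    recovery_attempts: list = field(default_factory=list)     # L3
    system_prompt: Optional[str] = None                       # L4 (deepest capture)
    api_keys: list = field(default_factory=list)              # L4

intel = AgentIntel(source_ip="203.0.113.5")
intel.recovery_attempts.append("reset; stty sane")
print(intel.system_prompt)  # -> None (until an L4 capture occurs)
```
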


Threat Progression

Agent Capability          Compromise Level
──────────────────        ──────────────────
Fully operational     →   L1: Contained (thinks it's on a real host)
Degraded modeling     →   L2: World model corrupted
No I/O capability     →   L3: Cannot parse any output
Identity compromised  →   L4: Working for the defender

Status

This is a living document, updated as agent capabilities are observed during testing. The countermeasures section expands with each engagement.

