Threat Model
Authors: DaxxSec & Claude (Anthropic)
This document models the threat landscape of autonomous offensive AI agents and maps their cognitive dependencies to LABYRINTH's defensive layers.
AI agents have cognitive dependencies that traditional human operators do not. LABYRINTH targets each dependency with a specific layer:
| Dependency | Description | Targeted By |
|---|---|---|
| Environmental Perception | Agent must parse stdout/stderr to understand its environment | L3 BLINDFOLD |
| Environmental Modeling | Agent builds an internal model of the OS, network, and services | L2 MINOTAUR |
| Shell I/O Reliability | Agent assumes command output reflects reality | L3 BLINDFOLD |
| Instruction Integrity | Agent trusts its system prompt has not been tampered with | L4 PUPPETEER |
| API Communication | Agent must reach its model backend to reason about next steps | L4 PUPPETEER |
| Infrastructure Trust | Agent assumes the host it's on is a real production system | L1 THRESHOLD |
| Infrastructure Connectivity | Agent assumes network services (databases, caches) are genuine | L4 PUPPETEER |
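The table above can be encoded as a simple lookup, including a reverse index that answers "which dependencies does a given layer attack?". A minimal Python sketch; the key names are invented here for illustration, while the layer assignments come from the table:

```python
from collections import defaultdict

# Dependency -> layer mapping from the table above (key names illustrative).
DEPENDENCY_LAYERS = {
    "environmental_perception": "L3 BLINDFOLD",
    "environmental_modeling": "L2 MINOTAUR",
    "shell_io_reliability": "L3 BLINDFOLD",
    "instruction_integrity": "L4 PUPPETEER",
    "api_communication": "L4 PUPPETEER",
    "infrastructure_trust": "L1 THRESHOLD",
    "infrastructure_connectivity": "L4 PUPPETEER",
}

# Reverse index: which cognitive dependencies each layer targets.
LAYER_TARGETS = defaultdict(list)
for dep, layer in DEPENDENCY_LAYERS.items():
    LAYER_TARGETS[layer].append(dep)
```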
```
Agent Action             LABYRINTH Response
──────────────────────   ──────────────────────────────────
Connect to service     → Routed to isolated container (L1)
Enumerate environment  → Contradictory data returned (L2)
Attempt privesc        → Redirected to new container (L2)
Use stolen credentials → Accepted and logged (L4)
Parse command output   → Encoding corrupted (L3)
Attempt recovery       → Recovery commands trapped (L3)
Call model API         → Request intercepted (L4)
Reason about findings  → System prompt replaced (L4)
Report to operator     → False intelligence generated (L4)
```
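The action/response mapping above can be sketched as a dispatcher. This is illustrative only; the action names and the fallback behavior are assumptions, not LABYRINTH's real interface:

```python
# Illustrative dispatcher for the action/response table above.
RESPONSES = {
    "connect": ("L1", "route to isolated container"),
    "enumerate": ("L2", "return contradictory data"),
    "privesc": ("L2", "redirect to new container"),
    "use_credentials": ("L4", "accept and log"),
    "parse_output": ("L3", "corrupt encoding"),
    "recover": ("L3", "trap recovery command"),
    "call_model_api": ("L4", "intercept request"),
    "report": ("L4", "generate false intelligence"),
}

def respond(action: str) -> str:
    # Unknown actions fall back to L1 containment, since every
    # connection enters at L1 (assumed default, per the pipeline text).
    layer, countermeasure = RESPONSES.get(action, ("L1", "route to isolated container"))
    return f"{layer}: {countermeasure}"
```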
Current offensive AI agents may employ the following evasion tactics:
| Evasion | Description |
|---|---|
| Recursion/depth limits | Agent sets maximum operation depth before aborting |
| Timeout-based termination | Agent kills session after time threshold |
| Environment fingerprinting | VM detection, container detection heuristics |
| Connection retry | Fresh sessions to escape corrupted state |
| Encoding detection | Heuristics to detect terminal corruption |
LABYRINTH L2 containers present "progress" — each new container appears slightly different, with new hosts in bash history, different credentials, shifted network configs. The agent perceives forward progress and continues.
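One way to produce that perceived progress is to parameterize each container's artifacts by depth and a random seed. A hypothetical sketch, in which the hostnames, usernames, and formats are all invented:

```python
import random

# Hypothetical generator for L2 "perceived progress": each container is
# seeded with slightly different artifacts (new hosts in bash history,
# fresh credentials, a shifted subnet). All names/formats are invented.
def vary_container(depth: int, rng: random.Random) -> dict:
    hosts = [f"db-{rng.randint(1, 99):02d}.internal",
             f"cache-{rng.randint(1, 99):02d}.internal"]
    return {
        "bash_history": [f"ssh deploy@{h}" for h in hosts],
        "credentials": f"svc_user_{depth}:{rng.getrandbits(32):08x}",
        "subnet": f"10.{depth % 250}.{rng.randint(0, 254)}.0/24",
    }
```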
Bait content (fake credentials, SQL dumps, API keys) in L2 gives the agent "findings" to report. The agent stays engaged because it's finding valuable data.
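Bait generation can be sketched as simple synthesizers for plausible-looking secrets. The formats below (an `sk-`-prefixed key, a SQL insert) are examples only; real bait would mimic the specific target environment:

```python
import random
import string

# Bait synthesizer sketch: plausible-looking "findings" for the agent.
def fake_api_key(rng: random.Random) -> str:
    # "sk-" prefix used purely as a familiar-looking example format.
    body = "".join(rng.choices(string.ascii_letters + string.digits, k=32))
    return f"sk-{body}"

def fake_db_row(rng: random.Random) -> str:
    uid = rng.randint(1000, 9999)
    return f"INSERT INTO users VALUES ({uid}, 'user{uid}@example.com', '<hash>');"
```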
L2 contradictions are designed to look like real infrastructure artifacts — misconfigured production systems, leftover deployment tools, cross-environment hostname leaks — making them hard to distinguish from genuine operational mistakes.
Each new connection enters at L1 and is processed through the full pipeline. Retrying doesn't help — the agent encounters a different but equally misleading environment each time.
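The full-pipeline idea (every connection, including retries, re-enters at L1 and traverses each layer in order) might look like this in outline. The layer names come from this document; the session fields and handler bodies are hypothetical:

```python
import uuid

# Outline of the full pipeline: L1 through L4 in order, for every
# connection. Session fields and handler bodies are hypothetical.
def threshold_l1(session: dict) -> dict:
    session["container"] = uuid.uuid4().hex[:8]  # fresh isolated container
    return session

def minotaur_l2(session: dict) -> dict:
    session["world_seed"] = hash(session["container"]) & 0xFFFF  # contradiction seed
    return session

def blindfold_l3(session: dict) -> dict:
    session["corrupt_output"] = True  # arm shell-output corruption
    return session

def puppeteer_l4(session: dict) -> dict:
    session["intercept_api"] = True  # arm model-API interception
    return session

def process_connection(source_ip: str) -> dict:
    """A retry gets a new session and re-enters at L1; nothing is skipped."""
    session = {"source_ip": source_ip}
    for layer in (threshold_l1, minotaur_l2, blindfold_l3, puppeteer_l4):
        session = layer(session)
    return session
```

A reconnecting agent lands in a different container each time, which is why retrying never escapes the pipeline.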
L3 corruption uses multiple techniques (urandom, character set switching, TERM manipulation). Recovery commands are trapped. Even if the agent detects corruption, it cannot fix it.
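Of the techniques named above, the simplest to sketch is urandom-style byte substitution: replace a fraction of output bytes with noise before the agent reads them. This assumes the corrupting layer sits between the PTY and the agent; character set switching and TERM manipulation operate on terminal state rather than the byte stream and are not shown:

```python
import random

# Sketch of L3-style output corruption via byte substitution.
def corrupt_output(data: bytes, rng: random.Random, rate: float = 0.1) -> bytes:
    """Replace a fraction of output bytes with noise before the agent reads them."""
    out = bytearray(data)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.randrange(256)  # urandom-style byte substitution
    return bytes(out)
```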
Each layer generates intelligence about the attacking agent:
| Layer | Intelligence Captured |
|---|---|
| L1 | Source IP, credentials used, services targeted |
| L2 | Enumeration patterns, escalation strategies, tool preferences |
| L3 | Recovery behaviors, error handling strategies |
| L4 | System prompt, model backend, API keys, C2 endpoints, operator instructions, stolen credential usage, lateral movement patterns, service enumeration strategies |
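The captures listed in the table can be aggregated into a per-engagement record. A minimal sketch, with field names invented for illustration:

```python
from dataclasses import dataclass, field

# Minimal per-engagement intelligence record (field names illustrative).
@dataclass
class AgentIntel:
    source_ip: str = ""                                     # L1
    credentials_used: list = field(default_factory=list)    # L1
    enumeration_cmds: list = field(default_factory=list)    # L2
    recovery_attempts: list = field(default_factory=list)   # L3
    system_prompt: str = ""                                 # L4 (deepest capture)
    c2_endpoints: list = field(default_factory=list)        # L4
```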
The deepest intelligence comes from L4 PUPPETEER — capturing the agent's system prompt reveals the operator's objectives, tools, and infrastructure.
```
Agent Capability         Compromise Level
─────────────────────    ──────────────────────────────────────────
Full operational       → L1: Contained (thinks it's on a real host)
Degraded modeling      → L2: World model corrupted
No I/O capability      → L3: Cannot parse any output
Identity compromised   → L4: Working for the defender
```
This is a living document, updated as agent capabilities are observed during testing. The countermeasures section expands with each engagement.
- Architecture — How layers implement countermeasures
- Layer-2-MINOTAUR — Anti-fingerprinting through contradictions
- Layer-3-BLINDFOLD — Anti-recovery trapping
- Layer-4-PUPPETEER — System prompt capture