This started as a narrative simulation engine. The greedy extraction step kept failing in ways that looked random but weren't. Investigating why led to a formal theory of when and how sequential systems break under endogenous constraints — constraints whose structure depends on the solution itself — and the discovery that LLMs exhibit the same failure mode under context compression.
We call this failure mode the validity mirage: the output scores high on fluency, coherence, and format compliance while silently substituting the specific facts that determine whether the answer is actually correct. The answer looks valid but its semantic pivot has shifted.
The four papers in this repo trace a single thread from engineering observation to formal theory to empirical validation:
| # | Paper | What it does |
|---|---|---|
| 0 | NarrativeField: Continuous Control & Structural Regularization | Documents the simulation engine that started this work — a deterministic multi-agent world (six characters, secrets, conflicting goals) with grammar-constrained story extraction. Across 3,250+ runs and 50 seeds (98% extraction validity), a systematic quality-validity tradeoff revealed that extraction failures were structural, not random. |
| 1 | Absorbing States in Greedy Search | Formalizes the extraction failures. When a turning point is defined by the data itself (endogenous), greedy search can lock into absorbing states where no local improvement can reach a valid solution. Standard greedoid theory assumes exogenous constraints and misses this. |
| 2 | Streaming Oscillation Traps | Extends the theory to streaming settings. Under incremental arrival, endogenous pivots create oscillation traps — the system cycles between candidate solutions without converging. |
| 3 | The Validity Mirage | Connects the theory to LLMs. Context compression is a form of lossy sequential processing with endogenous structure: the model's attention pattern determines which tokens matter, but which tokens matter depends on what the model attends to. The mirage is the empirical consequence. |
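The absorbing-state mechanism of Paper 1 fits in a few lines. The sketch below uses hypothetical toy data (not from the papers): the constraint that a valid solution must contain the data-defined pivot is endogenous, and density-greedy selection locks itself out of every valid solution.

```python
from itertools import combinations

# Hypothetical items: (name, score, cost). The pivot is defined by the
# data itself (endogenous): it is the single highest-score item, and a
# valid solution must contain it.
items = [
    ("a", 5, 2), ("b", 5, 2), ("c", 5, 2),  # high score-per-cost filler
    ("pivot", 9, 5),                        # the endogenous turning point
]
BUDGET = 6

def pivot_of(pool):
    return max(pool, key=lambda it: it[1])

def valid(solution):
    cost = sum(c for _, _, c in solution)
    return pivot_of(items) in solution and cost <= BUDGET

# Greedy by score density fills the budget with filler first.
greedy = []
for it in sorted(items, key=lambda it: it[1] / it[2], reverse=True):
    if sum(c for _, _, c in greedy) + it[2] <= BUDGET:
        greedy.append(it)

print(valid(greedy))  # False: the budget is spent and the pivot no longer fits

# Valid solutions do exist; greedy is simply in an absorbing state, since
# no single add, remove, or swap from `greedy` reaches one.
feasible = [list(c) for r in range(len(items) + 1)
            for c in combinations(items, r) if valid(list(c))]
print(len(feasible))  # 1: the pivot alone is the only valid solution
```

Note that escaping the trap requires removing all three filler items at once, which no local-improvement step will do.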
The practical consequence: standard evaluation pipelines — fluency, coherence, format compliance — can certify outputs as correct when they aren't. The failure is invisible to every metric except one that checks whether the specific fact the answer hinges on actually survived.
Across five instruction-tuned models, raw validity scores remain above 0.83 while pivot preservation drops as low as 0.42. The gap is the mirage.
Models tested: Gemma-2 9B, Llama-3.1 8B, Mistral 7B v0.3, Phi-3-Medium 14B, Qwen-2.5 14B. All bf16, greedy decoding, MirageBench 12-task set at compression levels 0.4/0.5/0.6.
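The gap is easy to operationalize. A minimal sketch of the evaluation distinction (hypothetical helper names and example strings, not the MirageBench scorer):

```python
def surface_valid(answer: str) -> bool:
    # Stand-in for the usual fluency/format checks: a well-formed sentence
    # of reasonable length passes.
    return answer.strip().endswith(".") and len(answer.split()) >= 5

def pivot_preserved(answer: str, pivot_fact: str) -> bool:
    # The one check standard pipelines skip: did the load-bearing fact survive?
    return pivot_fact.lower() in answer.lower()

pivot = "valve V-12 was left open"
good = "The incident occurred because valve V-12 was left open overnight."
mirage = "The incident occurred because the relief valve was miscalibrated."

print(surface_valid(good), pivot_preserved(good, pivot))      # True True
print(surface_valid(mirage), pivot_preserved(mirage, pivot))  # True False
```

Both answers clear the surface bar; only the pivot check catches the silent substitution in the second.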
The mirage also appears at the representation level. When KV-cache entries are evicted (retaining 70% down to 10% of keys), pivot preservation drops to 8.3% at 10% retention — even though all prerequisite information remains present in the input text. This isolates the failure to internal attention, not input truncation.
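A toy numpy sketch of the mechanism (not the experiment's code): evicting cache entries by a simple key-norm heuristic changes the attention readout even though every input token is still present upstream.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8                      # toy sequence length and head dimension
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
query = rng.normal(size=d)

def attend(q, K, V, keep_mask):
    # Softmax attention restricted to retained KV-cache entries.
    scores = K @ q / np.sqrt(d)
    scores[~keep_mask] = -np.inf  # evicted entries cannot be attended to
    w = np.exp(scores - scores[keep_mask].max())
    w[~keep_mask] = 0.0
    return (w / w.sum()) @ V

# Retain the top 30% of entries by key norm (a stand-in eviction heuristic).
retention = 0.3
k = int(T * retention)
keep = np.zeros(T, dtype=bool)
keep[np.argsort(np.linalg.norm(keys, axis=1))[-k:]] = True

full = attend(query, keys, values, np.ones(T, dtype=bool))
evicted = attend(query, keys, values, keep)
print(np.linalg.norm(full - evicted))  # nonzero drift: the readout changed
```

If the pivot token's entry is among those evicted, no amount of intact input text restores its contribution to the readout, which is the failure the retention curve measures.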
To test whether the mirage appears on real causal structures (not just synthetic benchmarks), we built a compression benchmark from NTSB aviation incident reports. Across 180 naive-compression trials (12 incidents × 5 seeds × 3 budgets), root-cause attribution shifts in 57% of cases (103/180). Of the 164 trials where compression actually degraded the output, 22% are silent mirages (36/164) — the model confidently names the wrong cause with no indication of uncertainty. A contract-guarded compression method (which preserves the endogenous pivot structure) eliminates attribution shift entirely across all budgets.
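The contract-guarded idea can be sketched in a few lines (illustrative sentences and salience scores, not the NTSB pipeline): naive top-k compression drops a mid-salience causal sentence, while pinning the pivot as a contract preserves it at the same budget.

```python
# Hypothetical scored sentences from an incident report; scores are illustrative.
sentences = [
    ("The aircraft departed at 09:12.", 0.9),
    ("Weather was clear with light winds.", 0.8),
    ("The fuel selector was set to an empty tank.", 0.5),  # the causal pivot
    ("The pilot held a valid certificate.", 0.7),
]
PIVOT = sentences[2][0]
BUDGET = 2  # keep two sentences

def naive_compress(sents, budget):
    # Keep the top-scoring sentences, blind to causal structure.
    return [s for s, _ in sorted(sents, key=lambda x: -x[1])[:budget]]

def guarded_compress(sents, budget, contract):
    # Contract-guarded: pin the pivot sentences first, then fill by score.
    kept = [s for s, _ in sents if s in contract]
    rest = sorted((x for x in sents if x[0] not in contract), key=lambda x: -x[1])
    kept += [s for s, _ in rest[: budget - len(kept)]]
    return kept

print(PIVOT in naive_compress(sentences, BUDGET))             # False
print(PIVOT in guarded_compress(sentences, BUDGET, {PIVOT}))  # True
```

The naive compressor produces a fluent summary with the root cause missing, which is exactly the input that elicits a confident wrong attribution downstream.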
A LoRA adapter (3.2M parameters, ~0.12% of the base model), trained on synthetic mirage examples, effectively eliminates the failure mode: the silent mirage rate drops from 59.0% to 0.27%.
Provenance note:

- Canonical package for the table below is `mirage_aware_package.tar.gz` at repo root (`mirage_aware_adapter_balanced/adapter_config.json`: base `Qwen/Qwen2.5-7B-Instruct`, `r=8`).
- Canonical balanced package training config is `num_train_epochs=1`, `per_device_train_batch_size=2`, `gradient_accumulation_steps=4`, `global_step=250` (about 2,000 train examples), not a 3-epoch run.
- This package's eval slice is 400 examples (371 degraded, 29 strong); FT silent mirage is `1/371 = 0.27%` on degraded rows.
- The MLX/Gemma adapter in `endogenous_context_theory/release/adapters/mirage_aware_v1/` is a separate run lineage.
| Metric (Qwen 2.5 7B, balanced eval slice, n=400) | Base | + Mirage-aware LoRA |
|---|---|---|
| Pivot accuracy (degraded inputs) | 41.0% | 99.2% |
| Silent mirage rate | 59.0% | 0.27% |
| Degradation flagging rate | 0% | 95.4% |
| False alarm rate (clean inputs) | 0% | 0% |
The adapter learns to both identify the correct pivot under compression and explicitly flag when context degradation may have affected its answer.
Canonical Qwen package artifact: mirage_aware_package.tar.gz (extracts mirage_aware_adapter_balanced/).
Separate MLX adapter artifact: endogenous_context_theory/release/adapters/mirage_aware_v1/.
For full provenance mapping, see docs/mirage-source-of-truth.md.
| Directory | Contents |
|---|---|
| `papers/` | Four papers (PDFs) and canonical LaTeX sources (`papers/sources/`) |
| `projects/lorien/` | NarrativeField, the narrative simulation engine where this started |
| `projects/rhun/` | Rhun, the domain-agnostic greedy extraction failure framework |
| `endogenous_context_theory/src/` | Tropical semiring algebra, compression, pivot-margin code |
| `endogenous_context_theory/tests/` | 18 synthetic validation experiments |
| `endogenous_context_theory/release/` | MirageBench tasks, notebooks, result CSVs, figures, LoRA adapter |
| `endogenous_context_theory/results/ntsb/` | Real-incident NTSB benchmark (external validation) |
```bash
# Setup
cd endogenous_context_theory
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run all 18 synthetic validation experiments
python scripts/run_all.py

# Rebuild release figures and summary tables
python scripts/build_release_assets.py
```

The blackbox and KV-cache experiments require GPU access. Open the notebooks in `release/notebooks/` on Colab or a local GPU machine:

- `miragebench_blackbox_bf16_5models_colab.ipynb` reproduces the 5-model sweep
- `kv_cache_eviction_mirage_colab.ipynb` reproduces the KV retention curve
To load the mirage-aware adapter:
```bash
tar -xzf mirage_aware_package.tar.gz
# Adapter path after extract: mirage_aware_adapter_balanced/
# Base model: Qwen/Qwen2.5-7B-Instruct
```

See `endogenous_context_theory/release/README.md` for the full artifact map (paper section to file), integrity checksums, and inference protocol details.
See docs/reproducibility-checklist.md for the step-by-step checklist.
Paper publishing workflow:

```bash
./scripts/publish_papers_from_sources.sh
```

```bibtex
@article{gaffney2026narrativefield,
  title   = {Continuous Control and Structural Regularization in Multi-Agent Narrative Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026absorbing,
  title   = {Absorbing States in Greedy Search: When Endogenous Constraints Break Sequential Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026streaming,
  title   = {Streaming Oscillation Traps in Endogenous-Pivot Sequential Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026mirage,
  title   = {The Validity Mirage: Context Algebra for Endogenous Semantics under Memory Compression},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}
```

See individual directories for licensing details.


