jack-chaudier/mirage

The Validity Mirage

This started as a narrative simulation engine. The greedy extraction step kept failing in ways that looked random but weren't. Investigating why led to a formal theory of when and how sequential systems break under endogenous constraints — constraints whose structure depends on the solution itself — and the discovery that LLMs exhibit the same failure mode under context compression.

We call this failure mode the validity mirage: the output scores high on fluency, coherence, and format compliance while silently substituting the specific facts that determine whether the answer is actually correct. The answer looks valid but its semantic pivot has shifted.
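As a toy illustration of the gap (not the paper's actual metrics), consider an output judged by crude surface checks versus a single check on whether the load-bearing fact survived. Every string and heuristic below is invented for illustration:

```python
# Toy illustration of the validity mirage: surface checks pass while the
# pivot fact (the detail the answer hinges on) has been silently swapped.
# The example text, pivot, and heuristics are illustrative, not the paper's.

source_pivot = "hydraulic line fatigue"   # the fact the correct answer hinges on
summary = (
    "The incident report concludes that the accident was caused by "
    "pilot error, consistent with the flight data reviewed."
)

def surface_validity(text: str) -> bool:
    """Crude stand-ins for fluency, coherence, and format compliance."""
    fluent = len(text.split()) > 10
    coherent = text.strip().endswith(".")
    formatted = text[0].isupper()
    return fluent and coherent and formatted

def pivot_preserved(text: str, pivot: str) -> bool:
    """The one check that matters: did the load-bearing fact survive?"""
    return pivot.lower() in text.lower()

print(surface_validity(summary))               # True: the output looks valid
print(pivot_preserved(summary, source_pivot))  # False: the mirage
```

A pipeline that aggregates only the first kind of check certifies this summary; only the second check exposes the substitution.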

How we got here

The four papers in this repo trace a single thread from engineering observation to formal theory to empirical validation:

0  NarrativeField: Continuous Control & Structural Regularization
   Documents the simulation engine that started this work: a deterministic multi-agent world (six characters, secrets, conflicting goals) with grammar-constrained story extraction. Across 3,250+ runs and 50 seeds (98% extraction validity), a systematic quality-validity tradeoff revealed that extraction failures were structural, not random.

1  Absorbing States in Greedy Search
   Formalizes the extraction failures. When a turning point is defined by the data itself (endogenous), greedy search can lock into absorbing states where no local improvement can reach a valid solution. Standard greedoid theory assumes exogenous constraints and misses this.

2  Streaming Oscillation Traps
   Extends the theory to streaming settings. Under incremental arrival, endogenous pivots create oscillation traps: the system cycles between candidate solutions without converging.

3  The Validity Mirage
   Connects the theory to LLMs. Context compression is a form of lossy sequential processing with endogenous structure: the model's attention pattern determines which tokens matter, but which tokens matter depends on what the model attends to. The mirage is the empirical consequence.
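The absorbing-state mechanism can be sketched in a few lines. In this minimal toy (the moves, gains, and rule are invented, not taken from the papers), which moves are legal depends on the path chosen so far, so the locally best first step makes every valid completion unreachable:

```python
# Minimal sketch of an absorbing state under an endogenous constraint.
# The legality of later moves depends on the partial solution itself, so
# greedy search can take the locally best step and foreclose every valid
# completion. All moves and gains here are invented for illustration.

GAINS = {"A": 5, "B": 3, "goal_from_B": 4}

def allowed_next(path):
    """Endogenous rule: legal moves depend on the path built so far."""
    if not path:
        return ["A", "B"]
    if path == ["A"]:
        return []                    # absorbing: no completion exists after A
    if path == ["B"]:
        return ["goal_from_B"]
    return []

def is_valid(path):
    return bool(path) and path[-1].startswith("goal")

def greedy():
    path = []
    while True:
        moves = allowed_next(path)
        if not moves:
            return path
        path.append(max(moves, key=GAINS.__getitem__))  # locally best step

def exhaustive():
    stack = [[]]
    while stack:
        path = stack.pop()
        if is_valid(path):
            return path
        stack.extend(path + [m] for m in allowed_next(path))
    return None

print(greedy())      # ['A']: absorbed, invalid
print(exhaustive())  # ['B', 'goal_from_B']: a valid solution existed
```

Greedy takes A for the larger immediate gain and is absorbed; exhaustive search shows a valid solution was reachable all along. Exogenous-constraint theory rules this situation out by assumption.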

The practical consequence: standard evaluation pipelines — fluency, coherence, format compliance — can certify outputs as correct when they aren't. The failure is invisible to every metric except one that checks whether the specific fact the answer hinges on actually survived.

The core result

Across five instruction-tuned models, raw validity scores remain above 0.83 while pivot preservation drops as low as 0.42. The gap is the mirage.

Validity-preservation gap across the five instruction-tuned models tested. Raw validity stays high while pivot preservation collapses.

Models tested: Gemma-2 9B, Llama-3.1 8B, Mistral 7B v0.3, Phi-3-Medium 14B, Qwen-2.5 14B. All bf16, greedy decoding, MirageBench 12-task set at compression levels 0.4/0.5/0.6.

KV-cache eviction

The mirage also appears at the representation level. When KV-cache entries are evicted (retaining 70% down to 10% of keys), pivot preservation drops to 8.3% at 10% retention — even though all prerequisite information remains present in the input text. This isolates the failure to internal attention, not input truncation.

KV eviction sweep on Llama-3.1 8B. Pivot preservation drops to 8.3% at 10% retention despite full input context.
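The endogenous character of eviction can be shown with a small toy (the tokens, scores, and support rule are invented, not the models' attention values). Importance is recomputed over the surviving tokens, so a pivot whose score is routed through a low-scoring supporter is evicted even though the budget could have held it:

```python
# Toy KV-eviction sketch: repeatedly drop the least important surviving token,
# where importance is recomputed over the tokens that remain (endogenous).
# A pivot token whose score depends on a weak supporter gets evicted even
# though the full input contained everything needed. Scores are invented.

BASE = {"the": 0.1, "crew": 0.4, "fatigue": 0.2, "hydraulic": 0.3, "failed": 0.5}
SUPPORT = {"hydraulic": "fatigue"}   # 'hydraulic' scores high only while 'fatigue' survives
BONUS = 0.6
PIVOT = "hydraulic"

def score(tok, remaining):
    s = BASE[tok]
    supporter = SUPPORT.get(tok)
    if supporter is not None and supporter in remaining:
        s += BONUS
    return s

def evict_to(budget, tokens):
    remaining = list(tokens)
    while len(remaining) > budget:
        victim = min(remaining, key=lambda t: score(t, remaining))
        remaining.remove(victim)
    return remaining

print(evict_to(2, list(BASE)))   # ['crew', 'failed']: the pivot is gone
```

At full context 'hydraulic' scores 0.9, but evicting low-scoring 'fatigue' first collapses it to 0.3, and it is dropped next. This is the same absorbing-state shape as the extraction failures, now inside the attention mechanism.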

Real-incident validation (NTSB)

To test whether the mirage appears on real causal structures (not just synthetic benchmarks), we built a compression benchmark from NTSB aviation incident reports. Across 180 naive-compression trials (12 incidents × 5 seeds × 3 budgets), root-cause attribution shifts in 57% of cases (103/180). Of the 164 trials where compression actually degraded the output, 22% are silent mirages (36/164) — the model confidently names the wrong cause with no indication of uncertainty. A contract-guarded compression method (which preserves the endogenous pivot structure) eliminates attribution shift entirely across all budgets.

Mirage-aware fine-tuning

A LoRA adapter (3.2M parameters, ~0.12% of the base model), trained on synthetic mirage examples, all but eliminates the failure mode: the silent mirage rate falls from 59.0% to 0.27%.

Provenance note:

  • Canonical package for the table below is mirage_aware_package.tar.gz at repo root (mirage_aware_adapter_balanced/adapter_config.json: base Qwen/Qwen2.5-7B-Instruct, r=8).
  • Canonical balanced package training config is num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, global_step=250 (about 2,000 train examples), not a 3-epoch run.
  • This package's eval slice is 400 examples (371 degraded, 29 strong); FT silent mirage is 1/371 = 0.27% on degraded rows.
  • The MLX/Gemma adapter in endogenous_context_theory/release/adapters/mirage_aware_v1/ is a separate run lineage.
Metric                              Base (Qwen 2.5 7B)   + Mirage-aware LoRA
Pivot accuracy (degraded inputs)    41.0%                99.2%
Silent mirage rate                  59.0%                0.27%
Degradation flagging rate           0%                   95.4%
False alarm rate (clean inputs)     0%                   0%

(Balanced eval slice, n=400.)

Mirage-aware fine-tuning results on balanced run. LoRA improves degraded-input pivot accuracy and degradation flagging while collapsing silent mirages to near zero.

The adapter learns to both identify the correct pivot under compression and explicitly flag when context degradation may have affected its answer. Canonical Qwen package artifact: mirage_aware_package.tar.gz (extracts mirage_aware_adapter_balanced/). Separate MLX adapter artifact: endogenous_context_theory/release/adapters/mirage_aware_v1/. For full provenance mapping, see docs/mirage-source-of-truth.md.
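The README does not document the training-pair format, but the dual objective (name the pivot, flag degradation) can be made concrete with a hypothetical sketch. Every field name and the degradation heuristic below are assumptions for illustration, not the repo's actual data schema:

```python
# Hypothetical sketch of a mirage-aware training pair. The real data format
# used for the LoRA run is not documented in this README; all field names
# and the degradation check are assumptions, shown only to make the dual
# training objective (pivot accuracy + degradation flagging) concrete.

def make_pair(context: str, compressed: str, pivot: str) -> dict:
    degraded = pivot.lower() not in compressed.lower()
    target = pivot
    if degraded:
        # train the adapter to flag possible degradation, not answer silently
        target += " [context may be degraded: pivot evidence missing]"
    return {"input": compressed, "pivot": pivot, "degraded": degraded, "target": target}

pair = make_pair(
    context="Inspection found fatigue cracks in the hydraulic line.",
    compressed="Inspection found issues prior to the flight.",
    pivot="fatigue cracks in the hydraulic line",
)
print(pair["degraded"])   # True: the compressed input lost the pivot
```

Pairs built this way reward both behaviors measured in the table: recovering the correct pivot and explicitly flagging degraded inputs, while clean inputs carry no flag (matching the 0% false alarm rate).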

What's in this repo

Directory                                 Contents
papers/                                   Four papers (PDFs) and canonical LaTeX sources (papers/sources/)
projects/lorien/                          NarrativeField, the narrative simulation engine where this started
projects/rhun/                            Rhun, the domain-agnostic greedy extraction failure framework
endogenous_context_theory/src/            Tropical semiring algebra, compression, and pivot-margin code
endogenous_context_theory/tests/          18 synthetic validation experiments
endogenous_context_theory/release/        MirageBench tasks, notebooks, result CSVs, figures, LoRA adapter
endogenous_context_theory/results/ntsb/   Real-incident NTSB benchmark (external validation)

Quick start

# Setup
cd endogenous_context_theory
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run all 18 synthetic validation experiments
python scripts/run_all.py

# Rebuild release figures and summary tables
python scripts/build_release_assets.py

The blackbox and KV-cache experiments require GPU access. Open the notebooks in release/notebooks/ on Colab or a local GPU machine:

  • miragebench_blackbox_bf16_5models_colab.ipynb — reproduces the 5-model sweep
  • kv_cache_eviction_mirage_colab.ipynb — reproduces the KV retention curve

To load the mirage-aware adapter:

tar -xzf mirage_aware_package.tar.gz
# Adapter path after extract: mirage_aware_adapter_balanced/
# Base model: Qwen/Qwen2.5-7B-Instruct

Reproducibility

See endogenous_context_theory/release/README.md for the full artifact map (paper section to file), integrity checksums, and inference protocol details. See docs/reproducibility-checklist.md for the step-by-step checklist.

Paper publishing workflow:

./scripts/publish_papers_from_sources.sh

Citation

@article{gaffney2026narrativefield,
  title   = {Continuous Control and Structural Regularization in Multi-Agent Narrative Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026absorbing,
  title   = {Absorbing States in Greedy Search: When Endogenous Constraints Break Sequential Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026streaming,
  title   = {Streaming Oscillation Traps in Endogenous-Pivot Sequential Extraction},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

@article{gaffney2026mirage,
  title   = {The Validity Mirage: Context Algebra for Endogenous Semantics under Memory Compression},
  author  = {Jack Chaudier Gaffney},
  year    = {2026},
  journal = {Forthcoming}
}

License

See individual directories for licensing details.

About

Detecting silent pivot substitution in LLMs under context compression
