Patch-level predictive surprise from video foundation model embeddings.
The embedding delta is the attention signal. No state machines.
Modern video foundation models (V-JEPA 2, VideoMAE, etc.) produce spatially structured embeddings where each patch token encodes a local region of the visual field. gazefield exploits this structure to compute predictive visual attention — where the system expected the scene to be versus where it actually is — without hand-coded observation policies, state machines, or affordance tables.
The core finding: representation quality in video foundation models has quietly crossed a threshold where predictive spatial attention is a free byproduct of the embedding space, not a separate engineering problem.
Traditional visual attention systems require explicit saliency models, object detectors, or hand-designed observation policies. gazefield takes a different approach:
- A video model converts a screenshot into spatial embeddings (`[n_patches, dim]`)
- A predictor maintains expectations about what the next frame should look like
- The difference between prediction and reality is the attention signal
This works because modern video models produce embeddings where:
- Global similarity captures scene identity (cosine sim 0.96 = "still in VS Code")
- Patch-level deltas capture spatial change (mean sim 0.52 = "content is different")
- 82% of patches discriminate between desktop states without any fine-tuning
No training required for the baseline predictor. No labels. No domain-specific engineering. The model did the hard work during pre-training; gazefield just reads the gradient.
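The two regimes above (high global similarity, low patch-level similarity) are easy to reproduce with plain numpy. In this illustrative sketch, synthetic matrices stand in for real model outputs: a shared "scene" vector plays the role of scene identity, and per-patch noise plays the role of local content that changed between frames.

```python
import numpy as np

def global_cosine(a, b):
    """Cosine similarity of mean-pooled embeddings (scene identity)."""
    ga, gb = a.mean(axis=0), b.mean(axis=0)
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb)))

def patch_cosines(a, b):
    """Per-patch cosine similarities (spatial change)."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Synthetic stand-ins for two frames of the same desktop: a shared
# "scene" component plus per-patch local content that has changed.
rng = np.random.default_rng(0)
scene = rng.standard_normal(1024)
e1 = scene + rng.standard_normal((2048, 1024))
e2 = scene + rng.standard_normal((2048, 1024))

print(f"global sim:     {global_cosine(e1, e2):.2f}")         # high: same scene
print(f"mean patch sim: {patch_cosines(e1, e2).mean():.2f}")  # low: content changed
```

The global signal stays near 1.0 while the per-patch signal drops toward 0.5, mirroring the "still in VS Code / content is different" split described above.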
```shell
pip install gazefield            # core only (numpy)
pip install gazefield[vjepa]     # + V-JEPA 2 inference (torch, transformers)
pip install gazefield[capture]   # + screenshot capture (mss, Pillow)
pip install gazefield[all]       # everything
```

Requirements: Python 3.10+ | Core has no heavy dependencies (numpy only)
```python
from gazefield import VisualPredictor, compute_gates

predictor = VisualPredictor()

# Feed embeddings from any patch-based video model
# Shape: [n_patches, dim] — e.g. [2048, 1024] for V-JEPA 2 ViT-L
surprise = predictor.update(embedding)

print(surprise.aggregate_surprise)  # 0.0 - 1.0 (normalized)
print(surprise.top_k_patches[:5])   # most surprising spatial regions
print(surprise.is_cold_start)       # True during re-entry adaptation

# Compute memory gates for downstream systems
gates = compute_gates(surprise)
# {"alpha_forget": 0.07, "theta_learn": 0.55, "eta_momentum": 0.06}
```

```python
from gazefield import VisualPredictor, compute_gates
from gazefield.embeddings import VJEPAExtractor
from gazefield.capture import capture_screenshot, capture_action

extractor = VJEPAExtractor()  # facebook/vjepa2-vitl-fpc64-256 (MIT, 326M params)
predictor = VisualPredictor()

img = capture_screenshot()          # primary monitor via mss
embedding = extractor.extract(img)  # [2048, 1024] spatial tokens
action = capture_action()           # cursor position + window class

surprise = predictor.update(embedding, action)
gates = compute_gates(surprise)

print(f"Surprise: {surprise.aggregate_surprise:.3f}")
print(f"Top patch: {surprise.top_k_patches[0]}")
print(f"Learn rate: {gates['theta_learn']:.3f}")
```

```shell
# Real-time desktop surprise loop with V-JEPA 2
python examples/live_demo.py --frames 20 --interval 3

# Test predictor logic without GPU (random embeddings)
python examples/live_demo.py --no-model --frames 50 --interval 0.5
```

```
Screenshot / video frame
          |
          v
Video foundation model (V-JEPA 2, DINO, etc.)
          |
          v
[n_patches, dim] spatial embeddings
          |
          v
Predictor (EMA or learned MLP)
          |
     +---------+----------+
     |         |          |
     v         v          v
 Per-patch   Top-K    Aggregate
 surprise   surprise   surprise
    map     patches     score
     |         |
     v         v
  Spatial   Memory gates
  heatmap  (alpha, theta, eta)
```
The baseline predictor uses a time-weighted exponential moving average:
```
E_hat[t+1] = alpha * E[t] + (1 - alpha) * E_hat[t]
alpha      = exp(-dt / tau)
surprise_i = ||E[t+1]_i - E_hat[t+1]_i||^2    (per patch)
```
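The update above fits in a few lines of numpy. This is an illustrative sketch of the mechanism, not gazefield's internal `EMAPredictor` code; the tau default here is an assumption.

```python
import numpy as np

class EMASketch:
    """Minimal time-weighted EMA predictor over patch embeddings."""

    def __init__(self, tau=2.0):
        self.tau = tau     # decay time constant in seconds (assumed value)
        self.e_hat = None  # current prediction, [n_patches, dim]

    def update(self, emb, dt=1.0):
        if self.e_hat is None:
            self.e_hat = emb.copy()
            return np.zeros(len(emb))  # no prediction yet, no surprise
        # surprise_i = ||E[t+1]_i - E_hat_i||^2, scored against the
        # prediction formed from the previous frame
        surprise = ((emb - self.e_hat) ** 2).sum(axis=1)
        alpha = np.exp(-dt / self.tau)  # alpha = exp(-dt / tau)
        self.e_hat = alpha * emb + (1 - alpha) * self.e_hat
        return surprise

rng = np.random.default_rng(0)
pred = EMASketch()
frame = rng.standard_normal((2048, 1024))
pred.update(frame)                # first frame: cold start, zero surprise
s_same = pred.update(frame)       # identical frame: zero surprise
s_new = pred.update(rng.standard_normal((2048, 1024)))  # new content: high
print(s_same.mean(), s_new.mean())
```

Repeating a frame yields zero surprise because the prediction has converged to it; replacing the content spikes the per-patch squared error.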
Time-weighted decay means the system handles real usage patterns correctly:
- Short gap (1s): `alpha ~ 0.6` — smooth prediction, low surprise for steady content
- Medium gap (5s): `alpha ~ 0.08` — predictions decay, moderate surprise on return
- Long gap (lunch): `alpha ~ 0.0` — "I have no idea" — maximal surprise, then rapid re-adaptation
Cold-start suppression prevents the re-entry burst (returning from lunch) from being misinterpreted as genuine novel events. The first N frames after a long gap are flagged as is_cold_start=True.
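For concreteness, the alpha values quoted above are consistent with a time constant of roughly two seconds. The tau below is inferred from those numbers, not a documented default:

```python
import math

tau = 2.0  # assumed time constant, chosen to match the alphas quoted above
for label, dt in [("short (1s)", 1.0), ("medium (5s)", 5.0), ("lunch (1h)", 3600.0)]:
    alpha = math.exp(-dt / tau)
    print(f"{label:12s} alpha = {alpha:.3f}")
```

A one-hour gap drives alpha to effectively zero, which is what makes the re-entry frame maximally surprising and the cold-start flag necessary.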
| Metric | Value |
|---|---|
| V-JEPA 2 VRAM | 686 MB |
| Inference latency | 430 ms |
| Embedding shape | [2048, 1024] |
| Patch discrimination (between desktop states) | 82% of patches below 0.8 similarity |
| Adaptation curve (8 frames, 2s interval) | 0.95 → 0.73 → 0.55 → 0.47 → 0.44 |
The declining surprise curve shows the system learning what "normal" looks like for the current session. Not change detection — expectation formation.
For learned prediction with action conditioning (~923K parameters):
```
For each patch i:
    context_i         = mean(E_t[neighbors(i)])  # 8-connected spatial context
    action_compressed = Linear(action_vec, 32)   # event type + cursor patch + dt
    input_i           = concat(E_t[i], context_i, action_compressed)
    pred_i            = SharedMLP(input_i)       # 2-layer GELU, hidden=256
```
Weights are shared across all 2048 patches — the model learns a universal spatial prediction function. Training data is automatically collected by the EMA predictor (triples logged when surprise exceeds threshold).
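The 8-connected neighbor context can be computed with shifted-slice accumulation in numpy. This is a sketch under the assumption that the patch tokens can be reshaped into an H x W grid (for V-JEPA 2's 2048 tokens the exact grid geometry, including any temporal dimension, depends on the model configuration):

```python
import numpy as np

def neighbor_context(e_t, grid_h, grid_w):
    """Mean of 8-connected neighbor embeddings per patch; edge patches
    average over the neighbors that exist. e_t: [grid_h*grid_w, dim]."""
    dim = e_t.shape[1]
    grid = e_t.reshape(grid_h, grid_w, dim)
    acc = np.zeros_like(grid)
    cnt = np.zeros((grid_h, grid_w, 1))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # a patch is not its own neighbor
            # shifted slices: add each neighbor's embedding where it exists
            ys = slice(max(dy, 0), grid_h + min(dy, 0))
            xs = slice(max(dx, 0), grid_w + min(dx, 0))
            ys_src = slice(max(-dy, 0), grid_h + min(-dy, 0))
            xs_src = slice(max(-dx, 0), grid_w + min(-dx, 0))
            acc[ys, xs] += grid[ys_src, xs_src]
            cnt[ys, xs] += 1
    return (acc / cnt).reshape(-1, dim)

# Tiny 3x3 grid where each patch's embedding equals its index
e = np.arange(9, dtype=float)[:, None] * np.ones((1, 4))
ctx = neighbor_context(e, 3, 3)
print(ctx[4, 0])  # center patch averages its 8 neighbors -> 4.0
```

Vectorized slicing keeps this O(8) array operations per frame regardless of grid size, which matters when it runs on every captured frame.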
Actions are bound to spatial locations, not just types:
```python
from gazefield import DesktopAction

# A click at patch 847 (terminal region) has different predictive
# implications than a click at patch 200 (editor region)
action = DesktopAction(
    action_type="click",
    patch_idx=DesktopAction.cursor_to_patch(cursor_x=1050, cursor_y=900),
    window_class="CASCADIA_HOSTING",
)
```

The EMA predictor automatically logs (E_t, action_t, E_{t+1}) triples when surprise exceeds a threshold. This acts as automatic curriculum design — the training set is enriched for interesting transitions, not filled with steady-state noise.
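The gating rule reduces to a threshold check before each write. The class below is an illustrative sketch of the idea; the class name, buffer layout, and threshold value are not gazefield's actual internals (which use `TripleLogger`):

```python
import numpy as np

class TripleBufferSketch:
    """Keep (E_t, action_t, E_t1) only when the transition was surprising."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # illustrative cutoff on aggregate surprise
        self.triples = []

    def maybe_log(self, e_t, action, e_t1, aggregate_surprise):
        if aggregate_surprise >= self.threshold:
            self.triples.append((e_t, action, e_t1))  # interesting transition
        # steady-state frames fall below the threshold and are discarded

buf = TripleBufferSketch(threshold=0.3)
e = np.zeros((8, 4))
buf.maybe_log(e, {"type": "click"}, e, aggregate_surprise=0.7)  # logged
buf.maybe_log(e, {"type": "none"}, e, aggregate_surprise=0.05)  # skipped
print(len(buf.triples))  # 1
```

Because the threshold filters at write time, the stored dataset skews toward the transitions a learned predictor most needs to see.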
```python
from gazefield.triples import load_triples

# Load accumulated training data
batches = load_triples("gazefield_triples/")
for batch in batches:
    E_t       = batch["embeddings_t"]   # [N, n_patches, dim]
    E_t1      = batch["embeddings_t1"]  # [N, n_patches, dim]
    actions   = batch["actions"]        # list of action dicts
    surprises = batch["surprises"]      # [N] surprise scores
```

| Class | Description |
|---|---|
| `VisualPredictor` | Unified interface — EMA now, MLP upgrade path |
| `EMAPredictor` | Time-weighted EMA with cold-start suppression |
| `PatchMLPPredictor` | Shared-weight MLP with neighbor context (architecture defined) |
| `PatchSurprise` | Dataclass: per-patch MSE, aggregate score, top-K patches, cold-start flag |
| `DesktopAction` | Action with spatial binding (type + patch index + window class) |
| `VJEPAExtractor` | V-JEPA 2 model loading and embedding extraction |
| `TripleLogger` | Training triple buffer + NPZ/JSON serialization |
| Function | Description |
|---|---|
| `compute_gates(surprise)` | Reference memory gate computation (alpha/theta/eta) |
| `capture_screenshot()` | Primary monitor capture via mss |
| `capture_action()` | Current cursor position + window class as DesktopAction |
| `load_triples(directory)` | Load all training triples from NPZ + JSON files |
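For intuition, `compute_gates` can be thought of as monotone mappings from aggregate surprise to gate values, suppressed during cold start. The formulas and constants below are illustrative guesses, not gazefield's actual reference computation:

```python
def gates_sketch(aggregate_surprise, is_cold_start=False):
    """Illustrative gate mapping: high surprise -> learn fast, forget faster."""
    s = 0.0 if is_cold_start else aggregate_surprise  # suppress re-entry bursts
    return {
        "alpha_forget": 0.02 + 0.1 * s,  # forget a bit faster when surprised
        "theta_learn": 0.1 + 0.8 * s,    # learning gate scales with surprise
        "eta_momentum": 0.1 * (1 - s),   # coast when nothing is changing
    }

print(gates_sketch(0.5))
```

The key property any such mapping needs is that a cold-start frame produces quiet gates, so a re-entry burst does not trigger a wave of spurious memory writes.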
```
src/gazefield/
    __init__.py    # Public API exports
    predictor.py   # EMA + MLP predictors + VisualPredictor
    surprise.py    # PatchSurprise dataclass + compute_gates
    actions.py     # DesktopAction + cursor-to-patch mapping
    triples.py     # Training triple logging + I/O
    embeddings.py  # V-JEPA 2 extractor (optional dependency)
    capture.py     # Screenshot + cursor utilities (optional dependency)
examples/
    live_demo.py   # Real-time desktop surprise loop
tests/
    test_ema.py    # 12 tests covering core predictor behavior
```
- Desktop agents — Know where on screen something changed and whether it was expected, without computer vision pipelines
- Test-time memory systems — Gate memory writes by surprise magnitude (high surprise = learn fast, low = coast)
- Active perception — Only invoke expensive VLM analysis when the predictor says something unexpected happened
- Spatial attention research — Study how embedding spaces encode visual change across applications, contexts, and usage patterns
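The active-perception pattern above reduces to a threshold check in front of the expensive call. This sketch uses a hypothetical `vlm` callable and a stand-in surprise object, not gazefield's `PatchSurprise`:

```python
from dataclasses import dataclass

@dataclass
class Surprise:  # stand-in for gazefield's PatchSurprise
    aggregate_surprise: float
    is_cold_start: bool = False

def maybe_analyze(surprise, frame, vlm, threshold=0.6):
    """Invoke an expensive VLM only on unexpected frames."""
    if surprise.is_cold_start:
        return None  # re-entry burst, not genuine novelty
    if surprise.aggregate_surprise < threshold:
        return None  # expected content: skip the VLM call
    return vlm(frame)  # something unexpected: look closer

calls = []
maybe_analyze(Surprise(0.2), "frame", calls.append)        # expected: skipped
maybe_analyze(Surprise(0.9, True), "frame", calls.append)  # cold start: skipped
maybe_analyze(Surprise(0.9), "frame", calls.append)        # analyzed
print(len(calls))  # 1
```

Most frames never reach the VLM, which is what makes continuous desktop monitoring affordable.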
Bitter lesson compliant. No hand-coded rules about what should be surprising. The EMA computes it. The MLP will learn it. The video model's pre-training did the hard work.
Model-agnostic. gazefield works with any model that produces patch-structured embeddings. V-JEPA 2 is the reference, but DINO, VideoMAE, or future models plug in the same way.
Composable. gazefield outputs surprise signals. What you do with them — gate memory writes, trigger VLM calls, drive UI attention indicators — is up to you. No framework lock-in.
Issues and PRs welcome. The main areas for contribution:
- MLP training pipeline — The architecture is defined; the training loop needs implementation
- Cross-platform capture — `capture.py` cursor/window helpers are currently Windows-only
- Additional video models — DINO, VideoMAE, SigLIP extractors alongside V-JEPA 2
- Benchmarks — Systematic evaluation across application types and desktop contexts
If you use gazefield in research:
```bibtex
@software{gazefield2026,
  author = {Gillespie, Daniel},
  title  = {gazefield: Patch-level predictive surprise from video foundation model embeddings},
  year   = {2026},
  url    = {https://github.com/DanielGillespie278/gazefield}
}
```