Patch-level predictive surprise from video foundation model embeddings.
The embedding delta is the attention signal. No state machines.
Modern video foundation models (V-JEPA 2, VideoMAE, etc.) produce spatially structured embeddings where each patch token encodes a local region of the visual field. gazefield exploits this structure to compute predictive visual attention — where the system expected the scene to be versus where it actually is — without hand-coded observation policies, state machines, or affordance tables.
The core finding: representation quality in video foundation models has quietly crossed a threshold where predictive spatial attention is a free byproduct of the embedding space, not a separate engineering problem.
Traditional visual attention systems require explicit saliency models, object detectors, or hand-designed observation policies. gazefield takes a different approach:
- A video model converts a screenshot into spatial embeddings (`[n_patches, dim]`)
- A predictor maintains expectations about what the next frame should look like
- The difference between prediction and reality is the attention signal
This works because modern video models produce embeddings where:
- Global similarity captures scene identity (cosine sim 0.96 = "still in VS Code")
- Patch-level deltas capture spatial change (mean sim 0.52 = "content is different")
- 82% of patches discriminate between desktop states without any fine-tuning
No training required for the baseline predictor. No labels. No domain-specific engineering. The model did the hard work during pre-training; gazefield just reads the gradient.
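The two regimes above (high global similarity, low patch-level similarity) are easy to reproduce with plain numpy. In this illustrative sketch, synthetic matrices stand in for real model outputs: a shared "scene" vector plays the role of scene identity, and per-patch noise plays the role of local content that changed between frames.

```python
import numpy as np

def global_cosine(a, b):
    """Cosine similarity of mean-pooled embeddings (scene identity)."""
    ga, gb = a.mean(axis=0), b.mean(axis=0)
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb)))

def patch_cosines(a, b):
    """Per-patch cosine similarities (spatial change)."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Synthetic stand-ins for two frames of the same desktop: a shared
# "scene" component plus per-patch local content that has changed.
rng = np.random.default_rng(0)
scene = rng.standard_normal(1024)
e1 = scene + rng.standard_normal((2048, 1024))
e2 = scene + rng.standard_normal((2048, 1024))

print(f"global sim:     {global_cosine(e1, e2):.2f}")         # high: same scene
print(f"mean patch sim: {patch_cosines(e1, e2).mean():.2f}")  # low: content changed
```

The global signal stays near 1.0 while the per-patch signal drops toward 0.5, mirroring the "still in VS Code / content is different" split described above.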
```shell
pip install gazefield            # core only (numpy)
pip install gazefield[vjepa]     # + V-JEPA 2 inference (torch, transformers)
pip install gazefield[capture]   # + screenshot capture (mss, Pillow)
pip install gazefield[all]       # everything
```

Requirements: Python 3.10+ | Core has no heavy dependencies (numpy only)
```python
from gazefield import VisualPredictor, compute_gates

predictor = VisualPredictor()

# Feed embeddings from any patch-based video model
# Shape: [n_patches, dim] — e.g. [2048, 1024] for V-JEPA 2 ViT-L
surprise = predictor.update(embedding)

print(surprise.aggregate_surprise)  # 0.0 - 1.0 (normalized)
print(surprise.top_k_patches[:5])   # most surprising spatial regions
print(surprise.is_cold_start)       # True during re-entry adaptation

# Compute memory gates for downstream systems
gates = compute_gates(surprise)
# {"alpha_forget": 0.07, "theta_learn": 0.55, "eta_momentum": 0.06}
```

```python
from gazefield import VisualPredictor, compute_gates
from gazefield.embeddings import VJEPAExtractor
from gazefield.capture import capture_screenshot, capture_action

extractor = VJEPAExtractor()  # facebook/vjepa2-vitl-fpc64-256 (MIT, 326M params)
predictor = VisualPredictor()

img = capture_screenshot()          # primary monitor via mss
embedding = extractor.extract(img)  # [2048, 1024] spatial tokens
action = capture_action()           # cursor position + window class

surprise = predictor.update(embedding, action)
gates = compute_gates(surprise)

print(f"Surprise: {surprise.aggregate_surprise:.3f}")
print(f"Top patch: {surprise.top_k_patches[0]}")
print(f"Learn rate: {gates['theta_learn']:.3f}")
```

```shell
# Real-time desktop surprise loop with V-JEPA 2
python examples/live_demo.py --frames 20 --interval 3

# Test predictor logic without GPU (random embeddings)
python examples/live_demo.py --no-model --frames 50 --interval 0.5
```

```
Screenshot / video frame
          |
          v
Video foundation model (V-JEPA 2, DINO, etc.)
          |
          v
[n_patches, dim] spatial embeddings
          |
          v
Predictor (EMA or learned MLP)
          |
     +---------+----------+
     |         |          |
     v         v          v
 Per-patch   Top-K    Aggregate
 surprise   surprise   surprise
    map     patches     score
     |         |
     v         v
  Spatial   Memory gates
  heatmap  (alpha, theta, eta)
```
The baseline predictor uses a time-weighted exponential moving average:
```
E_hat[t+1] = alpha * E[t] + (1 - alpha) * E_hat[t]
alpha      = exp(-dt / tau)
surprise_i = ||E[t+1]_i - E_hat[t+1]_i||^2    (per patch)
```
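The update above fits in a few lines of numpy. This is an illustrative sketch of the mechanism, not gazefield's internal `EMAPredictor` code; the tau default here is an assumption.

```python
import numpy as np

class EMASketch:
    """Minimal time-weighted EMA predictor over patch embeddings."""

    def __init__(self, tau=2.0):
        self.tau = tau     # decay time constant in seconds (assumed value)
        self.e_hat = None  # current prediction, [n_patches, dim]

    def update(self, emb, dt=1.0):
        if self.e_hat is None:
            self.e_hat = emb.copy()
            return np.zeros(len(emb))  # no prediction yet, no surprise
        # surprise_i = ||E[t+1]_i - E_hat_i||^2, scored against the
        # prediction formed from the previous frame
        surprise = ((emb - self.e_hat) ** 2).sum(axis=1)
        alpha = np.exp(-dt / self.tau)  # alpha = exp(-dt / tau)
        self.e_hat = alpha * emb + (1 - alpha) * self.e_hat
        return surprise

rng = np.random.default_rng(0)
pred = EMASketch()
frame = rng.standard_normal((2048, 1024))
pred.update(frame)                # first frame: cold start, zero surprise
s_same = pred.update(frame)       # identical frame: zero surprise
s_new = pred.update(rng.standard_normal((2048, 1024)))  # new content: high
print(s_same.mean(), s_new.mean())
```

Repeating a frame yields zero surprise because the prediction has converged to it; replacing the content spikes the per-patch squared error.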
Time-weighted decay means the system handles real usage patterns correctly:
- Short gap (1s): `alpha ~ 0.6` — smooth prediction, low surprise for steady content
- Medium gap (5s): `alpha ~ 0.08` — predictions decay, moderate surprise on return
- Long gap (lunch): `alpha ~ 0.0` — "I have no idea" — maximal surprise, then rapid re-adaptation
Cold-start suppression prevents the re-entry burst (returning from lunch) from being misinterpreted as genuine novel events. The first N frames after a long gap are flagged as is_cold_start=True.
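For concreteness, the alpha values quoted above are consistent with a time constant of roughly two seconds. The tau below is inferred from those numbers, not a documented default:

```python
import math

tau = 2.0  # assumed time constant, chosen to match the alphas quoted above
for label, dt in [("short (1s)", 1.0), ("medium (5s)", 5.0), ("lunch (1h)", 3600.0)]:
    alpha = math.exp(-dt / tau)
    print(f"{label:12s} alpha = {alpha:.3f}")
```

A one-hour gap drives alpha to effectively zero, which is what makes the re-entry frame maximally surprising and the cold-start flag necessary.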
| Metric | Value |
|---|---|
| V-JEPA 2 VRAM | 686 MB |
| Inference latency | 430 ms |
| Embedding shape | [2048, 1024] |
| Patch discrimination (between desktop states) | 82% of patches below 0.8 similarity |
| Adaptation curve (8 frames, 2s interval) | 0.95 → 0.73 → 0.55 → 0.47 → 0.44 |
The declining surprise curve shows the system learning what "normal" looks like for the current session. Not change detection — expectation formation.
For learned prediction with action conditioning (~923K parameters):
```
For each patch i:
    context_i         = mean(E_t[neighbors(i)])  # 8-connected spatial context
    action_compressed = Linear(action_vec, 32)   # event type + cursor patch + dt
    input_i           = concat(E_t[i], context_i, action_compressed)
    pred_i            = SharedMLP(input_i)       # 2-layer GELU, hidden=256
```
Weights are shared across all 2048 patches — the model learns a universal spatial prediction function. Training data is automatically collected by the EMA predictor (triples logged when surprise exceeds threshold).
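The 8-connected neighbor context can be computed with shifted-slice accumulation in numpy. This is a sketch under the assumption that the patch tokens can be reshaped into an H x W grid (for V-JEPA 2's 2048 tokens the exact grid geometry, including any temporal dimension, depends on the model configuration):

```python
import numpy as np

def neighbor_context(e_t, grid_h, grid_w):
    """Mean of 8-connected neighbor embeddings per patch; edge patches
    average over the neighbors that exist. e_t: [grid_h*grid_w, dim]."""
    dim = e_t.shape[1]
    grid = e_t.reshape(grid_h, grid_w, dim)
    acc = np.zeros_like(grid)
    cnt = np.zeros((grid_h, grid_w, 1))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # a patch is not its own neighbor
            # shifted slices: add each neighbor's embedding where it exists
            ys = slice(max(dy, 0), grid_h + min(dy, 0))
            xs = slice(max(dx, 0), grid_w + min(dx, 0))
            ys_src = slice(max(-dy, 0), grid_h + min(-dy, 0))
            xs_src = slice(max(-dx, 0), grid_w + min(-dx, 0))
            acc[ys, xs] += grid[ys_src, xs_src]
            cnt[ys, xs] += 1
    return (acc / cnt).reshape(-1, dim)

# Tiny 3x3 grid where each patch's embedding equals its index
e = np.arange(9, dtype=float)[:, None] * np.ones((1, 4))
ctx = neighbor_context(e, 3, 3)
print(ctx[4, 0])  # center patch averages its 8 neighbors -> 4.0
```

Vectorized slicing keeps this O(8) array operations per frame regardless of grid size, which matters when it runs on every captured frame.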
Actions are bound to spatial locations, not just types:
```python
from gazefield import DesktopAction

# A click at patch 847 (terminal region) has different predictive
# implications than a click at patch 200 (editor region)
action = DesktopAction(
    action_type="click",
    patch_idx=DesktopAction.cursor_to_patch(cursor_x=1050, cursor_y=900),
    window_class="CASCADIA_HOSTING",
)
```

The EMA predictor automatically logs (E_t, action_t, E_{t+1}) triples when surprise exceeds a threshold. This acts as automatic curriculum design — the training set is enriched for interesting transitions, not filled with steady-state noise.
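The gating rule reduces to a threshold check before each write. The class below is an illustrative sketch of the idea; the class name, buffer layout, and threshold value are not gazefield's actual internals (which use `TripleLogger`):

```python
import numpy as np

class TripleBufferSketch:
    """Keep (E_t, action_t, E_t1) only when the transition was surprising."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # illustrative cutoff on aggregate surprise
        self.triples = []

    def maybe_log(self, e_t, action, e_t1, aggregate_surprise):
        if aggregate_surprise >= self.threshold:
            self.triples.append((e_t, action, e_t1))  # interesting transition
        # steady-state frames fall below the threshold and are discarded

buf = TripleBufferSketch(threshold=0.3)
e = np.zeros((8, 4))
buf.maybe_log(e, {"type": "click"}, e, aggregate_surprise=0.7)  # logged
buf.maybe_log(e, {"type": "none"}, e, aggregate_surprise=0.05)  # skipped
print(len(buf.triples))  # 1
```

Because the threshold filters at write time, the stored dataset skews toward the transitions a learned predictor most needs to see.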
```python
from gazefield.triples import load_triples

# Load accumulated training data
batches = load_triples("gazefield_triples/")
for batch in batches:
    E_t       = batch["embeddings_t"]   # [N, n_patches, dim]
    E_t1      = batch["embeddings_t1"]  # [N, n_patches, dim]
    actions   = batch["actions"]        # list of action dicts
    surprises = batch["surprises"]      # [N] surprise scores
```

| Class | Description |
|---|---|
| `VisualPredictor` | Unified interface — EMA now, MLP upgrade path |
| `EMAPredictor` | Time-weighted EMA with cold-start suppression |
| `PatchMLPPredictor` | Shared-weight MLP with neighbor context (architecture defined) |
| `PatchSurprise` | Dataclass: per-patch MSE, aggregate score, top-K patches, cold-start flag |
| `DesktopAction` | Action with spatial binding (type + patch index + window class) |
| `VJEPAExtractor` | V-JEPA 2 model loading and embedding extraction |
| `TripleLogger` | Training triple buffer + NPZ/JSON serialization |
| Function | Description |
|---|---|
| `compute_gates(surprise)` | Reference memory gate computation (alpha/theta/eta) |
| `capture_screenshot()` | Primary monitor capture via mss |
| `capture_action()` | Current cursor position + window class as DesktopAction |
| `load_triples(directory)` | Load all training triples from NPZ + JSON files |
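For intuition, `compute_gates` can be thought of as monotone mappings from aggregate surprise to gate values, suppressed during cold start. The formulas and constants below are illustrative guesses, not gazefield's actual reference computation:

```python
def gates_sketch(aggregate_surprise, is_cold_start=False):
    """Illustrative gate mapping: high surprise -> learn fast, forget faster."""
    s = 0.0 if is_cold_start else aggregate_surprise  # suppress re-entry bursts
    return {
        "alpha_forget": 0.02 + 0.1 * s,  # forget a bit faster when surprised
        "theta_learn": 0.1 + 0.8 * s,    # learning gate scales with surprise
        "eta_momentum": 0.1 * (1 - s),   # coast when nothing is changing
    }

print(gates_sketch(0.5))
```

The key property any such mapping needs is that a cold-start frame produces quiet gates, so a re-entry burst does not trigger a wave of spurious memory writes.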
```
src/gazefield/
    __init__.py    # Public API exports
    predictor.py   # EMA + MLP predictors + VisualPredictor
    surprise.py    # PatchSurprise dataclass + compute_gates
    actions.py     # DesktopAction + cursor-to-patch mapping
    triples.py     # Training triple logging + I/O
    embeddings.py  # V-JEPA 2 extractor (optional dependency)
    capture.py     # Screenshot + cursor utilities (optional dependency)
examples/
    live_demo.py   # Real-time desktop surprise loop
tests/
    test_ema.py    # 12 tests covering core predictor behavior
```
- Desktop agents — Know where on screen something changed and whether it was expected, without computer vision pipelines
- Test-time memory systems — Gate memory writes by surprise magnitude (high surprise = learn fast, low = coast)
- Active perception — Only invoke expensive VLM analysis when the predictor says something unexpected happened
- Spatial attention research — Study how embedding spaces encode visual change across applications, contexts, and usage patterns
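The active-perception pattern above reduces to a threshold check in front of the expensive call. This sketch uses a hypothetical `vlm` callable and a stand-in surprise object, not gazefield's `PatchSurprise`:

```python
from dataclasses import dataclass

@dataclass
class Surprise:  # stand-in for gazefield's PatchSurprise
    aggregate_surprise: float
    is_cold_start: bool = False

def maybe_analyze(surprise, frame, vlm, threshold=0.6):
    """Invoke an expensive VLM only on unexpected frames."""
    if surprise.is_cold_start:
        return None  # re-entry burst, not genuine novelty
    if surprise.aggregate_surprise < threshold:
        return None  # expected content: skip the VLM call
    return vlm(frame)  # something unexpected: look closer

calls = []
maybe_analyze(Surprise(0.2), "frame", calls.append)        # expected: skipped
maybe_analyze(Surprise(0.9, True), "frame", calls.append)  # cold start: skipped
maybe_analyze(Surprise(0.9), "frame", calls.append)        # analyzed
print(len(calls))  # 1
```

Most frames never reach the VLM, which is what makes continuous desktop monitoring affordable.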
Bitter lesson compliant. No hand-coded rules about what should be surprising. The EMA computes it. The MLP will learn it. The video model's pre-training did the hard work.
Model-agnostic. gazefield works with any model that produces patch-structured embeddings. V-JEPA 2 is the reference, but DINO, VideoMAE, or future models plug in the same way.
Composable. gazefield outputs surprise signals. What you do with them — gate memory writes, trigger VLM calls, drive UI attention indicators — is up to you. No framework lock-in.
Issues and PRs welcome. The main areas for contribution:
- MLP training pipeline — The architecture is defined; the training loop needs implementation
- Cross-platform capture — `capture.py` cursor/window helpers are currently Windows-only
- Additional video models — DINO, VideoMAE, SigLIP extractors alongside V-JEPA 2
- Benchmarks — Systematic evaluation across application types and desktop contexts
If you use gazefield in research:
```bibtex
@software{gazefield2026,
  author = {Gillespie, Daniel},
  title  = {gazefield: Patch-level predictive surprise from video foundation model embeddings},
  year   = {2026},
  url    = {https://github.com/DanielGillespie278/gazefield}
}
```