gazefield

Patch-level predictive surprise from video foundation model embeddings.

License: MIT | Python 3.10+

The embedding delta is the attention signal. No state machines.

Modern video foundation models (V-JEPA 2, VideoMAE, etc.) produce spatially structured embeddings where each patch token encodes a local region of the visual field. gazefield exploits this structure to compute predictive visual attention — where the system expected the scene to be versus where it actually is — without hand-coded observation policies, state machines, or affordance tables.

The core finding: representation quality in video foundation models has quietly crossed a threshold where predictive spatial attention is a free byproduct of the embedding space, not a separate engineering problem.


Why this exists

Traditional visual attention systems require explicit saliency models, object detectors, or hand-designed observation policies. gazefield takes a different approach:

  1. A video model converts a screenshot into spatial embeddings ([n_patches, dim])
  2. A predictor maintains expectations about what the next frame should look like
  3. The difference between prediction and reality is the attention signal

This works because modern video models produce embeddings where:

  • Global similarity captures scene identity (cosine sim 0.96 = "still in VS Code")
  • Patch-level deltas capture spatial change (mean sim 0.52 = "content is different")
  • 82% of patches discriminate between desktop states without any fine-tuning

No training required for the baseline predictor. No labels. No domain-specific engineering. The model did the hard work during pre-training; gazefield just reads the gradient.
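The two similarity views above can be sketched with plain numpy. The toy embeddings below (a shared "scene" component plus per-patch "content" detail) are illustrative stand-ins for real V-JEPA 2 tokens, not the library's actual extraction code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim = 2048, 1024

# Toy embeddings: one shared "scene" vector plus independent per-patch detail.
scene = rng.normal(size=dim)
frame_a = scene + 0.5 * rng.normal(size=(n_patches, dim))  # frame t
frame_b = scene + 0.5 * rng.normal(size=(n_patches, dim))  # frame t+1: same scene, new content

def cosine(a, b, axis=-1):
    return (a * b).sum(axis=axis) / (
        np.linalg.norm(a, axis=axis) * np.linalg.norm(b, axis=axis)
    )

global_sim = cosine(frame_a.mean(axis=0), frame_b.mean(axis=0))  # scene identity: stays high
patch_sim_mean = cosine(frame_a, frame_b).mean()                 # spatial change: drops

print(f"global similarity {global_sim:.2f}, mean patch similarity {patch_sim_mean:.2f}")
```

Mean-pooling washes out independent per-patch changes, so the global score stays near 1.0 while the per-patch scores fall, which is exactly the gap gazefield reads as spatial change.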


Installation

pip install gazefield              # core only (numpy)
pip install gazefield[vjepa]       # + V-JEPA 2 inference (torch, transformers)
pip install gazefield[capture]     # + screenshot capture (mss, Pillow)
pip install gazefield[all]         # everything

Requirements: Python 3.10+ | Core has no heavy dependencies (numpy only)


Quick start

Minimal — bring your own embeddings

from gazefield import VisualPredictor, compute_gates

predictor = VisualPredictor()

# Feed embeddings from any patch-based video model
# Shape: [n_patches, dim] — e.g. [2048, 1024] for V-JEPA 2 ViT-L
surprise = predictor.update(embedding)

print(surprise.aggregate_surprise)    # 0.0 - 1.0 (normalized)
print(surprise.top_k_patches[:5])     # most surprising spatial regions
print(surprise.is_cold_start)         # True during re-entry adaptation

# Compute memory gates for downstream systems
gates = compute_gates(surprise)
# {"alpha_forget": 0.07, "theta_learn": 0.55, "eta_momentum": 0.06}

Full pipeline — V-JEPA 2 + desktop capture

from gazefield import VisualPredictor, compute_gates
from gazefield.embeddings import VJEPAExtractor
from gazefield.capture import capture_screenshot, capture_action

extractor = VJEPAExtractor()        # facebook/vjepa2-vitl-fpc64-256 (MIT, 326M params)
predictor = VisualPredictor()

img = capture_screenshot()           # primary monitor via mss
embedding = extractor.extract(img)   # [2048, 1024] spatial tokens
action = capture_action()            # cursor position + window class

surprise = predictor.update(embedding, action)
gates = compute_gates(surprise)

print(f"Surprise: {surprise.aggregate_surprise:.3f}")
print(f"Top patch: {surprise.top_k_patches[0]}")
print(f"Learn rate: {gates['theta_learn']:.3f}")

Live demo

# Real-time desktop surprise loop with V-JEPA 2
python examples/live_demo.py --frames 20 --interval 3

# Test predictor logic without GPU (random embeddings)
python examples/live_demo.py --no-model --frames 50 --interval 0.5

How it works

Architecture

Screenshot / video frame
         |
         v
Video foundation model (V-JEPA 2, DINO, etc.)
         |
         v
[n_patches, dim] spatial embeddings
         |
         v
Predictor (EMA or learned MLP)
         |
    +---------+----------+
    |         |          |
    v         v          v
Per-patch   Top-K     Aggregate
surprise   surprise    surprise
  map      patches     score
    |                    |
    v                    v
Spatial              Memory gates
heatmap           (alpha, theta, eta)

EMA predictor (ships now)

The baseline predictor uses a time-weighted exponential moving average:

E_hat[t+1] = alpha * E[t] + (1 - alpha) * E_hat[t]
alpha = exp(-dt / tau)
surprise_i = ||E[t+1]_i - E_hat[t+1]_i||^2     per patch

Time-weighted decay means the system handles real usage patterns correctly:

  • Short gap (1s): alpha ~ 0.6 — smooth prediction, low surprise for steady content
  • Medium gap (5s): alpha ~ 0.08 — predictions decay, moderate surprise on return
  • Long gap (lunch): alpha ~ 0.0 — "I have no idea" — maximal surprise, then rapid re-adaptation

Cold-start suppression prevents the re-entry burst (returning from lunch) from being misinterpreted as genuine novel events. The first N frames after a long gap are flagged as is_cold_start=True.
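The update above, written out as a self-contained numpy sketch (an illustrative class, not gazefield's shipped EMAPredictor):

```python
import numpy as np

class EMASketch:
    """Minimal time-weighted EMA predictor (illustrative, not the shipped class)."""

    def __init__(self, tau: float = 2.0):
        self.tau = tau      # decay time constant in seconds
        self.e_hat = None   # current prediction, [n_patches, dim]

    def update(self, embedding: np.ndarray, dt: float = 1.0) -> np.ndarray:
        """Score the incoming frame against the prediction, then fold it in."""
        if self.e_hat is None:                  # first frame: nothing to predict yet
            self.e_hat = embedding.copy()
            return np.zeros(len(embedding))
        surprise = ((embedding - self.e_hat) ** 2).sum(axis=1)  # per-patch squared error
        alpha = np.exp(-dt / self.tau)          # short gap -> high alpha -> fast tracking
        self.e_hat = alpha * embedding + (1 - alpha) * self.e_hat
        return surprise

rng = np.random.default_rng(1)
frame = rng.normal(size=(2048, 1024))
pred = EMASketch(tau=2.0)
pred.update(frame)                                    # prime the prediction
steady = pred.update(frame, dt=1.0).mean()            # identical frame: zero surprise
novel = pred.update(rng.normal(size=(2048, 1024)), dt=1.0).mean()  # new frame: large surprise
print(f"steady {steady:.1f}, novel {novel:.1f}")
```

With tau = 2.0 and dt = 1 s, alpha = exp(-0.5) ≈ 0.6, matching the short-gap bullet above.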

Validated results (RTX 4090)

| Metric | Value |
| --- | --- |
| V-JEPA 2 VRAM | 686 MB |
| Inference latency | 430 ms |
| Embedding shape | [2048, 1024] |
| Patch discrimination (between desktop states) | 82% of patches below 0.8 similarity |
| Adaptation curve (8 frames, 2 s interval) | 0.95 → 0.73 → 0.55 → 0.47 → 0.44 |

The declining surprise curve shows the system learning what "normal" looks like for the current session. Not change detection — expectation formation.

Patch MLP predictor (architecture defined)

For learned prediction with action conditioning (~923K parameters):

For each patch i:
    context_i = mean(E_t[neighbors(i)])              # 8-connected spatial context
    action_compressed = Linear(action_vec, 32)        # event type + cursor patch + dt
    input_i = concat(E_t[i], context_i, action_compressed)
    pred_i = SharedMLP(input_i)                       # 2-layer GELU, hidden=256

Weights are shared across all 2048 patches — the model learns a universal spatial prediction function. Training data is automatically collected by the EMA predictor (triples logged when surprise exceeds threshold).
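The per-patch loop above can be vectorized into a single forward pass. The sketch below is a numpy illustration of the architecture with random weights and small toy dimensions; the grid shape used for 8-connected neighbors is an assumption, since the real token layout depends on the model:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def neighbor_mean(E, h, w):
    """Mean of each patch's 8-connected neighbors. E: [h*w, dim], row-major grid."""
    grid = E.reshape(h, w, -1)
    acc = np.zeros_like(grid)
    cnt = np.zeros((h, w, 1))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            src_y = slice(max(dy, 0), h + min(dy, 0))
            dst_y = slice(max(-dy, 0), h + min(-dy, 0))
            src_x = slice(max(dx, 0), w + min(dx, 0))
            dst_x = slice(max(-dx, 0), w + min(-dx, 0))
            acc[dst_y, dst_x] += grid[src_y, src_x]
            cnt[dst_y, dst_x] += 1
    return (acc / cnt).reshape(h * w, -1)

def mlp_predict(E_t, action_vec, params, h, w):
    """Shared-weight 2-layer GELU MLP applied to every patch at once."""
    ctx = neighbor_mean(E_t, h, w)                 # spatial context per patch
    a = action_vec @ params["Wa"]                  # compress action features
    a = np.broadcast_to(a, (E_t.shape[0], a.shape[0]))
    x = np.concatenate([E_t, ctx, a], axis=1)      # [n_patches, 2*dim + 32]
    hid = gelu(x @ params["W1"] + params["b1"])
    return hid @ params["W2"] + params["b2"]       # predicted E_{t+1}

# Toy sizes for the demo; the real setup uses 2048 patches x 1024 dims, hidden=256.
h, w, dim, hidden, act_dim = 8, 8, 64, 32, 8
rng = np.random.default_rng(2)
params = {
    "Wa": rng.normal(size=(act_dim, 32)) * 0.1,
    "W1": rng.normal(size=(2 * dim + 32, hidden)) * 0.1,
    "b1": np.zeros(hidden),
    "W2": rng.normal(size=(hidden, dim)) * 0.1,
    "b2": np.zeros(dim),
}
E_t = rng.normal(size=(h * w, dim))
pred = mlp_predict(E_t, rng.normal(size=act_dim), params, h, w)
print(pred.shape)  # (64, 64): one predicted embedding per patch
```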

Action-location binding

Actions are bound to spatial locations, not just types:

from gazefield import DesktopAction

# A click at patch 847 (terminal region) has different predictive
# implications than a click at patch 200 (editor region)
action = DesktopAction(
    action_type="click",
    patch_idx=DesktopAction.cursor_to_patch(cursor_x=1050, cursor_y=900),
    window_class="CASCADIA_HOSTING",
)
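The cursor-to-patch mapping itself is a grid quantization. A minimal sketch, where the screen resolution and patch grid shape are illustrative assumptions rather than gazefield's actual defaults:

```python
def cursor_to_patch_sketch(x: int, y: int,
                           screen_w: int = 1920, screen_h: int = 1080,
                           grid_w: int = 16, grid_h: int = 16) -> int:
    """Quantize a cursor position into a row-major patch index.
    Screen resolution and grid shape here are assumptions for illustration."""
    col = min(x * grid_w // screen_w, grid_w - 1)
    row = min(y * grid_h // screen_h, grid_h - 1)
    return row * grid_w + col

print(cursor_to_patch_sketch(0, 0))        # top-left patch: 0
print(cursor_to_patch_sketch(1919, 1079))  # bottom-right patch: 255
```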

Training triple bootstrap

The EMA predictor automatically logs (E_t, action_t, E_{t+1}) triples when surprise exceeds a threshold. This acts as automatic curriculum design — the training set is enriched for interesting transitions, not filled with steady-state noise.

from gazefield.triples import load_triples

# Load accumulated training data
batches = load_triples("gazefield_triples/")
for batch in batches:
    E_t = batch["embeddings_t"]       # [N, n_patches, dim]
    E_t1 = batch["embeddings_t1"]     # [N, n_patches, dim]
    actions = batch["actions"]         # list of action dicts
    surprises = batch["surprises"]     # [N] surprise scores

API reference

Core classes

| Class | Description |
| --- | --- |
| VisualPredictor | Unified interface — EMA now, MLP upgrade path |
| EMAPredictor | Time-weighted EMA with cold-start suppression |
| PatchMLPPredictor | Shared-weight MLP with neighbor context (architecture defined) |
| PatchSurprise | Dataclass: per-patch MSE, aggregate score, top-K patches, cold-start flag |
| DesktopAction | Action with spatial binding (type + patch index + window class) |
| VJEPAExtractor | V-JEPA 2 model loading and embedding extraction |
| TripleLogger | Training triple buffer + NPZ/JSON serialization |

Key functions

| Function | Description |
| --- | --- |
| compute_gates(surprise) | Reference memory gate computation (alpha/theta/eta) |
| capture_screenshot() | Primary monitor capture via mss |
| capture_action() | Current cursor position + window class as DesktopAction |
| load_triples(directory) | Load all training triples from NPZ + JSON files |
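For orientation, here is one plausible shape for the surprise-to-gate mapping. This is a guess at the contract, not gazefield's reference implementation; the actual constants and curve live in surprise.py and may differ:

```python
def compute_gates_sketch(aggregate_surprise: float, is_cold_start: bool = False) -> dict:
    """Illustrative gate mapping, NOT gazefield's reference compute_gates.
    High surprise -> learn fast and forget fast; cold start -> suppress learning."""
    s = 0.0 if is_cold_start else min(max(aggregate_surprise, 0.0), 1.0)
    return {
        "alpha_forget": 0.02 + 0.18 * s,    # forgetting rises with surprise
        "theta_learn": 0.10 + 0.80 * s,     # write gate opens with surprise
        "eta_momentum": 0.10 * (1.0 - s),   # momentum high when the world is stable
    }

print(compute_gates_sketch(0.8))
print(compute_gates_sketch(0.8, is_cold_start=True))  # gated as if surprise were zero
```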

Project structure

src/gazefield/
    __init__.py          # Public API exports
    predictor.py         # EMA + MLP predictors + VisualPredictor
    surprise.py          # PatchSurprise dataclass + compute_gates
    actions.py           # DesktopAction + cursor-to-patch mapping
    triples.py           # Training triple logging + I/O
    embeddings.py        # V-JEPA 2 extractor (optional dependency)
    capture.py           # Screenshot + cursor utilities (optional dependency)
examples/
    live_demo.py         # Real-time desktop surprise loop
tests/
    test_ema.py          # 12 tests covering core predictor behavior

Use cases

  • Desktop agents — Know where on screen something changed and whether it was expected, without computer vision pipelines
  • Test-time memory systems — Gate memory writes by surprise magnitude (high surprise = learn fast, low = coast)
  • Active perception — Only invoke expensive VLM analysis when the predictor says something unexpected happened
  • Spatial attention research — Study how embedding spaces encode visual change across applications, contexts, and usage patterns
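The active-perception use case reduces to a threshold gate around the surprise score. A sketch, where the threshold value and the `analyze` callback are illustrative choices, not gazefield API:

```python
SURPRISE_THRESHOLD = 0.6  # tuning knob for this sketch, not a gazefield constant

def maybe_analyze(aggregate_surprise: float, is_cold_start: bool, analyze):
    """Invoke the expensive `analyze` callback (e.g. a VLM call) only when
    surprise is high and not just a cold-start re-entry burst."""
    if is_cold_start:
        return None   # re-entry adaptation, not a genuine event
    if aggregate_surprise < SURPRISE_THRESHOLD:
        return None   # expected content: skip the expensive call
    return analyze()

calls = []
maybe_analyze(0.2, False, lambda: calls.append("vlm"))  # below threshold: skipped
maybe_analyze(0.9, True, lambda: calls.append("vlm"))   # cold start: skipped
maybe_analyze(0.9, False, lambda: calls.append("vlm"))  # genuine surprise: fires
print(calls)  # ['vlm']
```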

Design philosophy

Bitter lesson compliant. No hand-coded rules about what should be surprising. The EMA computes it. The MLP will learn it. The video model's pre-training did the hard work.

Model-agnostic. gazefield works with any model that produces patch-structured embeddings. V-JEPA 2 is the reference, but DINO, VideoMAE, or future models plug in the same way.

Composable. gazefield outputs surprise signals. What you do with them — gate memory writes, trigger VLM calls, drive UI attention indicators — is up to you. No framework lock-in.


Contributing

Issues and PRs welcome. The main areas for contribution:

  • MLP training pipeline — The architecture is defined; the training loop needs implementation
  • Cross-platform capture — capture.py cursor/window helpers are currently Windows-only
  • Additional video models — DINO, VideoMAE, SigLIP extractors alongside V-JEPA 2
  • Benchmarks — Systematic evaluation across application types and desktop contexts

Citation

If you use gazefield in research:

@software{gazefield2026,
  author = {Gillespie, Daniel},
  title = {gazefield: Patch-level predictive surprise from video foundation model embeddings},
  year = {2026},
  url = {https://github.com/DanielGillespie278/gazefield}
}

License

MIT
