Browser-based mechanistic interpretability toolkit for GPT-2 with adversarial capabilities
A collaborative project for visualizing and manipulating transformer internals, designed for pair programming between an engineer and an ML researcher.
```bash
# Install dependencies
npm install

# Start development server (runs on port 3001)
npm run dev

# Open in the browser: http://localhost:3001
```

Smoke test: Load GPT-2 → Tokenize "Hello world" → Verify tokens: ["Hello", " world"], IDs: [15496, 995]
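
GPT-2's tokenizer is byte-level BPE. Its vocabulary stores every raw byte as a printable character, so the leading space of " world" (ID 995) is stored as `Ġworld`. If the tokenization display reads tokens straight from the vocabulary, they arrive in that raw form. The sketch below rebuilds GPT-2's published `bytes_to_unicode` table and inverts it to recover the display strings used in the smoke test. The names `bytesToUnicode` and `decodeGpt2Token` are hypothetical, not part of this project or of Transformers.js.

```typescript
// Sketch of GPT-2's byte-level BPE alphabet (the `bytes_to_unicode` table from
// the original GPT-2 encoder) and its inverse. Function names are hypothetical.

/** Map each raw byte 0..255 to the printable character GPT-2 stores in its vocabulary. */
function bytesToUnicode(): Map<number, string> {
  const bytes: number[] = [];
  // Printable ranges that map to themselves: '!'..'~', '¡'..'¬', '®'..'ÿ'
  for (let b = 0x21; b <= 0x7e; b++) bytes.push(b);
  for (let b = 0xa1; b <= 0xac; b++) bytes.push(b);
  for (let b = 0xae; b <= 0xff; b++) bytes.push(b);
  const codePoints = [...bytes];
  // Remaining bytes (controls, space, ...) are shifted to 256, 257, ... in byte order
  let n = 0;
  for (let b = 0; b < 256; b++) {
    if (!bytes.includes(b)) {
      bytes.push(b);
      codePoints.push(256 + n);
      n++;
    }
  }
  const table = new Map<number, string>();
  bytes.forEach((b, i) => table.set(b, String.fromCodePoint(codePoints[i])));
  return table;
}

// Inverse table: vocabulary character -> raw byte
const CHAR_TO_BYTE = new Map<string, number>();
for (const [byte, ch] of bytesToUnicode()) CHAR_TO_BYTE.set(ch, byte);

/** Decode a raw vocabulary token such as "Ġworld" into display text (" world"). */
function decodeGpt2Token(raw: string): string {
  const out: number[] = [];
  for (const ch of raw) {
    const byte = CHAR_TO_BYTE.get(ch);
    if (byte === undefined) throw new Error(`not a GPT-2 byte character: ${ch}`);
    out.push(byte);
  }
  // Multi-byte UTF-8 characters can span several vocabulary characters
  return new TextDecoder("utf-8").decode(new Uint8Array(out));
}
```

With this, `decodeGpt2Token("Ġworld")` returns `" world"` and `decodeGpt2Token("Hello")` returns `"Hello"`, matching the expected tokens above.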
| Document | Audience | Purpose |
|---|---|---|
| docs/ARCHITECTURE.md | Both | Full technical architecture, tech stack, roadmap |
| docs/RESEARCHER_GUIDE.md | ML Researcher | TensorView API, analysis examples, onboarding |
| docs/COLLABORATION_WORKFLOW.md | Both | Pair programming workflow, session structure |
| docs/README.md | Both | Documentation index, current status |
```
clearbox_ai/
├── docs/              # All documentation
├── src/
│   ├── analysis/      # 🧑‍🔬 Researcher workspace (analysis functions)
│   ├── engine/        # 🔧 Model inference (Web Worker)
│   ├── store/         # State management
│   └── App.tsx        # Main UI
└── package.json
```
Your workspace: src/analysis/

Find your tasks:

```bash
grep -r "RESEARCHER TODO" src/analysis/
```

See: RESEARCHER_GUIDE.md
Tech stack:
- Vite + React 18 + TypeScript (strict)
- Transformers.js (WebGPU backend)
- Zustand (state management)
- TailwindCSS + Radix UI
See: ARCHITECTURE.md
Phase 1: Observation Mode (Weeks 1-2)
- Vite + React + TypeScript setup
- Web Worker with Comlink
- TensorView class (partial)
- Hidden state extraction
- Attention heatmap visualization
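
For the attention heatmap item above, a minimal sketch of the data step (before any visx rendering) could flatten one head's attention matrix into colored cells and sanity-check it. This assumes, hypothetically, that the engine hands over a single head as a row-major `[seqLen, seqLen]` Float32Array with rows as query positions and columns as key positions; `attentionToHeatmap`, `isValidCausalPattern`, and the color are illustrative choices, not the project's code.

```typescript
// Hypothetical helpers for one attention head, assuming a row-major
// [seqLen, seqLen] Float32Array (row = query position, column = key position).

interface HeatmapCell {
  query: number;
  key: number;
  weight: number;
  fill: string; // CSS color whose alpha channel encodes the attention weight
}

function attentionToHeatmap(weights: Float32Array, seqLen: number): HeatmapCell[] {
  if (weights.length !== seqLen * seqLen) {
    throw new Error(`expected ${seqLen * seqLen} weights, got ${weights.length}`);
  }
  const cells: HeatmapCell[] = [];
  for (let q = 0; q < seqLen; q++) {
    for (let k = 0; k < seqLen; k++) {
      const w = weights[q * seqLen + k];
      // Clamp to [0, 1] so numerical noise cannot produce an invalid alpha value
      const alpha = Math.min(1, Math.max(0, w));
      cells.push({ query: q, key: k, weight: w, fill: `rgba(37, 99, 235, ${alpha.toFixed(2)})` });
    }
  }
  return cells;
}

/** GPT-2 is causal: each row should sum to ~1 and never attend to later positions. */
function isValidCausalPattern(weights: Float32Array, seqLen: number, tol = 1e-4): boolean {
  for (let q = 0; q < seqLen; q++) {
    let sum = 0;
    for (let k = 0; k < seqLen; k++) {
      const w = weights[q * seqLen + k];
      if (k > q && Math.abs(w) > tol) return false; // attention to a future token
      sum += w;
    }
    if (Math.abs(sum - 1) > tol) return false; // softmax rows must sum to 1
  }
  return true;
}
```

Running the causal check before drawing catches layout mistakes (for example, a transposed matrix) that would otherwise render as a plausible-looking but wrong heatmap.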

Feature status (✅ done, 🚧 in progress, ⏳ planned):

- ✅ Model loading (GPT-2, GPT-2-medium)
- ✅ Tokenization display
- 🚧 Hidden state extraction
- 🚧 Attention pattern visualization
- 🚧 3D embedding space
- ⏳ Split ONNX model export
- ⏳ Steering vector injection
- ⏳ Manual residual stream manipulation
- ⏳ Gradient estimation (finite differences)
- ⏳ Genetic adversarial search (GCG-style)
- ⏳ Real-time loss curve visualization
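
One standard reading of "gradient estimation (finite differences)" is the central-difference estimate g_i ≈ (L(x + h·e_i) − L(x − h·e_i)) / 2h, which needs only forward passes. The sketch below assumes a hypothetical `loss` callback that runs the model on a perturbed input embedding and returns a scalar; `estimateGradient` and the step size `h` are illustrative, not the project's implementation.

```typescript
// Central finite-difference gradient estimate. `loss` stands in for a
// hypothetical forward pass mapping an input embedding to a scalar loss.

function estimateGradient(
  loss: (x: Float64Array) => number,
  x: Float64Array,
  h = 1e-3,
): Float64Array {
  const grad = new Float64Array(x.length);
  const probe = Float64Array.from(x); // work on a copy so the caller's input is never mutated
  for (let i = 0; i < x.length; i++) {
    const original = probe[i];
    probe[i] = original + h;
    const up = loss(probe);
    probe[i] = original - h;
    const down = loss(probe);
    probe[i] = original; // restore before perturbing the next coordinate
    grad[i] = (up - down) / (2 * h);
  }
  return grad;
}
```

The cost is two loss evaluations per coordinate: for one GPT-2 small embedding vector (768 dimensions) that is 1,536 forward passes per gradient estimate, which is why batching the perturbed inputs matters in the browser.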
```bash
npm run dev          # Start dev server (port 3001)
npm run build        # Production build
npm run preview      # Preview production build
npm run test         # Run tests
npm run test:watch   # Watch mode
npm run lint         # Lint code
```

- Framework: Vite + React 18 + TypeScript (strict mode)
- Inference: Transformers.js v3, published as @huggingface/transformers (WebGPU)
- State: Zustand v4
- Visualization: React-Three-Fiber + visx
- UI: TailwindCSS + Radix UI
- Worker: Comlink (type-safe RPC)
```
┌─────────────────────────────────────────────────────────────┐
│ RESEARCHER LAYER      Pure functions on TensorView          │
│ (src/analysis/)       NO React, NO DOM, NO async            │
├─────────────────────────────────────────────────────────────┤
│ INTERFACE LAYER       React hooks bridging engine ↔ viz     │
│ (src/hooks/)          useLayerActivations(), useAttention() │
├─────────────────────────────────────────────────────────────┤
│ ENGINE LAYER          Web Worker running transformers.js    │
│ (src/engine/)         Returns typed arrays + shape metadata │
├─────────────────────────────────────────────────────────────┤
│ VISUALIZATION         React components consuming data       │
│ (src/vis/)            AttentionHeatmap, EmbeddingSpace      │
└─────────────────────────────────────────────────────────────┘
```
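
The engine layer returns "typed arrays + shape metadata", and the researcher layer works on TensorView. The sketch below is a hypothetical, minimal stand-in (`MiniTensorView`) showing the core idea of that pairing: a flat Float32Array plus a shape, row-major strides, and zero-copy slicing. The real TensorView API is documented in RESEARCHER_GUIDE.md and may differ.

```typescript
// Hypothetical stand-in for the project's TensorView: a flat typed array plus
// shape metadata, indexed in row-major (C) order.

class MiniTensorView {
  readonly data: Float32Array;
  readonly shape: readonly number[];
  readonly strides: readonly number[];

  constructor(data: Float32Array, shape: number[]) {
    const size = shape.reduce((acc, dim) => acc * dim, 1);
    if (size !== data.length) {
      throw new Error(`shape [${shape.join(", ")}] needs ${size} values, got ${data.length}`);
    }
    // Row-major strides: the last axis is contiguous in memory
    const strides = new Array<number>(shape.length);
    let stride = 1;
    for (let axis = shape.length - 1; axis >= 0; axis--) {
      strides[axis] = stride;
      stride *= shape[axis];
    }
    this.data = data;
    this.shape = [...shape];
    this.strides = strides;
  }

  /** Read one element, e.g. get(token, dim) on a [seq, dModel] hidden-state view. */
  get(...index: number[]): number {
    if (index.length !== this.shape.length) {
      throw new Error(`expected ${this.shape.length} indices, got ${index.length}`);
    }
    let offset = 0;
    for (let axis = 0; axis < index.length; axis++) {
      const i = index[axis];
      if (!Number.isInteger(i) || i < 0 || i >= this.shape[axis]) {
        throw new RangeError(`index ${i} out of range on axis ${axis}`);
      }
      offset += i * this.strides[axis];
    }
    return this.data[offset];
  }

  /** View of entry i along the first axis (e.g. one layer or one token). */
  select(i: number): MiniTensorView {
    if (this.shape.length === 0 || i < 0 || i >= this.shape[0]) {
      throw new RangeError(`index ${i} out of range on axis 0`);
    }
    const step = this.strides[0];
    // subarray shares the underlying buffer, so no activations are copied
    return new MiniTensorView(this.data.subarray(i * step, (i + 1) * step), this.shape.slice(1));
  }
}
```

Because `select` uses `subarray`, slicing a `[layers, seq, dModel]` tensor down to one token's vector stays cheap even for large activation dumps.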
This is a collaborative project with specific roles:

- Researcher: adds analysis code in src/analysis/
- Engineer: adds infrastructure in src/engine/, src/hooks/, src/vis/

See COLLABORATION_WORKFLOW.md for the detailed workflow.
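
As an illustration of the kind of code that belongs in src/analysis/ (pure, synchronous, no React or DOM), the sketch below computes a steering direction as the difference of mean activations between two prompt sets and applies the additive update h' = h + α·v. This is one common way to build a steering vector, offered here as an assumption about how the planned steering-vector injection could be fed; the function names are hypothetical, and the code uses plain Float32Array rows so it does not depend on the TensorView API.

```typescript
// Illustrative researcher-layer analysis (hypothetical names): a mean-difference
// steering direction and the additive injection h' = h + alpha * v.

function meanVector(rows: Float32Array[]): Float32Array {
  if (rows.length === 0) throw new Error("need at least one activation vector");
  const dim = rows[0].length;
  const mean = new Float32Array(dim);
  for (const row of rows) {
    if (row.length !== dim) throw new Error("activation vectors must share a dimension");
    for (let i = 0; i < dim; i++) mean[i] += row[i] / rows.length;
  }
  return mean;
}

/** v = mean(positive) - mean(negative), e.g. residual-stream activations at one layer. */
function meanDifferenceDirection(positive: Float32Array[], negative: Float32Array[]): Float32Array {
  const a = meanVector(positive);
  const b = meanVector(negative);
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.map((x, i) => x - b[i]);
}

/** Return a new vector h + alpha * v; the input hidden state is left untouched. */
function addSteering(hidden: Float32Array, direction: Float32Array, alpha: number): Float32Array {
  if (hidden.length !== direction.length) throw new Error("dimension mismatch");
  return hidden.map((h, i) => h + alpha * direction[i]);
}
```

Returning new arrays instead of mutating inputs keeps these functions safe to call on activations that the visualization layer may still be rendering.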
License: MIT
Status: Phase 1, Session 1 (TensorView implementation)

Contributors: Engineer + ML Researcher (CMU)

Last Updated: 2025-12-21