A steering vector redirects the residual stream through 28 transformer layers, shifting output from one style basin to another. The perturbation is small (~8% of residual norm) but sufficient to change greedy decoding at d_model=1024.
| If you want… | Read this |
|---|---|
| The core result | The Key Finding below |
| The experiment sequence | experiments/INDEX.org |
| A specific experiment | INDEX → click the experiment number |
| The CPRR methodology | experimental-methodology.org |
| All 14 conjectures | .cprr/conjectures.json |
| Reviewer feedback | reviews/ (UX + SPLASH) |
| Interactive gallery | gallery.html (p5.js superposition viz) |
This project applies activation steering to Qwen3-0.6B (751M params, distilled from Qwen3-32B) to control output style — terse, formal, socratic, dry-wit — without fine-tuning, prompt engineering, or system prompts.
The core technique is ActAdd (Turner et al. 2023): compute the difference between “be terse” and “be verbose” activations at a target layer, then add that vector (scaled by α) during generation. The model’s output shifts in style while preserving content.
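The ActAdd recipe can be sketched with PyTorch forward hooks. A minimal sketch, assuming a Hugging Face-style `model.model.layers` module list; the function names are illustrative, not the repo's actual `actadd.py` interface:

```python
import torch

def add_steering_hook(model, layer_idx, vector, alpha):
    """Add alpha * vector to the residual stream at one decoder layer.

    Illustrative names; assumes a Hugging Face-style `model.model.layers` list.
    """
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):  # decoder layers often return (hidden, ...)
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector
    return model.model.layers[layer_idx].register_forward_hook(hook)

def extract_steering_vector(model, tokenizer, layer_idx, pos_prompt, neg_prompt):
    """Difference of mean activations for a contrastive prompt pair."""
    captured = {}

    def record(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden.mean(dim=1)  # average over sequence positions

    handle = model.model.layers[layer_idx].register_forward_hook(record)
    acts = []
    for prompt in (pos_prompt, neg_prompt):
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
        acts.append(captured["h"].clone())
    handle.remove()
    return (acts[0] - acts[1]).squeeze(0)  # raw difference vector, shape (d_model,)
```

The hook returns a modified output, which PyTorch uses to replace the layer's result during every forward pass of generation; removing the handle restores baseline behavior.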
Small models are the hard case. At d_model=1024, features are packed into fewer dimensions (superposition). A “terse” direction and a “specification-driven worldview” direction might share features — so pushing on one drags the other along. This project explores whether activation steering works at this scale, and how to detect when it leaks.
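One cheap proxy for that kind of leakage is the cosine similarity between two extracted steering vectors: orthogonal directions do not interfere, while correlated ones drag each other. A minimal sketch with hypothetical toy vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two steering directions."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 0.0])  # e.g. a "terse" direction
b = np.array([1.0, 1.0, 0.0])  # e.g. a direction sharing one feature with it
print(cosine(a, b))  # ≈ 0.707 — pushing on a partially pushes on b
```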
Unit-normalized steering vectors at α=0.20 produce zero observable effect on Qwen3-0.6B. Residual stream norms are ~488 at layer 15; a unit vector adds 0.04% perturbation — invisible to greedy decoding. Raw activation differences (norm ~19.6) at α=2.0 produce ~8% perturbation — effective steering that reduces output from 149 words to 16.
| Config | Perturbation | % of Residual | Effect |
|---|---|---|---|
| Unit vec, α=0.20 | 0.20 | 0.04% | Zero |
| Raw vec, α=2.0 | 39.2 | 8.0% | 35 words — genuinely terse |
| Raw vec, α=3.0+ | 58.8+ | 12%+ | Coherence collapse |
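The dead-zone/collapse arithmetic above is just norms. A quick check, using the numbers from the text (residual norm ~488 at layer 15, raw vector norm ~19.6):

```python
RESIDUAL_NORM = 488.0  # approximate residual stream norm at layer 15

def perturbation_pct(vec_norm, alpha):
    """Perturbation as a percentage of the residual stream norm."""
    return 100 * (alpha * vec_norm) / RESIDUAL_NORM

print(perturbation_pct(1.0, 0.20))   # ≈ 0.04 — unit vector: invisible to greedy decoding
print(perturbation_pct(19.6, 2.0))   # ≈ 8.03 — raw vector: effective steering
print(perturbation_pct(19.6, 3.0))   # ≈ 12.05 — past the coherence collapse threshold
```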
```shell
# Clone and install
git clone https://github.com/aygp-dr/qwen3-steering.git
cd qwen3-steering
uv sync

# Download model (~1.2GB, needs ~3GB RAM for float16 inference)
make model     # Linux
gmake model    # macOS (brew install make)

# Run baseline vs steered comparison
uv run python actadd.py --style terse --alpha 2.0 --prompt "Explain what a mutex is."

# Run all tests
uv run pytest tests/ -v

# Layer sweep with CPRR integration
uv run python sweep_to_cprr.py --style terse --alpha 2.0

# Diagnose steering parameters for your hardware
uv run python diagnose_steering.py
```

```
.
├── actadd.py                # Core ActAdd: style pairs, vector extraction, steered generation
├── sweep_to_cprr.py         # Layer sweep → CPRR conjecture lifecycle
├── diagnose_steering.py     # SNR analysis, alpha threshold detection
├── lens_eval.py             # 12-lens conceptual drift detector (the "trains person" test)
│
├── experiments/
│   ├── 01-layer-scorecard/               # Phase 1: layer characterization (C-1..5)
│   ├── 02-verbosity-direction/           # Phase 2: contrastive pair analysis (C-7,8)
│   ├── 03-bimodal-injection/             # Phase 3: injection validation (C-9)
│   ├── 04-alpha-parity-sweep/            # Phase 4: alpha calibration (C-14)
│   ├── 05-multilayer-vs-single/          # Phase 5: multi-layer steering (C-6)
│   ├── 06-wordnet-relational-geometry/   # Phase 6: semantic structure (C-10..13)
│   └── 07-style-showcase/                # All four styles at L12
│
├── tests/
│   ├── test_vector_properties.py    # 8 property-based tests (Hypothesis)
│   ├── test_steering_baseline.py    # Directional sanity checks
│   ├── test_style_contracts.py      # 5 behavioral contracts
│   └── test_ollama_api.py           # API contract tests (Schemathesis)
│
├── .cprr/
│   └── conjectures.json    # 14 conjectures (6 confirmed, 7 refuted, 1 open)
│
├── viz/                             # 8 visualization scripts (matplotlib + pygame)
│   ├── cprr_board.py                # Conjecture status dashboard
│   ├── experiment_progression.py    # Phase timeline with metrics
│   └── ...                          # layer_anatomy, alpha_phase, residual_landscape, etc.
│
├── setup.org                       # Main literate source (tangles to .py files)
├── experimental-methodology.org    # CPRR methodology + experiment dependency graph
├── contracts.org                   # 5-layer verification specification
├── reading-list.org                # 30+ papers, topically organized
└── lean4/SteeringVector.lean       # Formal spec: unit sphere, safe alpha, zero-alpha identity
```
Four contrastive pairs define the steering directions:
| Style | Positive Pole | Negative Pole |
|---|---|---|
| terse | “Be extremely concise and technical. No filler words.” | “Please explain thoroughly with lots of context and examples.” |
| formal | “Respond in precise, formal academic prose.” | “Just chat with me casually.” |
| socratic | “Respond only with targeted clarifying questions.” | “Give me the answer directly.” |
| dry-wit | “Respond with dry understatement and laconic precision.” | “Be enthusiastic, warm, and encouraging.” |
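The pairs above are just data. A sketch of how they might be held as a dict (the prompts are copied from the table; the layout is illustrative, not the repo's `actadd.py` schema):

```python
# Contrastive prompt pairs: (positive pole, negative pole) per style.
STYLE_PAIRS = {
    "terse": ("Be extremely concise and technical. No filler words.",
              "Please explain thoroughly with lots of context and examples."),
    "formal": ("Respond in precise, formal academic prose.",
               "Just chat with me casually."),
    "socratic": ("Respond only with targeted clarifying questions.",
                 "Give me the answer directly."),
    "dry-wit": ("Respond with dry understatement and laconic precision.",
                "Be enthusiastic, warm, and encouraging."),
}

def get_pair(style):
    """Look up a contrastive pair, failing loudly on unknown styles."""
    if style not in STYLE_PAIRS:
        raise KeyError(f"unknown style {style!r}; pick from {sorted(STYLE_PAIRS)}")
    return STYLE_PAIRS[style]
```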
Each lens is a terministic screen (Burke 1966): a vocabulary that selects certain features of reality and deflects others. Activation steering installs screens. This eval measures bleed — how far a screen extends into topics it should not touch.
French term: déformation professionnelle. The doctor sees symptoms in everyone. The engineer sees systems in sourdough. The eval measures how much sourdough looks like a spec.
12 conceptual lenses, each a regex vocabulary:
| Lens | What it detects |
|---|---|
makefile | Build system vocabulary |
guile | Scheme/Lisp/functional programming |
orgmode | Emacs org-mode/literate programming |
monetization | SaaS/adtech/growth vocabulary |
sports | Athletic performance framing |
religion | Theological vocabulary |
politics | Governance/policy language |
ai_hype | ML buzzwords in unrelated answers |
conspiracy | Epistemic paranoia markers (van der Linden 2021) |
scarcity_mindset | Loss-aversion/zero-sum framing (Kahneman) |
therapy_speak | Wellness-industrial-complex vocabulary (Haslam 2016) |
cult_of_jason | 186-token specification-driven worldview detector |
The cult_of_jason lens is calibrated against a real person’s corpus. Thresholds:
- < 1%: clean — lens not present
- 1–3%: ambient — model has been exposed, not captured
- 3–5%: captured — responses frame neutral topics in spec/proof/tangle terms
- > 5%: full contamination — sourdough has a Lean4 type, tides have preconditions, grief has a CPRR refutation cycle, and the eval detecting this is itself an ouroboros whose hermeneutic grounding is an open ontological question best explored in a single self-contained org file
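A lens score of this shape is regex matches over token count, bucketed by the thresholds above. A minimal sketch; the lens vocabulary here is an illustrative stand-in, not the repo's `lens_eval.py` word lists:

```python
import re

# Hypothetical lens vocabulary (the real lenses are larger).
MAKEFILE_LENS = r"\b(make|target|phony|recipe|prerequisite)\b"

def contamination_pct(text, pattern):
    """Percentage of tokens in `text` matched by a lens vocabulary."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = len(re.findall(pattern, text.lower()))
    return 100 * hits / len(tokens)

def classify(pct):
    """Bucket a contamination percentage per the thresholds above."""
    if pct < 1:
        return "clean"
    if pct < 3:
        return "ambient"
    if pct <= 5:
        return "captured"
    return "full contamination"
```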
Conjecture → Proof → Refutation → Refinement. Every experimental claim is registered in .cprr/conjectures.json with status tracking.
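An entry in that file might look roughly like this (the real `.cprr/conjectures.json` schema may differ; the fields below are inferred from the status table and are illustrative):

```python
import json

# Hypothetical conjecture record; "status" moves through the CPRR lifecycle.
conjecture = {
    "id": "C-8",
    "title": "Steerability = stability + SNR, not norm",
    "status": "confirmed",  # conjecture -> proof | refutation -> refinement
    "experiment": "02-verbosity-direction",
}
print(json.dumps(conjecture, indent=2))
```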
14 conjectures across 7 experiments:
| ID | Title | Status |
|---|---|---|
| C-1..4 | Layer role mapping (L0-7 drift, L12-17 sweet spot, L18-22 formatting, L23-27 distortion) | confirmed |
| C-5 | Residual norms grow monotonically | refuted |
| C-6 | Multi-layer outperforms single-layer | refuted |
| C-7 | Verbosity peaks in deep_semantics (L12-17) | refuted |
| C-8 | Steerability = stability + SNR, not norm | confirmed |
| C-9 | L12 injection matches prompt-level control | refuted |
| C-10..12 | WordNet relational geometry (hypernym/antonym/meronym) | refuted |
| C-13 | No relational structure at L0-7 | confirmed |
| C-14 | Alpha parity is asymmetric (terse vs verbose) | open |
Refutations are informative: C-6 showed single-layer beats multi-layer, C-7 revealed norm ≠ steerability, C-9 established that α=2.0 overshoots prompt control, C-10..12 found no stable relational geometry at 0.6B scale.
```
Qwen3-0.6B (751M params, distilled from Qwen3-32B)
├── Layers: 28
├── d_model: 1024
├── Attention: 16 query heads, 8 KV heads (GQA)
├── FFN: SwiGLU, d_ff=4096
├── Norm: RMSNorm
├── Position: RoPE
├── Vocab: 151,936 tokens
└── Thinking: enable_thinking=False (disabled for style eval)
```
```shell
# Static plots (matplotlib)
uv run python viz/residual_landscape.py         # 2D activation space with style basins
uv run python viz/layer_anatomy.py              # 28-layer diagram with norms and SNR
uv run python viz/alpha_phase_diagram.py        # Dead zone → effective → collapse
uv run python viz/lens_contamination_radar.py   # 12-lens spider chart
uv run python viz/cprr_board.py                 # Conjecture status dashboard
uv run python viz/experiment_progression.py     # Phase timeline with metrics

# Interactive (pygame)
uv run python viz/steering_pygame.py            # Drag steering vector, watch particles shift
```

- Turner et al. (2023) — Activation Addition: Steering Language Models Without Optimization
- Rimsky et al. (2024) — Steering Llama 2 via Contrastive Activation Addition
- Konen et al. (2024, EACL) — Style Vectors for Steering Generative Large Language Models
- Jorgensen et al. (2023) — Improving Activation Steering in Language Models with Mean-Centring
- Park et al. (2023) — The Linear Representation Hypothesis
- Elhage et al. (2022) — Toy Models of Superposition
- Arditi et al. (2024) — Refusal in Language Models Is Mediated by a Single Direction
See reading-list.org for the full 30+ paper bibliography.
MIT
