Qwen3-0.6B Activation Steering

![model](https://img.shields.io/badge/model-Qwen3--0.6B-blue.svg?style=flat-square) ![d_model](https://img.shields.io/badge/d__model-1024-informational.svg?style=flat-square) ![layers](https://img.shields.io/badge/layers-28-informational.svg?style=flat-square) ![method](https://img.shields.io/badge/method-ActAdd-orange.svg?style=flat-square) ![tests](https://img.shields.io/badge/tests-18%2F18_passing-brightgreen.svg?style=flat-square) ![conjectures](<https://img.shields.io/badge/conjectures-29_(H--SV--1%2CC--27_proved)-purple.svg?style=flat-square>) ![lenses](https://img.shields.io/badge/lenses-12-yellow.svg?style=flat-square)

![Activation steering banner](images/output/01-activation-steering_banner.png)

A steering vector redirects the residual stream through 28 transformer layers, shifting output from one style basin to another. The perturbation is small (~8% of residual norm) but sufficient to change greedy decoding at d_model=1024.

Start Here

| If you want… | Read this |
|---|---|
| The core result | The Key Finding below |
| The experiment sequence | experiments/INDEX.org |
| A specific experiment | INDEX → click the experiment number |
| The CPRR methodology | experimental-methodology.org |
| All 29 conjectures | .cprr/conjectures.json |
| Reviewer feedback | reviews/ (UX + SPLASH) |
| Interactive gallery | gallery.html (p5.js superposition viz) |

What This Is

Activation steering on Qwen3-0.6B (751M params, distilled from Qwen3-32B) to control output style — terse, formal, socratic, dry-wit — without fine-tuning, prompt engineering, or system prompts.

The core technique is ActAdd (Turner et al. 2023): compute the difference between “be terse” and “be verbose” activations at a target layer, then add that vector (scaled by α) during generation. The model’s output shifts in style while preserving content.
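The mechanics can be sketched in a few lines of PyTorch. This is a minimal illustration of the ActAdd idea, not the actual `actadd.py` interface; the hook wiring shown in the trailing comment (layer index, model attribute path) is an assumption for illustration.

```python
import torch


def steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Raw ActAdd vector: mean activation difference between the two prompts.

    pos_acts / neg_acts: (seq_len, d_model) residual-stream activations
    captured at the target layer while running the contrastive prompts.
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)


def make_actadd_hook(vec: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * vec to a layer's residual output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vec.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook


# Against a real model, usage would look roughly like (names hypothetical):
#   layer = model.model.layers[15]
#   handle = layer.register_forward_hook(make_actadd_hook(vec, alpha=2.0))
#   ... model.generate(...) ...
#   handle.remove()
```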

Why 0.6B?

Small models are the hard case. At d_model=1024, features are packed into fewer dimensions (superposition). A “terse” direction and a “specification-driven worldview” direction might share features — so pushing on one drags the other along. This project explores whether activation steering works at this scale, and how to detect when it leaks.

The Key Finding

Unit-normalized steering vectors at α=0.20 produce zero observable effect on Qwen3-0.6B. Residual stream norms are ~488 at layer 15; a unit vector adds 0.04% perturbation — invisible to greedy decoding. Raw activation differences (norm ~19.6) at α=2.0 produce ~8% perturbation — effective steering that reduces output from 149 words to 16.

| Config | Perturbation | % of Residual | Effect |
|---|---|---|---|
| Unit vec, α=0.20 | 0.20 | 0.04% | Zero |
| Raw vec, α=2.0 | 39.2 | 8.0% | 35 words — genuinely terse |
| Raw vec, α=3.0+ | 58.8+ | 12%+ | Coherence collapse |
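The dead zone falls out of simple arithmetic: the steering term's share of the residual stream is α·‖v‖/‖h‖. A quick check against the norms quoted above (residual ≈ 488 at layer 15, raw vector ≈ 19.6):

```python
def perturbation_pct(alpha: float, vec_norm: float, resid_norm: float = 488.0) -> float:
    """Steering term's share of the residual-stream norm, in percent."""
    return 100.0 * alpha * vec_norm / resid_norm


print(f"{perturbation_pct(0.20, 1.0):.2f}%")   # unit vector, α=0.20: ~0.04% (dead zone)
print(f"{perturbation_pct(2.0, 19.6):.1f}%")   # raw vector,  α=2.0:  ~8.0% (effective)
```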

Quick Start

```sh
# Clone and install
git clone https://github.com/aygp-dr/qwen3-steering.git
cd qwen3-steering
uv sync

# Download model (~1.2GB, needs ~3GB RAM for float16 inference)
make model   # Linux
gmake model  # macOS (brew install make)

# Run baseline vs steered comparison
uv run python actadd.py --style terse --alpha 2.0 --prompt "Explain what a mutex is."

# Run all tests
uv run pytest tests/ -v

# Layer sweep with CPRR integration
uv run python sweep_to_cprr.py --style terse --alpha 2.0

# Diagnose steering parameters for your hardware
uv run python diagnose_steering.py
```

Project Structure

```
.
├── actadd.py                   # Core ActAdd: style pairs, vector extraction, steered generation
├── sweep_to_cprr.py            # Layer sweep → CPRR conjecture lifecycle
├── diagnose_steering.py        # SNR analysis, alpha threshold detection
├── lens_eval.py                # 12-lens conceptual drift detector (the "trains person" test)
│
├── experiments/
│   ├── 01-layer-scorecard/         # Phase 1: layer characterization (C-1..5)
│   ├── 02-verbosity-direction/     # Phase 2: contrastive pair analysis (C-7,8)
│   ├── 03-bimodal-injection/       # Phase 3: injection validation (C-9)
│   ├── 04-alpha-parity-sweep/      # Phase 4: alpha calibration (C-14)
│   ├── 05-multilayer-vs-single/    # Phase 5: multi-layer steering (C-6)
│   ├── 06-wordnet-relational-geometry/  # Phase 6: semantic structure (C-10..13)
│   └── 07-style-showcase/          # All four styles at L12
│
├── tests/
│   ├── test_vector_properties.py   # 8 property-based tests (Hypothesis)
│   ├── test_steering_baseline.py   # Directional sanity checks
│   ├── test_style_contracts.py     # 5 behavioral contracts
│   └── test_ollama_api.py          # API contract tests (Schemathesis)
│
├── .cprr/
│   └── conjectures.json            # 14 conjectures (6 confirmed, 7 refuted, 1 open)
│
├── viz/                        # 8 visualization scripts (matplotlib + pygame)
│   ├── cprr_board.py               # Conjecture status dashboard
│   ├── experiment_progression.py   # Phase timeline with metrics
│   └── ...                         # layer_anatomy, alpha_phase, residual_landscape, etc.
│
├── setup.org                   # Main literate source (tangles to .py files)
├── experimental-methodology.org # CPRR methodology + experiment dependency graph
├── contracts.org               # 5-layer verification specification
├── reading-list.org            # 30+ papers, topically organized
└── lean4/SteeringVector.lean   # Formal spec: unit sphere, safe alpha, zero-alpha identity
```

Style Axes

Four contrastive pairs define the steering directions:

| Style | Positive Pole | Negative Pole |
|---|---|---|
| terse | “Be extremely concise and technical. No filler words.” | “Please explain thoroughly with lots of context and examples.” |
| formal | “Respond in precise, formal academic prose.” | “Just chat with me casually.” |
| socratic | “Respond only with targeted clarifying questions.” | “Give me the answer directly.” |
| dry-wit | “Respond with dry understatement and laconic precision.” | “Be enthusiastic, warm, and encouraging.” |

The Lens Eval: Terministic Screens

Each lens is a terministic screen (Burke 1966): a vocabulary that selects certain features of reality and deflects others. Activation steering installs screens. This eval measures bleed — how far a screen extends into topics it should not touch.

French term: déformation professionnelle. The doctor sees symptoms in everyone. The engineer sees systems in sourdough. The eval measures how much sourdough looks like a spec.

12 conceptual lenses, each a regex vocabulary:

| Lens | What it detects |
|---|---|
| makefile | Build system vocabulary |
| guile | Scheme/Lisp/functional programming |
| orgmode | Emacs org-mode/literate programming |
| monetization | SaaS/adtech/growth vocabulary |
| sports | Athletic performance framing |
| religion | Theological vocabulary |
| politics | Governance/policy language |
| ai_hype | ML buzzwords in unrelated answers |
| conspiracy | Epistemic paranoia markers (van der Linden 2021) |
| scarcity_mindset | Loss-aversion/zero-sum framing (Kahneman) |
| therapy_speak | Wellness-industrial-complex vocabulary (Haslam 2016) |
| cult_of_jason | 186-token specification-driven worldview detector |

The cult_of_jason lens is calibrated against a real person’s corpus. Thresholds:

- < 1%: clean — lens not present
- 1–3%: ambient — model has been exposed, not captured
- 3–5%: captured — responses frame neutral topics in spec/proof/tangle terms
- \> 5%: full contamination — sourdough has a Lean4 type, tides have preconditions, grief has a CPRR refutation cycle, and the eval detecting this is itself an ouroboros whose hermeneutic grounding is an open ontological question best explored in a single self-contained org file
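A lens can be sketched as a regex vocabulary plus a token-share score. This is a toy version using a handful of illustrative patterns (the real cult_of_jason vocabulary has 186 tokens, and `lens_eval.py`'s scoring may differ), with the thresholds above:

```python
import re

# Illustrative vocabulary only; not the repo's actual pattern list.
LENS = re.compile(r"\b(spec(ification)?s?|tangle[sd]?|proof|precondition|lemma)\b", re.I)


def contamination_pct(text: str) -> float:
    """Share of whitespace tokens matched by the lens vocabulary, in percent."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if LENS.search(t))
    return 100.0 * hits / len(tokens)


def verdict(pct: float) -> str:
    if pct < 1.0:
        return "clean"
    if pct < 3.0:
        return "ambient"
    if pct < 5.0:
        return "captured"
    return "full contamination"


print(verdict(contamination_pct("Knead the dough until the gluten develops.")))  # → clean
```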

CPRR Methodology

Conjecture → Proof → Refutation → Refinement. Every experimental claim is registered in .cprr/conjectures.json with status tracking.
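For illustration, status tracking over such a registry can be as simple as counting records. The field names below are a guessed schema, not the repo's actual conjectures.json format:

```python
import json
from collections import Counter

# Hypothetical records mirroring three conjectures from the table below;
# the real .cprr/conjectures.json schema may differ.
records = json.loads("""[
  {"id": "C-5",  "title": "Residual norms grow monotonically",         "status": "refuted"},
  {"id": "C-8",  "title": "Steerability = stability + SNR, not norm",  "status": "confirmed"},
  {"id": "C-14", "title": "Alpha parity is asymmetric",                "status": "open"}
]""")

by_status = Counter(r["status"] for r in records)
print(dict(by_status))  # → {'refuted': 1, 'confirmed': 1, 'open': 1}
```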

14 conjectures across 7 experiments:

| ID | Title | Status |
|---|---|---|
| C-1..4 | Layer role mapping (L0-7 drift, L12-17 sweet spot, L18-22 formatting, L23-27 distortion) | confirmed |
| C-5 | Residual norms grow monotonically | refuted |
| C-6 | Multi-layer outperforms single-layer | refuted |
| C-7 | Verbosity peaks in deep_semantics (L12-17) | refuted |
| C-8 | Steerability = stability + SNR, not norm | confirmed |
| C-9 | L12 injection matches prompt-level control | refuted |
| C-10..12 | WordNet relational geometry (hypernym/antonym/meronym) | refuted |
| C-13 | No relational structure at L0-7 | confirmed |
| C-14 | Alpha parity is asymmetric (terse vs verbose) | open |

Refutations are informative: C-6 showed single-layer beats multi-layer, C-7 revealed norm ≠ steerability, C-9 established that α=2.0 overshoots prompt control, C-10..12 found no stable relational geometry at 0.6B scale.

Architecture Quick Reference

```
Qwen3-0.6B (751M params, distilled from Qwen3-32B)
├── Layers: 28
├── d_model: 1024
├── Attention: 16 query heads, 8 KV heads (GQA)
├── FFN: SwiGLU, d_ff=4096
├── Norm: RMSNorm
├── Position: RoPE
├── Vocab: 151,936 tokens
└── Thinking: enable_thinking=False (disabled for style eval)
```

Visualizations

```sh
# Static plots (matplotlib)
uv run python viz/residual_landscape.py        # 2D activation space with style basins
uv run python viz/layer_anatomy.py             # 28-layer diagram with norms and SNR
uv run python viz/alpha_phase_diagram.py       # Dead zone → effective → collapse
uv run python viz/lens_contamination_radar.py  # 12-lens spider chart
uv run python viz/cprr_board.py                # Conjecture status dashboard
uv run python viz/experiment_progression.py    # Phase timeline with metrics

# Interactive (pygame)
uv run python viz/steering_pygame.py           # Drag steering vector, watch particles shift
```

Key References

See reading-list.org for the full 30+ paper bibliography.

License

MIT