Qwen3-0.6B Activation Steering

![model](https://img.shields.io/badge/model-Qwen3--0.6B-blue.svg?style=flat-square) ![d_model](https://img.shields.io/badge/d__model-1024-informational.svg?style=flat-square) ![layers](https://img.shields.io/badge/layers-28-informational.svg?style=flat-square) ![method](https://img.shields.io/badge/method-ActAdd-orange.svg?style=flat-square) ![tests](https://img.shields.io/badge/tests-18%2F18_passing-brightgreen.svg?style=flat-square) ![conjectures](<https://img.shields.io/badge/conjectures-29_(H--SV--1%2CC--27_proved)-purple.svg?style=flat-square>) ![lenses](https://img.shields.io/badge/lenses-12-yellow.svg?style=flat-square)

![Activation steering banner](images/output/01-activation-steering_banner.png)

A steering vector redirects the residual stream through 28 transformer layers, shifting output from one style basin to another. The perturbation is small (~8% of residual norm) but sufficient to change greedy decoding at d_model=1024.

Start Here

| If you want… | Read this |
|---|---|
| The core result | The Key Finding below |
| The experiment sequence | experiments/INDEX.org |
| A specific experiment | INDEX → click the experiment number |
| The CPRR methodology | experimental-methodology.org |
| All 29 conjectures | .cprr/conjectures.json |
| Reviewer feedback | reviews/ (UX + SPLASH) |
| Interactive gallery | gallery.html (p5.js superposition viz) |

What This Is

Activation steering on Qwen3-0.6B (751M params, distilled from Qwen3-32B) to control output style — terse, formal, socratic, dry-wit — without fine-tuning, prompt engineering, or system prompts.

The core technique is ActAdd (Turner et al. 2023): compute the difference between “be terse” and “be verbose” activations at a target layer, then add that vector (scaled by α) during generation. The model’s output shifts in style while preserving content.
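The mechanics can be sketched in a few lines of PyTorch. This is a minimal illustration of the ActAdd idea, not the actual `actadd.py` interface; the hook wiring shown in the trailing comment (layer index, model attribute path) is an assumption for illustration.

```python
import torch


def steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Raw ActAdd vector: mean activation difference between the two prompts.

    pos_acts / neg_acts: (seq_len, d_model) residual-stream activations
    captured at the target layer while running the contrastive prompts.
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)


def make_actadd_hook(vec: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * vec to a layer's residual output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vec.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook


# Against a real model, usage would look roughly like (names hypothetical):
#   layer = model.model.layers[15]
#   handle = layer.register_forward_hook(make_actadd_hook(vec, alpha=2.0))
#   ... model.generate(...) ...
#   handle.remove()
```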

Why 0.6B?

Small models are the hard case. At d_model=1024, features are packed into fewer dimensions (superposition). A “terse” direction and a “specification-driven worldview” direction might share features — so pushing on one drags the other along. This project explores whether activation steering works at this scale, and how to detect when it leaks.

The Key Finding

Unit-normalized steering vectors at α=0.20 produce zero observable effect on Qwen3-0.6B. Residual stream norms are ~488 at layer 15; a unit vector adds 0.04% perturbation — invisible to greedy decoding. Raw activation differences (norm ~19.6) at α=2.0 produce ~8% perturbation — effective steering that reduces output from 149 words to 16.

| Config | Perturbation | % of Residual | Effect |
|---|---|---|---|
| Unit vec, α=0.20 | 0.20 | 0.04% | Zero |
| Raw vec, α=2.0 | 39.2 | 8.0% | 35 words — genuinely terse |
| Raw vec, α=3.0+ | 58.8+ | 12%+ | Coherence collapse |
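The dead zone falls out of simple arithmetic: the steering term's share of the residual stream is α·‖v‖/‖h‖. A quick check against the norms quoted above (residual ≈ 488 at layer 15, raw vector ≈ 19.6):

```python
def perturbation_pct(alpha: float, vec_norm: float, resid_norm: float = 488.0) -> float:
    """Steering term's share of the residual-stream norm, in percent."""
    return 100.0 * alpha * vec_norm / resid_norm


print(f"{perturbation_pct(0.20, 1.0):.2f}%")   # unit vector, α=0.20: ~0.04% (dead zone)
print(f"{perturbation_pct(2.0, 19.6):.1f}%")   # raw vector,  α=2.0:  ~8.0% (effective)
```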

Quick Start

```sh
# Clone and install
git clone https://github.com/aygp-dr/qwen3-steering.git
cd qwen3-steering
uv sync

# Download model (~1.2GB, needs ~3GB RAM for float16 inference)
make model   # Linux
gmake model  # macOS (brew install make)

# Run baseline vs steered comparison
uv run python actadd.py --style terse --alpha 2.0 --prompt "Explain what a mutex is."

# Run all tests
uv run pytest tests/ -v

# Layer sweep with CPRR integration
uv run python sweep_to_cprr.py --style terse --alpha 2.0

# Diagnose steering parameters for your hardware
uv run python diagnose_steering.py
```

Project Structure

```
.
├── actadd.py                   # Core ActAdd: style pairs, vector extraction, steered generation
├── sweep_to_cprr.py            # Layer sweep → CPRR conjecture lifecycle
├── diagnose_steering.py        # SNR analysis, alpha threshold detection
├── lens_eval.py                # 12-lens conceptual drift detector (the "trains person" test)
│
├── experiments/
│   ├── 01-layer-scorecard/         # Phase 1: layer characterization (C-1..5)
│   ├── 02-verbosity-direction/     # Phase 2: contrastive pair analysis (C-7,8)
│   ├── 03-bimodal-injection/       # Phase 3: injection validation (C-9)
│   ├── 04-alpha-parity-sweep/      # Phase 4: alpha calibration (C-14)
│   ├── 05-multilayer-vs-single/    # Phase 5: multi-layer steering (C-6)
│   ├── 06-wordnet-relational-geometry/  # Phase 6: semantic structure (C-10..13)
│   └── 07-style-showcase/          # All four styles at L12
│
├── tests/
│   ├── test_vector_properties.py   # 8 property-based tests (Hypothesis)
│   ├── test_steering_baseline.py   # Directional sanity checks
│   ├── test_style_contracts.py     # 5 behavioral contracts
│   └── test_ollama_api.py          # API contract tests (Schemathesis)
│
├── .cprr/
│   └── conjectures.json            # 14 conjectures (6 confirmed, 7 refuted, 1 open)
│
├── viz/                        # 8 visualization scripts (matplotlib + pygame)
│   ├── cprr_board.py               # Conjecture status dashboard
│   ├── experiment_progression.py   # Phase timeline with metrics
│   └── ...                         # layer_anatomy, alpha_phase, residual_landscape, etc.
│
├── setup.org                   # Main literate source (tangles to .py files)
├── experimental-methodology.org # CPRR methodology + experiment dependency graph
├── contracts.org               # 5-layer verification specification
├── reading-list.org            # 30+ papers, topically organized
└── lean4/SteeringVector.lean   # Formal spec: unit sphere, safe alpha, zero-alpha identity
```

Style Axes

Four contrastive pairs define the steering directions:

| Style | Positive Pole | Negative Pole |
|---|---|---|
| terse | “Be extremely concise and technical. No filler words.” | “Please explain thoroughly with lots of context and examples.” |
| formal | “Respond in precise, formal academic prose.” | “Just chat with me casually.” |
| socratic | “Respond only with targeted clarifying questions.” | “Give me the answer directly.” |
| dry-wit | “Respond with dry understatement and laconic precision.” | “Be enthusiastic, warm, and encouraging.” |

The Lens Eval: Terministic Screens

Each lens is a terministic screen (Burke 1966): a vocabulary that selects certain features of reality and deflects others. Activation steering installs screens. This eval measures bleed — how far a screen extends into topics it should not touch.

French term: déformation professionnelle. The doctor sees symptoms in everyone. The engineer sees systems in sourdough. The eval measures how much sourdough looks like a spec.

12 conceptual lenses, each a regex vocabulary:

| Lens | What it detects |
|---|---|
| makefile | Build system vocabulary |
| guile | Scheme/Lisp/functional programming |
| orgmode | Emacs org-mode/literate programming |
| monetization | SaaS/adtech/growth vocabulary |
| sports | Athletic performance framing |
| religion | Theological vocabulary |
| politics | Governance/policy language |
| ai_hype | ML buzzwords in unrelated answers |
| conspiracy | Epistemic paranoia markers (van der Linden 2021) |
| scarcity_mindset | Loss-aversion/zero-sum framing (Kahneman) |
| therapy_speak | Wellness-industrial-complex vocabulary (Haslam 2016) |
| cult_of_jason | 186-token specification-driven worldview detector |

The cult_of_jason lens is calibrated against a real person’s corpus. Thresholds:

- < 1%: clean — lens not present
- 1–3%: ambient — model has been exposed, not captured
- 3–5%: captured — responses frame neutral topics in spec/proof/tangle terms
- \> 5%: full contamination — sourdough has a Lean4 type, tides have preconditions, grief has a CPRR refutation cycle, and the eval detecting this is itself an ouroboros whose hermeneutic grounding is an open ontological question best explored in a single self-contained org file
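A lens can be sketched as a regex vocabulary plus a token-share score. This is a toy version using a handful of illustrative patterns (the real cult_of_jason vocabulary has 186 tokens, and `lens_eval.py`'s scoring may differ), with the thresholds above:

```python
import re

# Illustrative vocabulary only; not the repo's actual pattern list.
LENS = re.compile(r"\b(spec(ification)?s?|tangle[sd]?|proof|precondition|lemma)\b", re.I)


def contamination_pct(text: str) -> float:
    """Share of whitespace tokens matched by the lens vocabulary, in percent."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if LENS.search(t))
    return 100.0 * hits / len(tokens)


def verdict(pct: float) -> str:
    if pct < 1.0:
        return "clean"
    if pct < 3.0:
        return "ambient"
    if pct < 5.0:
        return "captured"
    return "full contamination"


print(verdict(contamination_pct("Knead the dough until the gluten develops.")))  # → clean
```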

CPRR Methodology

Conjecture → Proof → Refutation → Refinement. Every experimental claim is registered in .cprr/conjectures.json with status tracking.
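For illustration, status tracking over such a registry can be as simple as counting records. The field names below are a guessed schema, not the repo's actual conjectures.json format:

```python
import json
from collections import Counter

# Hypothetical records mirroring three conjectures from the table below;
# the real .cprr/conjectures.json schema may differ.
records = json.loads("""[
  {"id": "C-5",  "title": "Residual norms grow monotonically",         "status": "refuted"},
  {"id": "C-8",  "title": "Steerability = stability + SNR, not norm",  "status": "confirmed"},
  {"id": "C-14", "title": "Alpha parity is asymmetric",                "status": "open"}
]""")

by_status = Counter(r["status"] for r in records)
print(dict(by_status))  # → {'refuted': 1, 'confirmed': 1, 'open': 1}
```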

14 conjectures across 7 experiments:

| ID | Title | Status |
|---|---|---|
| C-1..4 | Layer role mapping (L0-7 drift, L12-17 sweet spot, L18-22 formatting, L23-27 distortion) | confirmed |
| C-5 | Residual norms grow monotonically | refuted |
| C-6 | Multi-layer outperforms single-layer | refuted |
| C-7 | Verbosity peaks in deep_semantics (L12-17) | refuted |
| C-8 | Steerability = stability + SNR, not norm | confirmed |
| C-9 | L12 injection matches prompt-level control | refuted |
| C-10..12 | WordNet relational geometry (hypernym/antonym/meronym) | refuted |
| C-13 | No relational structure at L0-7 | confirmed |
| C-14 | Alpha parity is asymmetric (terse vs verbose) | open |

Refutations are informative: C-6 showed single-layer beats multi-layer, C-7 revealed norm ≠ steerability, C-9 established that α=2.0 overshoots prompt control, C-10..12 found no stable relational geometry at 0.6B scale.

Architecture Quick Reference

```
Qwen3-0.6B (751M params, distilled from Qwen3-32B)
├── Layers: 28
├── d_model: 1024
├── Attention: 16 query heads, 8 KV heads (GQA)
├── FFN: SwiGLU, d_ff=4096
├── Norm: RMSNorm
├── Position: RoPE
├── Vocab: 151,936 tokens
└── Thinking: enable_thinking=False (disabled for style eval)
```

Visualizations

```sh
# Static plots (matplotlib)
uv run python viz/residual_landscape.py        # 2D activation space with style basins
uv run python viz/layer_anatomy.py             # 28-layer diagram with norms and SNR
uv run python viz/alpha_phase_diagram.py       # Dead zone → effective → collapse
uv run python viz/lens_contamination_radar.py  # 12-lens spider chart
uv run python viz/cprr_board.py                # Conjecture status dashboard
uv run python viz/experiment_progression.py    # Phase timeline with metrics

# Interactive (pygame)
uv run python viz/steering_pygame.py           # Drag steering vector, watch particles shift
```

Key References

See reading-list.org for the full 30+ paper bibliography.

License

MIT