Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.
Updated Mar 18, 2026 - Python
Reproducible case study of pitfalls in contrastive SAE discovery and steering for "consciousness" features (GemmaScope SAEs, Gemma 3 4B/12B): reconstruction confound, delta-steering fix, matched controls, and false-positive scaling law vs dataset size.
Finding SAE features in Gemma 3 vision-language model — autonomous AI vs human-guided AI comparison using GemmaScope 2
Explore limitations of contrastive SAE steering in identifying causal consciousness features and introduce delta-steering to improve experiment validity.