Investigating how SAE features exhibit different semantic behaviors at intermediate intervention magnitudes through systematic threshold analysis.
This research began as an investigation into function calling interpretability but pivoted to focus on SAE feature behavior after discovering methodological challenges in the original approach. The current work examines how features progress through different semantic stages as intervention magnitude increases.
Research performed as an independent study during Summer 2025.
The repository and code were co-developed using Cursor with Claude Sonnet and GPT o3.
RQ1: Magnitude-Dependent Semantic Progression. How do SAE features behave at intermediate activation magnitudes between zero and maximum?
RQ2: Cross-Prompt Feature Consistency. Do the same features produce similar semantic effects across different input prompts?
RQ3: Feature Robustness and Degradation. Is there a consistent pattern of degradation as feature steering magnitude increases?
Our analysis of 82 SAE features across 8 prompts with 16 intervention magnitudes revealed:
- Semantic Progression: Features exhibit an average of 4 distinct behavioral stages before degradation
- Example Progression: Feature 2022 ("Barack Obama") progresses through emoji usage → US history → fitness tracking → Obama content
- Moderate Consistency: Cosine similarity between original interpretations and intervention-derived descriptions averages 0.34
- Weak Sparsity Correlation: Feature robustness weakly correlates with sparsity (ρ = 0.281, p < 0.05); both metrics are illustrated in the sketch after this list
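For concreteness, the two summary statistics above can be computed with NumPy and SciPy as in the minimal sketch below. The embeddings and scores here are random placeholders rather than the project's data; in the actual pipeline the vectors would come from a text-embedding model applied to the feature descriptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two description-embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Placeholder (original explanation, intervention-derived description)
# embedding pairs; real vectors would come from a text-embedding model.
pairs = [(rng.normal(size=384), rng.normal(size=384)) for _ in range(82)]
sims = [cosine_similarity(a, b) for a, b in pairs]
print(f"mean interpretation consistency: {np.mean(sims):.2f}")

# Robustness vs. sparsity: Spearman rank correlation over features.
robustness = rng.random(82)
sparsity = rng.random(82)
rho, p = spearmanr(robustness, sparsity)
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")
```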
- Gemma 2B IT
- Gemma 2 9B IT
We use Sparse Autoencoders (SAEs) to analyze feature activations in Gemma-2B-IT during systematic intervention sweeps, examining how different activation magnitudes affect internal representations and model output.
```python
# Simple analysis interface
from src import ModelManager, Experimenter

mm = ModelManager("gemma-2b-it")
exp = Experimenter(mm)

# Analyze feature activations
result = exp.analyze("Hello world", top_k=10)

# Systematic threshold sweep
sweep_results = exp.sweep_thresholds(
    prompts=["Hello", "The", "How"],
    feature_ids=[2022, 1899],
    magnitudes=[0, 5, 10, 15],
)
```

Four core components handle the experimental pipeline:
- ModelManager: Model/SAE loading with automatic explanation caching and device management
- Activator: SAE activation analysis and baseline activation collection (batch-friendly)
- Intervener: Feature intervention mechanics with error-preserving edits (batch-friendly); see the steering sketch after this list
- Experimenter: Workflow orchestration (inspection, comparison, sweeps, CSV export)
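To make the Intervener's mechanics concrete, here is a minimal sketch of magnitude-based feature steering with a TransformerLens forward hook: the SAE decoder direction for one feature is scaled by the intervention magnitude and added to the residual stream. It assumes `model` and `sae` are already loaded (see the loading sketch near the tech stack list below); the project's actual error-preserving edit logic lives in src/intervener.py.

```python
from functools import partial

def steer_feature(resid, hook, *, sae, feature_id, magnitude):
    """Add `magnitude` units of one SAE feature's decoder direction
    to the residual stream at this hook point."""
    return resid + magnitude * sae.W_dec[feature_id]

# Assumes `model` (transformer_lens.HookedTransformer) and `sae`
# (sae_lens.SAE) are loaded as sketched near the tech stack list.
hook_fn = partial(steer_feature, sae=sae, feature_id=2022, magnitude=10.0)
with model.hooks(fwd_hooks=[(sae.cfg.hook_name, hook_fn)]):
    steered = model.generate("Hello world", max_new_tokens=30)
```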
```bash
# Run the setup script
./setup.sh
```

Open notebooks/feature_thresh_experimenter.ipynb for the main analysis framework.
```
src/                                    # Analysis framework
├── model_manager.py                    # Model/SAE loading with caching
├── activator.py                        # SAE activation analysis
├── intervener.py                       # Feature intervention mechanics
├── experimenter.py                     # Workflow orchestration
└── utils.py                            # Data manipulation helpers
scripts/                                # Utility scripts and tools
├── check_seq_batch.py                  # Batch processing utilities
└── changes_descriptor.py               # Automated change description generation
notebooks/
├── feature_thresh_experimenter.ipynb   # Main experiment
├── feature_thresh_analyzer.ipynb       # Main analysis
└── archive/                            # Original function calling experiments
data/                                   # Function definitions and test cases
results/                                # Experimental results and analysis
```
- SAE-Lens + TransformerLens for model analysis
- Gemma-2B-IT with pre-trained SAEs (Joseph Bloom's release)
- Neuronpedia API for feature explanations
- Claude 3.5 Haiku for automated change description generation
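As a reference for the stack above, a minimal loading sketch with SAE-Lens and TransformerLens. The release and SAE id are my assumption about which of Joseph Bloom's pretrained Gemma SAEs is used; check the SAE-Lens pretrained-SAE directory for the exact identifiers.

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# Load the instruction-tuned Gemma model analyzed in this project.
model = HookedTransformer.from_pretrained("gemma-2b-it")

# Load a matching pretrained SAE. The release/SAE id here are assumed;
# SAE-Lens tutorials return (sae, cfg_dict, sparsity) from this call.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-2b-it-res-jb",
    sae_id="blocks.12.hook_resid_post",
)
```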
This project builds upon tutorials and documentation from the SAE-Lens and Neuronpedia communities:
- SAE-Lens Tutorial 2.0: SAE Lens + Neuronpedia Tutorial - Core tutorial for SAE analysis and Neuronpedia integration
- Loading and Analyzing SAEs: SAE-Lens Basic Loading Tutorial - Basic SAE loading and analysis patterns
- Training SAEs: Training a Sparse Autoencoder with SAELens - SAE training methodology
- Logit Lens with Features: Understanding SAE Features with the Logit Lens - Feature analysis methods
- SAE-Lens: GitHub Repository - Sparse Autoencoder analysis library
- TransformerLens: GitHub Repository - Transformer interpretability tools
- Neuronpedia: Website - Public database of SAE features and explanations
- Sparse Autoencoders: Towards Monosemanticity - Foundational work on sparse dictionary learning for neural networks