Investigating how SAE features exhibit different semantic behaviors at intermediate intervention magnitudes through systematic threshold analysis.
This research began as an investigation into function calling interpretability but pivoted to focus on SAE feature behavior after discovering methodological challenges in the original approach. The current work examines how features progress through different semantic stages as intervention magnitude increases.
Research performed as an independent study during Summer 2025.
The repository and code were co-developed using Cursor with Claude Sonnet and GPT o3.
RQ1: Magnitude-Dependent Semantic Progression. How do SAE features behave at intermediate activation magnitudes between zero and maximum?
RQ2: Cross-Prompt Feature Consistency. Do the same features produce similar semantic effects across different input prompts?
RQ3: Feature Robustness and Degradation. Is there a consistent pattern of degradation as feature steering magnitude increases?
Our analysis of 82 SAE features across 8 prompts with 16 intervention magnitudes revealed:
- Semantic Progression: Features exhibit an average of 4 distinct behavioral stages before degradation
- Example Progression: Feature 2022 ("Barack Obama") progresses through emoji usage → US history → fitness tracking → Obama content
- Moderate Consistency: Cosine similarity between original interpretations and intervention-derived descriptions averages 0.34
- Weak Sparsity Correlation: Feature robustness weakly correlates with sparsity (ρ = 0.281, p < 0.05); both metrics are illustrated in the sketch after this list
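For concreteness, the two summary statistics above can be computed with NumPy and SciPy as in the minimal sketch below. The embeddings and scores here are random placeholders rather than the project's data; in the actual pipeline the vectors would come from a text-embedding model applied to the feature descriptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two description-embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Placeholder (original explanation, intervention-derived description)
# embedding pairs; real vectors would come from a text-embedding model.
pairs = [(rng.normal(size=384), rng.normal(size=384)) for _ in range(82)]
sims = [cosine_similarity(a, b) for a, b in pairs]
print(f"mean interpretation consistency: {np.mean(sims):.2f}")

# Robustness vs. sparsity: Spearman rank correlation over features.
robustness = rng.random(82)
sparsity = rng.random(82)
rho, p = spearmanr(robustness, sparsity)
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")
```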
- Gemma 2B IT
- Gemma 2 9B IT
We use Sparse Autoencoders (SAEs) to analyze feature activations in Gemma-2B-IT during systematic intervention sweeps, examining how different activation magnitudes affect internal representations and model output.
```python
# Simple analysis interface
from src import ModelManager, Experimenter

mm = ModelManager("gemma-2b-it")
exp = Experimenter(mm)

# Analyze feature activations
result = exp.analyze("Hello world", top_k=10)

# Systematic threshold sweep
sweep_results = exp.sweep_thresholds(
    prompts=["Hello", "The", "How"],
    feature_ids=[2022, 1899],
    magnitudes=[0, 5, 10, 15],
)
```

Four core components handle the experimental pipeline:
- ModelManager: Model/SAE loading with automatic explanation caching and device management
- Activator: SAE activation analysis and baseline activation collection (batch-friendly)
- Intervener: Feature intervention mechanics with error-preserving edits (batch-friendly); see the steering sketch after this list
- Experimenter: Workflow orchestration (inspection, comparison, sweeps, CSV export)
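To make the Intervener's mechanics concrete, here is a minimal sketch of magnitude-based feature steering with a TransformerLens forward hook: the SAE decoder direction for one feature is scaled by the intervention magnitude and added to the residual stream. It assumes `model` and `sae` are already loaded (see the loading sketch near the tech stack list below); the project's actual error-preserving edit logic lives in src/intervener.py.

```python
from functools import partial

def steer_feature(resid, hook, *, sae, feature_id, magnitude):
    """Add `magnitude` units of one SAE feature's decoder direction
    to the residual stream at this hook point."""
    return resid + magnitude * sae.W_dec[feature_id]

# Assumes `model` (transformer_lens.HookedTransformer) and `sae`
# (sae_lens.SAE) are loaded as sketched near the tech stack list.
hook_fn = partial(steer_feature, sae=sae, feature_id=2022, magnitude=10.0)
with model.hooks(fwd_hooks=[(sae.cfg.hook_name, hook_fn)]):
    steered = model.generate("Hello world", max_new_tokens=30)
```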
```bash
# Run the setup script
./setup.sh
```

Open notebooks/feature_thresh_experimenter.ipynb for the main analysis framework.
```
src/                                    # Analysis framework
├── model_manager.py                    # Model/SAE loading with caching
├── activator.py                        # SAE activation analysis
├── intervener.py                       # Feature intervention mechanics
├── experimenter.py                     # Workflow orchestration
└── utils.py                            # Data manipulation helpers
scripts/                                # Utility scripts and tools
├── check_seq_batch.py                  # Batch processing utilities
└── changes_descriptor.py               # Automated change description generation
notebooks/
├── feature_thresh_experimenter.ipynb   # Main experiment
├── feature_thresh_analyzer.ipynb       # Main analysis
└── archive/                            # Original function calling experiments
data/                                   # Function definitions and test cases
results/                                # Experimental results and analysis
```
- SAE-Lens + TransformerLens for model analysis
- Gemma-2B-IT with pre-trained SAEs (Joseph Bloom's release)
- Neuronpedia API for feature explanations
- Claude 3.5 Haiku for automated change description generation
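As a reference for the stack above, a minimal loading sketch with SAE-Lens and TransformerLens. The release and SAE id are my assumption about which of Joseph Bloom's pretrained Gemma SAEs is used; check the SAE-Lens pretrained-SAE directory for the exact identifiers.

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# Load the instruction-tuned Gemma model analyzed in this project.
model = HookedTransformer.from_pretrained("gemma-2b-it")

# Load a matching pretrained SAE. The release/SAE id here are assumed;
# SAE-Lens tutorials return (sae, cfg_dict, sparsity) from this call.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-2b-it-res-jb",
    sae_id="blocks.12.hook_resid_post",
)
```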
This project builds upon tutorials and documentation from the SAE-Lens and Neuronpedia communities:
- SAE-Lens Tutorial 2.0: SAE Lens + Neuronpedia Tutorial - Core tutorial for SAE analysis and Neuronpedia integration
- Loading and Analyzing SAEs: SAE-Lens Basic Loading Tutorial - Basic SAE loading and analysis patterns
- Training SAEs: Training a Sparse Autoencoder with SAELens - SAE training methodology
- Logit Lens with Features: Understanding SAE Features with the Logit Lens - Feature analysis methods
- SAE-Lens: GitHub Repository - Sparse Autoencoder analysis library
- TransformerLens: GitHub Repository - Transformer interpretability tools
- Neuronpedia: Website - Public database of SAE features and explanations
- Sparse Autoencoders: Towards Monosemanticity - Foundational work on sparse dictionary learning for neural networks