
Exploring semantic progression in Sparse Autoencoders (SAE), done for independent study coursework at CMU


Semantic Progression Across Sparse Autoencoder Feature Steering Levels

Investigating how SAE features exhibit different semantic behaviors at intermediate intervention magnitudes through systematic threshold analysis.

This research began as an investigation into function calling interpretability but pivoted to focus on SAE feature behavior after discovering methodological challenges in the original approach. The current work examines how features progress through different semantic stages as intervention magnitude increases.

Research performed as an independent study during Summer 2025.

The repository and code are co-developed using Cursor with Claude Sonnet and GPT o3.

Research Questions

RQ1: Magnitude-Dependent Semantic Progression
How do SAE features behave at intermediate activation magnitudes between zero and maximum?

RQ2: Cross-Prompt Feature Consistency
Do the same features produce similar semantic effects across different input prompts?

RQ3: Feature Robustness and Degradation
Is there a pattern of degradation when performing feature steering?

Key Findings

Our analysis of 82 SAE features across 8 prompts with 16 intervention magnitudes revealed:

  • Semantic Progression: Features exhibit an average of 4 distinct behavioral stages before degradation
  • Example Progression: Feature 2022 ("Barack Obama") progresses through emoji usage → US history → fitness tracking → Obama content
  • Moderate Consistency: Cosine similarity between original interpretations and intervention-derived descriptions averages 0.34
  • Weak Sparsity Correlation: Feature robustness weakly correlates with sparsity (ρ = 0.281, p < 0.05); a sketch of how both metrics can be computed follows this list
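For intuition, both metrics above can be reproduced with off-the-shelf tools. The sketch below assumes sentence-transformers embeddings and SciPy's Spearman test; the embedding model, example strings, and values are illustrative, not the repository's exact pipeline.

# Hedged sketch of the consistency (RQ2) and robustness (RQ3) metrics.
# The embedding model and all values are illustrative assumptions.
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Cosine similarity between a feature's original interpretation and a
# description derived from its intervention outputs.
orig = embedder.encode(["references to Barack Obama"])
derived = embedder.encode(["discussion of Obama-era US politics"])
consistency = cosine_similarity(orig, derived)[0, 0]

# Spearman correlation between per-feature sparsity and robustness
# (e.g., number of behavioral stages before degradation).
sparsity = [0.01, 0.05, 0.10, 0.20]   # hypothetical per-feature values
robustness = [3, 4, 5, 6]
rho, p_value = spearmanr(sparsity, robustness)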

Data / Results

Gemma 2B IT

Gemma 2 9B IT

Approach

We use Sparse Autoencoders (SAEs) to analyze feature activations in Gemma-2B-IT during systematic intervention sweeps, examining how different activation magnitudes affect internal representations and model output.

# Simple analysis interface
from src import ModelManager, Experimenter

mm = ModelManager("gemma-2b-it")
exp = Experimenter(mm)

# Analyze feature activations
result = exp.analyze("Hello world", top_k=10)

# Systematic threshold sweep
sweep_results = exp.sweep_thresholds(
    prompts=["Hello", "The", "How"],
    feature_ids=[2022, 1899],
    magnitudes=[0, 5, 10, 15]
)

Architecture

Four core components handle the experimental pipeline:

  • ModelManager: Model/SAE loading with automatic explanation caching and device management
  • Activator: SAE activation analysis and baseline activation collection (batch-friendly)
  • Intervener: Feature intervention mechanics with error-preserving edits (batch-friendly); the core steering operation is sketched after this list
  • Experimenter: Workflow orchestration (inspection, comparison, sweeps, CSV export)
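Conceptually, the steering operation the Intervener wraps is a single residual-stream edit: add the feature's SAE decoder direction, scaled by the intervention magnitude, at the hooked layer. The sketch below uses TransformerLens hooks and an SAE-Lens SAE; the release name, hook point, and feature id are assumptions, not the repository's exact code.

# Minimal steering sketch, assuming TransformerLens + SAE-Lens; the
# release name, hook point, and feature id are illustrative assumptions.
from sae_lens import SAE
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b-it")
sae, _, _ = SAE.from_pretrained(
    release="gemma-2b-it-res-jb",        # Joseph Bloom's Gemma-2B-IT SAEs
    sae_id="blocks.12.hook_resid_post",  # assumed hook point
)

def steering_hook(feature_id, magnitude):
    direction = sae.W_dec[feature_id]    # decoder direction for the feature
    def hook(resid, hook):
        # Add magnitude * feature direction to the residual stream.
        return resid + magnitude * direction
    return hook

with model.hooks(fwd_hooks=[("blocks.12.hook_resid_post", steering_hook(2022, 10.0))]):
    text = model.generate("Hello world", max_new_tokens=30)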

Environment Setup

# Run the setup script
./setup.sh

Open notebooks/feature_thresh_experimenter.ipynb for the main analysis framework.

Project Structure

src/                        # Analysis framework
├── model_manager.py        # Model/SAE loading with caching
├── activator.py            # SAE activation analysis
├── intervener.py           # Feature intervention mechanics
├── experimenter.py         # Workflow orchestration
└── utils.py                # Data manipulation helpers

scripts/                    # Utility scripts and tools
├── check_seq_batch.py      # Batch processing utilities
└── changes_descriptor.py   # Automated change description generation

notebooks/
├── feature_thresh_experimenter.ipynb  # Main experiment
├── feature_thresh_analyzer.ipynb      # Main analysis 
└── archive/                # Original function calling experiments

data/                       # Function definitions and test cases
results/                    # Experimental results and analysis

Technical Stack

  • SAE-Lens + TransformerLens for model analysis
  • Gemma-2B-IT with pre-trained SAEs (Joseph Bloom's release)
  • Neuronpedia API for feature explanations (a fetch sketch follows this list)
  • Claude 3.5 Haiku for automated change description generation
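For reference, explanations can be pulled from Neuronpedia's public feature endpoint, as in the SAE-Lens tutorials. The identifiers below are illustrative and the response shape may change.

# Hedged sketch of fetching a feature explanation from Neuronpedia; the
# model/SAE identifiers are assumptions and the API may change.
import requests

model_id, sae_id, feature_id = "gemma-2b-it", "12-res-jb", 2022
url = f"https://www.neuronpedia.org/api/feature/{model_id}/{sae_id}/{feature_id}"
data = requests.get(url).json()
descriptions = [e["description"] for e in data.get("explanations", [])]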

References

This project builds upon tutorials and documentation from the SAE-Lens and Neuronpedia communities:


Core Libraries

  • SAE-Lens: Sparse Autoencoder analysis library (GitHub)
  • TransformerLens: Transformer interpretability tools (GitHub)
  • Neuronpedia: Public database of SAE features and explanations (website)

Key Papers

  • Sparse Autoencoders: Towards Monosemanticity - Foundational work on sparse dictionary learning for neural networks
