
Understanding Failures of Steering Vectors

Author: Joschka Braun
Description: This repository contains the code for my master's thesis research, which investigates why steering vectors are unreliable for many concepts.

Project Overview

This research explores the reliability of steering vectors in language models, focusing on understanding why they work well for some concepts but fail for others. The project includes various experiments that analyze activation patterns, convergence properties, and the effects of training data on steerability.

Project Structure

The repository is organized as follows:

.
├── config/             # Configuration files for experiments
├── data/               # Data storage for experiment results
├── datasets/           # Datasets used in experiments
│   ├── anthropic_evals/  # Anthropic evaluation datasets
│   ├── caa_datasets/     # Contrastive Activation Addition (CAA) datasets
│   └── random/           # Random datasets for control experiments
├── notebooks/          # Jupyter notebooks for exploratory analysis and visualization
├── scripts/            # Supporting scripts for data preparation and preprocessing
├── src/                # Core source code
│   ├── __init__.py
│   ├── experiments/    # Main experiment implementations
│   │   ├── anthropic_evals/      # Experiments using Anthropic evaluation datasets
│   │   ├── caa/                  # Contrastive Activation Addition (CAA) experiments
│   │   ├── contrastive_free_form/ # Free-form contrastive experiments
│   │   └── reliability_paper/    # Experiments for reliability research
│   └── utils/          # Utility functions and helpers
│       ├── plotting_functions/   # Functions for visualization
│       ├── compute_steering_vectors.py
│       ├── compute_dimensionality_reduction.py
│       ├── evaluation_utils.py
│       └── ... (other utility modules)
├── tests/              # Tests for steering functionalities
├── pyproject.toml      # Poetry configuration and dependencies
├── poetry.lock         # Locked dependencies
├── LICENSE             # License for the repository
└── README.md           # Project documentation (this file)

Key Experiments

The codebase includes several experiment types:

  1. Reliability Analysis: Investigating why steering vectors are reliable for some concepts but not others
  2. Convergence Analysis: Studying how steering vectors converge with different amounts of training data
  3. Contrastive Activation Addition (CAA): Analyzing activation patterns on CAA datasets across different models
  4. Anthropic Evaluations: Evaluating steering performance using Anthropic datasets
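
The core CAA-style computation underlying these experiments — a steering vector built as the mean difference between activations on contrastive prompt pairs — can be sketched as follows. This is a minimal NumPy illustration on synthetic activations, not the repository's actual implementation (the real experiments extract activations from a language model layer, e.g. via the steering-vectors package):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for residual-stream activations at one layer:
# 50 contrastive prompt pairs, hidden size 16 (illustrative numbers only).
n_pairs, d_model = 50, 16
pos_acts = rng.normal(loc=1.0, size=(n_pairs, d_model))   # activations on "positive" prompts
neg_acts = rng.normal(loc=-1.0, size=(n_pairs, d_model))  # activations on "negative" prompts

# CAA-style steering vector: mean of the paired activation differences.
steering_vector = (pos_acts - neg_acts).mean(axis=0)

# At inference time the vector is added to the residual stream,
# scaled by a steering strength (often called the multiplier).
alpha = 2.0
steered_acts = neg_acts + alpha * steering_vector
```

Whether this vector actually shifts model behavior in the intended direction, and for which concepts, is exactly what the reliability experiments probe.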

Installation

This project uses Python 3.12 and Poetry for dependency management. Follow these steps to set up the environment:

  1. Clone the repository:

    git clone https://github.com/JoschkaCBraun/steering-vector-reliability.git
    cd steering-vector-reliability
  2. Install dependencies using Poetry:

    poetry install
  3. Activate the virtual environment:

    poetry shell
  4. Set up environment variables: Create a .env file in the root directory with your API keys:

    HUGGINGFACE_TOKEN=your_huggingface_token
    OPENAI_API_KEY=your_openai_api_key
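
The scripts read these variables from the process environment. As a stdlib-only sketch (the project may well use python-dotenv or similar instead; `load_env_file` is an illustrative helper, not part of this codebase), loading such a .env file looks like:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Load simple KEY=value lines from a .env file into os.environ.

    Stdlib-only illustration; variables already set in the environment win.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```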
    

Running Experiments

Most experiments can be run using Python scripts in the src/experiments/ directory. For example:

# Run contrastive activation analysis
python src/experiments/caa/plot_contrastive_activations_for_3_models_for_caa_dataset.py

# Analyze convergence of steering vectors
python src/experiments/reliability_paper/plot_convergence_of_steering_vectors/plot_convergence_of_steering_vectors.py

Many scripts accept command-line arguments to control sample sizes and other parameters.
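
The typical pattern is standard argparse; the flag names below are hypothetical, not the actual CLI of any script in this repository:

```python
# Hypothetical sketch of the argparse pattern such experiment scripts use.
# The flags (--num_samples, --layer) are illustrative, not the real interface.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Steering-vector experiment (sketch)")
    parser.add_argument("--num_samples", type=int, default=100,
                        help="number of contrastive pairs to use")
    parser.add_argument("--layer", type=int, default=13,
                        help="model layer to extract activations from")
    return parser.parse_args(argv)

args = parse_args(["--num_samples", "250"])
print(args.num_samples, args.layer)  # → 250 13
```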

Dependencies

Key dependencies include:

  • PyTorch
  • Transformers (Hugging Face)
  • steering-vectors
  • NumPy, Pandas, Matplotlib
  • scikit-learn

Dev dependencies include pytest, mypy, black, ruff, and more, as specified in pyproject.toml.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Repository for paper "Understanding (Un)Reliability of Steering Vectors in Language Models" by Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov.
