Author: Joschka Braun
Description: This repository contains the code for my master's thesis research, which investigates why steering vectors are unreliable for many concepts.
This research explores the reliability of steering vectors in language models, focusing on understanding why they work well for some concepts but fail for others. The project includes various experiments that analyze activation patterns, convergence properties, and the effects of training data on steerability.
The repository is organized as follows:
.
├── config/ # Configuration files for experiments
├── data/ # Data storage for experiment results
├── datasets/ # Datasets used in experiments
│ ├── anthropic_evals/ # Anthropic evaluation datasets
│ ├── caa_datasets/ # Contrastive activation analysis datasets
│ └── random/ # Random datasets for control experiments
├── notebooks/ # Jupyter notebooks for exploratory analysis and visualization
├── scripts/ # Supporting scripts for data preparation and preprocessing
├── src/ # Core source code
│ ├── __init__.py
│ ├── experiments/ # Main experiment implementations
│ │ ├── anthropic_evals/ # Experiments using Anthropic evaluation datasets
│ │ ├── caa/ # Contrastive activation analysis experiments
│ │ ├── contrastive_free_form/ # Free-form contrastive experiments
│ │ └── reliability_paper/ # Experiments for reliability research
│ └── utils/ # Utility functions and helpers
│ ├── plotting_functions/ # Functions for visualization
│ ├── compute_steering_vectors.py
│ ├── compute_dimensionality_reduction.py
│ ├── evaluation_utils.py
│ └── ... (other utility modules)
├── tests/ # Tests for steering functionalities
├── pyproject.toml # Poetry configuration and dependencies
├── poetry.lock # Locked dependencies
├── LICENSE # License for the repository
└── README.md # Project documentation (this file)
The codebase includes several experiment types:
- Reliability Analysis: Investigating why steering vectors are reliable for some concepts but not others
- Convergence Analysis: Studying how steering vectors converge with different amounts of training data
- Contrastive Activation Analysis (CAA): Analyzing activation patterns across different models
- Anthropic Evaluations: Evaluating steering performance using Anthropic datasets
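The core idea behind the contrastive experiments can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repository's actual implementation: the array names, shapes, and random activations below are stand-ins for real hidden-state activations extracted from a model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for one layer's hidden-state activations:
# (n_pairs, hidden_dim) activations for positive and negative prompts.
n_pairs, hidden_dim = 64, 16
pos_acts = rng.normal(loc=0.5, size=(n_pairs, hidden_dim))
neg_acts = rng.normal(loc=-0.5, size=(n_pairs, hidden_dim))

# A contrastive steering vector is the mean activation difference
# between the positive and negative prompt sets.
steering_vector = (pos_acts - neg_acts).mean(axis=0)

# At inference time, the vector is added (with a scaling factor)
# to the residual stream to steer the model's behavior.
scale = 1.0
steered_acts = neg_acts + scale * steering_vector

print(steering_vector.shape)
```

Reliability and convergence questions then reduce to how stable this mean-difference vector is across concepts and across training-set sizes.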
This project uses Python 3.12 and Poetry for dependency management. Follow these steps to set up the environment:
1. Clone the repository:

   ```bash
   git clone https://github.com/JoschkaCBraun/steering-vector-reliability.git
   cd steering-vector-reliability
   ```

2. Install dependencies using Poetry:

   ```bash
   poetry install
   ```

3. Activate the virtual environment:

   ```bash
   poetry shell
   ```

4. Set up environment variables: create a `.env` file in the root directory with your API keys:

   ```
   HUGGINGFACE_TOKEN=your_huggingface_token
   OPENAI_API_KEY=your_openai_api_key
   ```
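A `.env` file of this shape can be loaded with a few lines of standard-library Python. This is a minimal sketch (the project may instead use a library such as python-dotenv); the `demo.env` file and `DEMO_HF_TOKEN` key below are throwaway examples, not the repository's actual configuration.

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: reads KEY=value lines, skipping blanks
    and '#' comments; existing environment variables are not overwritten."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file (the real .env holds your actual keys).
with open("demo.env", "w") as f:
    f.write("DEMO_HF_TOKEN=demo_value\n")
load_dotenv("demo.env")
print(os.environ["DEMO_HF_TOKEN"])
```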
Most experiments can be run using Python scripts in the src/experiments/ directory. For example:

```bash
# Run contrastive activation analysis
python src/experiments/caa/plot_contrastive_activations_for_3_models_for_caa_dataset.py

# Analyze convergence of steering vectors
python src/experiments/reliability_paper/plot_convergence_of_steering_vectors/plot_convergence_of_steering_vectors.py
```

Many scripts accept command-line arguments to control sample sizes and other parameters.
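A typical argument-handling pattern in such scripts looks like the sketch below. The flag names (`--num_samples`, `--model_name`) are hypothetical placeholders; consult each script's own argument parser for the options it actually accepts.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical flags for illustration; each script defines its own.
    parser = argparse.ArgumentParser(description="Steering-vector experiment")
    parser.add_argument("--num_samples", type=int, default=500,
                        help="Number of contrastive pairs to use")
    parser.add_argument("--model_name", type=str, default="gpt2",
                        help="Hugging Face model identifier")
    return parser.parse_args(argv)

# Simulate: python some_experiment.py --num_samples 100
args = parse_args(["--num_samples", "100"])
print(args.num_samples)
```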
Key dependencies include:
- PyTorch
- Transformers (Hugging Face)
- steering-vectors
- NumPy, Pandas, Matplotlib
- scikit-learn
Dev dependencies include pytest, mypy, black, ruff, and more, as specified in pyproject.toml.
This project is licensed under the MIT License. See the LICENSE file for details.