Author: Joschka Braun
Description: This repository contains the code for my master's thesis research, which investigates why steering vectors are unreliable for many concepts.
This research explores the reliability of steering vectors in language models, focusing on understanding why they work well for some concepts but fail for others. The project includes various experiments that analyze activation patterns, convergence properties, and the effects of training data on steerability.
The repository is organized as follows:
.
├── config/ # Configuration files for experiments
├── data/ # Data storage for experiment results
├── datasets/ # Datasets used in experiments
│ ├── anthropic_evals/ # Anthropic evaluation datasets
│ ├── caa_datasets/ # Contrastive activation analysis datasets
│ └── random/ # Random datasets for control experiments
├── notebooks/ # Jupyter notebooks for exploratory analysis and visualization
├── scripts/ # Supporting scripts for data preparation and preprocessing
├── src/ # Core source code
│ ├── __init__.py
│ ├── experiments/ # Main experiment implementations
│ │ ├── anthropic_evals/ # Experiments using Anthropic evaluation datasets
│ │ ├── caa/ # Contrastive activation analysis experiments
│ │ ├── contrastive_free_form/ # Free-form contrastive experiments
│ │ └── reliability_paper/ # Experiments for reliability research
│ └── utils/ # Utility functions and helpers
│ ├── plotting_functions/ # Functions for visualization
│ ├── compute_steering_vectors.py
│ ├── compute_dimensionality_reduction.py
│ ├── evaluation_utils.py
│ └── ... (other utility modules)
├── tests/ # Tests for steering functionalities
├── pyproject.toml # Poetry configuration and dependencies
├── poetry.lock # Locked dependencies
├── LICENSE # License for the repository
└── README.md # Project documentation (this file)
The codebase includes several experiment types:
- Reliability Analysis: Investigating why steering vectors are reliable for some concepts but not others
- Convergence Analysis: Studying how steering vectors converge with different amounts of training data
- Contrastive Activation Analysis (CAA): Analyzing activation patterns across different models
- Anthropic Evaluations: Evaluating steering performance using Anthropic datasets
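The core idea behind the contrastive experiments can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repository's actual implementation: the array names, shapes, and random activations below are stand-ins for real hidden-state activations extracted from a model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for one layer's hidden-state activations:
# (n_pairs, hidden_dim) activations for positive and negative prompts.
n_pairs, hidden_dim = 64, 16
pos_acts = rng.normal(loc=0.5, size=(n_pairs, hidden_dim))
neg_acts = rng.normal(loc=-0.5, size=(n_pairs, hidden_dim))

# A contrastive steering vector is the mean activation difference
# between the positive and negative prompt sets.
steering_vector = (pos_acts - neg_acts).mean(axis=0)

# At inference time, the vector is added (with a scaling factor)
# to the residual stream to steer the model's behavior.
scale = 1.0
steered_acts = neg_acts + scale * steering_vector

print(steering_vector.shape)
```

Reliability and convergence questions then reduce to how stable this mean-difference vector is across concepts and across training-set sizes.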
This project uses Python 3.12 and Poetry for dependency management. Follow these steps to set up the environment:
1. Clone the repository:

   ```bash
   git clone https://github.com/JoschkaCBraun/steering-vector-reliability.git
   cd steering-vector-reliability
   ```

2. Install dependencies using Poetry:

   ```bash
   poetry install
   ```

3. Activate the virtual environment:

   ```bash
   poetry shell
   ```

4. Set up environment variables: create a `.env` file in the root directory with your API keys:

   ```
   HUGGINGFACE_TOKEN=your_huggingface_token
   OPENAI_API_KEY=your_openai_api_key
   ```
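A `.env` file of this shape can be loaded with a few lines of standard-library Python. This is a minimal sketch (the project may instead use a library such as python-dotenv); the `demo.env` file and `DEMO_HF_TOKEN` key below are throwaway examples, not the repository's actual configuration.

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: reads KEY=value lines, skipping blanks
    and '#' comments; existing environment variables are not overwritten."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file (the real .env holds your actual keys).
with open("demo.env", "w") as f:
    f.write("DEMO_HF_TOKEN=demo_value\n")
load_dotenv("demo.env")
print(os.environ["DEMO_HF_TOKEN"])
```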
Most experiments can be run using Python scripts in the src/experiments/ directory. For example:

```bash
# Run contrastive activation analysis
python src/experiments/caa/plot_contrastive_activations_for_3_models_for_caa_dataset.py

# Analyze convergence of steering vectors
python src/experiments/reliability_paper/plot_convergence_of_steering_vectors/plot_convergence_of_steering_vectors.py
```

Many scripts accept command-line arguments to control sample sizes and other parameters.
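A typical argument-handling pattern in such scripts looks like the sketch below. The flag names (`--num_samples`, `--model_name`) are hypothetical placeholders; consult each script's own argument parser for the options it actually accepts.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical flags for illustration; each script defines its own.
    parser = argparse.ArgumentParser(description="Steering-vector experiment")
    parser.add_argument("--num_samples", type=int, default=500,
                        help="Number of contrastive pairs to use")
    parser.add_argument("--model_name", type=str, default="gpt2",
                        help="Hugging Face model identifier")
    return parser.parse_args(argv)

# Simulate: python some_experiment.py --num_samples 100
args = parse_args(["--num_samples", "100"])
print(args.num_samples)
```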
Key dependencies include:
- PyTorch
- Transformers (Hugging Face)
- steering-vectors
- NumPy, Pandas, Matplotlib
- scikit-learn
Dev dependencies include pytest, mypy, black, ruff, and more, as specified in pyproject.toml.
This project is licensed under the MIT License. See the LICENSE file for details.