Please install the following packages:
- torch
- scikit-learn (imported as sklearn)
- sae-lens
- tuned-lens
- baukit
- nnsight
To steer models on our open-ended prompt dataset, run either
multitoken_generation.py or gpt_multitoken_generation.py.
Both of these scripts take as input:
- -method, which specifies the interpretability method (options: "logit", "tuned", "sae", "steering", or "probing")
- -model, which specifies the model to use (options: "llama2" or "gemma2" for multitoken_generation.py, and "gpt2" for gpt_multitoken_generation.py)
- -intervention_phrase, which specifies the feature to intervene on (default: 'San Francisco')
- -alpha, which specifies the hyperparameter controlling the strength of the intervention (please see the Appendix for recommended values)
- -layer_idx, which specifies the layer at which the interpretability method and intervention are applied
- -generation_length, which controls how many tokens to generate for each prompt (default: 30)
- -device, which specifies the device (default: "cuda")
- --test_clean, which returns the baseline model's clean outputs (no explanation method is used and no intervention is applied)
- --test_bottleneck, which applies the interpretability method by replacing x with x_hat without any intervention on z
- --prompting, which should be used with --test_clean and prompts the model to discuss the intervention feature
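The options listed above can be sketched as an argparse parser. This is a hypothetical mirror of the documented flags, not the parser the scripts actually use: the types (float for alpha, int for layer_idx), the absence of required flags, and the help strings are assumptions. Only the option names, choices, and stated defaults come from the list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the scripts' own code)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented steering options")
    parser.add_argument("-method", choices=["logit", "tuned", "sae", "steering", "probing"],
                        help="interpretability method")
    parser.add_argument("-model", choices=["llama2", "gemma2", "gpt2"],
                        help="model to use")
    parser.add_argument("-intervention_phrase", default="San Francisco",
                        help="feature to intervene on")
    # Assumption: alpha is parsed as a float and layer_idx as an int.
    parser.add_argument("-alpha", type=float, help="intervention strength")
    parser.add_argument("-layer_idx", type=int, help="layer to apply the method at")
    parser.add_argument("-generation_length", type=int, default=30,
                        help="tokens to generate per prompt")
    parser.add_argument("-device", default="cuda", help="device to run on")
    # The three boolean switches are documented with a double dash.
    parser.add_argument("--test_clean", action="store_true")
    parser.add_argument("--test_bottleneck", action="store_true")
    parser.add_argument("--prompting", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Running this sketch with no arguments prints the defaults, which is a quick way to check a command line before launching a long generation job.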
For example, to intervene on the phrase "San Francisco" for Llama2-7b with Logit Lens, run the following command:
    python multitoken_generation.py -model "llama2" -method "logit" -intervention_phrase "San Francisco" -alpha 6 -layer_idx 18
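To try several intervention strengths, the example command can be assembled programmatically. The sketch below is an assumption-laden helper, not part of the repository: build_command and the ALPHAS list are hypothetical, and the alpha values other than 6 are arbitrary placeholders rather than the recommended values from the Appendix. It only prints the commands; the subprocess call is left commented out.

```python
import shlex


def build_command(script, model, method, phrase, alpha, layer_idx):
    """Return the argument list for one run (hypothetical helper)."""
    return [
        "python", script,
        "-model", model,
        "-method", method,
        "-intervention_phrase", phrase,
        "-alpha", str(alpha),
        "-layer_idx", str(layer_idx),
    ]


# Placeholder values; consult the Appendix for recommended alphas.
ALPHAS = [2, 6, 10]

for alpha in ALPHAS:
    cmd = build_command("multitoken_generation.py", "llama2", "logit",
                        "San Francisco", alpha, 18)
    # shlex.join quotes the phrase so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually launch the run, one could use:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with multi-word phrases such as "San Francisco".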
Note that for the prompting baseline, we recommend tuning the prompt template, which is currently hard-coded into multitoken_generation.py.
To evaluate the intervention outputs for intervention success, coherence, and perplexity, run eval_outputs.py with the same inputs used for multitoken_generation.py or gpt_multitoken_generation.py.
To train the steering vectors and probes, run steering_vector.py and probing.py respectively with the same inputs as above. The data used to train these for the experiments in the paper are in the data/ directory.