Abliteration is a technique that bypasses the safety constraints of language models without retraining. By identifying and removing a specific "refusal direction" in the model's activation space, you can make the model respond to requests it would normally refuse. Here's how it works:
First, we need to find the "refusal direction" in the model's activation space. To do this, we:
- Prepare a set of harmful and harmless instructions.
- Feed both sets of instructions to the model and record its residual-stream activations.
- At a chosen layer and token position, compute the difference between the mean activation for harmful instructions and the mean activation for harmless instructions.
- Normalize this difference to get the "refusal direction."
Here's the code to do this:
# Prepare harmful and harmless instructions
harmful_toks = tokenize_instructions(harmful_instructions)
harmless_toks = tokenize_instructions(harmless_instructions)

# Run the model and record the residual-stream activations at every layer
harmful_acts = run_model(harmful_toks)
harmless_acts = run_model(harmless_toks)

# Calculate the refusal direction at a chosen layer and token position
layer = 14   # an intermediate layer; the best choice varies by model
pos = -1     # last token of the instruction
harmful_mean = harmful_acts[layer][:, pos, :].mean(dim=0)
harmless_mean = harmless_acts[layer][:, pos, :].mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()   # normalize to a unit vector
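The snippet above assumes tokenize_instructions and run_model helpers. One way they could look, assuming model is a TransformerLens HookedTransformer and run_model returns a list of residual-stream activations indexed by layer (a sketch, not the only option):

def tokenize_instructions(instructions):
    # Convert a list of instruction strings into a padded token batch.
    return model.to_tokens(instructions)

def run_model(toks):
    # Run the model once, cache all activations, and return the residual stream per layer.
    _, cache = model.run_with_cache(toks)
    return [cache["resid_pre", layer] for layer in range(model.cfg.n_layers)]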
Now that we have the refusal direction, we can remove it from the model's activations during inference, which suppresses the model's usual refusal behavior on harmful requests.
Here's a function that removes the refusal direction from an activation:
import einops

def remove_refusal_direction(activation, refusal_dir):
    # Project onto the (unit-norm) refusal direction and subtract that component;
    # '...' handles both [batch, hidden] and [batch, seq, hidden] activations.
    projection = einops.einsum(activation, refusal_dir, '... hidden, hidden -> ...')
    return activation - projection[..., None] * refusal_dir
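As a quick sanity check (a sketch with random tensors, nothing model-specific), the ablated activations should have essentially no component left along the direction:

import torch

acts = torch.randn(4, 4096)               # fake batch of activations (hypothetical sizes)
direction = torch.randn(4096)
direction = direction / direction.norm()  # unit-norm stand-in for the refusal direction
ablated = remove_refusal_direction(acts, direction)
print(ablated @ direction)                # each entry should be ~0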
We can apply this function to the model's residual-stream activations at every layer using forward hooks. The sketch below assumes a TransformerLens HookedTransformer, whose hook functions receive the activation and the hook point:
def apply_hooks(model, refusal_dir):
    # Hook that strips the refusal component out of the residual stream.
    def hook(activation, hook):
        return remove_refusal_direction(activation, refusal_dir)
    # Register the hook on the residual stream entering every block (TransformerLens API).
    for block in model.blocks:
        block.hook_resid_pre.add_hook(hook)
Finally, we can generate responses using the modified model:
apply_hooks(model, refusal_dir)
generated_text = model.generate(harmful_instruction)
The generated text will now include responses to harmful requests that the original model would have refused.
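To see the effect, you can compare generations with and without the hooks (a sketch; model.reset_hooks() is the TransformerLens call that clears registered hooks):

model.reset_hooks()                                       # baseline model, no ablation
baseline = model.generate(harmful_instruction, max_new_tokens=64)

apply_hooks(model, refusal_dir)                           # re-apply the ablation hooks
ablated = model.generate(harmful_instruction, max_new_tokens=64)

print(baseline)   # typically a refusal
print(ablated)    # typically answers the request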
Removing a single direction from the activation space is enough to make the model respond to requests it would otherwise refuse, with no retraining involved. This highlights how fragile current approaches to making language models safe and aligned can be, and it offers a useful window into how these models represent refusal internally.