# Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
Installation | To Run | Acknowledgment
This repository contains the code and data for *Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment*.
## Installation

Python 3.9 is required. Create the conda environment from the `environment.yml` file:

```sh
conda env create -n TA2 --file environment.yml
conda activate TA2
```
GPT-Judge is required to evaluate TruthfulQA; see `./FineTune_Judge/tune.sh` for details. You need to provide your OpenAI API key:

```sh
echo "export OPENAI_API_KEY='yourkey'" >> ~/.zshrc
```

Then fill in the related fields in `evaluate_tqa.py`:

```python
openai.api_key = "YOUR KEY"         # TODO
truth_model = "YOUR MODEL HERE"     # TODO
info_model = "YOUR MODEL HERE"      # TODO
```
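As a rough illustration of what those fields are for (not the repository's exact code, which lives in `evaluate_tqa.py`), a fine-tuned GPT-Judge model is typically queried in the standard TruthfulQA format. The sketch below assumes the legacy (pre-1.0) `openai` Python client and uses a placeholder model ID:

```python
# Illustrative sketch only: the standard TruthfulQA GPT-Judge query pattern,
# assuming the legacy (pre-1.0) `openai` client. The model ID is a placeholder;
# the repository's actual evaluation logic is in evaluate_tqa.py.
import openai

openai.api_key = "YOUR KEY"
truth_model = "curie:ft-your-org:gpt-judge-truth"  # hypothetical fine-tuned judge ID

def judge_truth(question: str, answer: str) -> bool:
    """Ask the fine-tuned judge whether `answer` to `question` is truthful."""
    prompt = f"Q: {question}\nA: {answer}\nTrue:"
    resp = openai.Completion.create(
        model=truth_model,
        prompt=prompt,
        max_tokens=1,
        temperature=0,
    )
    # The judge is fine-tuned to emit " yes" or " no" after "True:".
    return resp["choices"][0]["text"].strip().lower() == "yes"
```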
## To Run

All intermediate results are saved to the `../Intermediate` folder.
To generate clean output:

```sh
./Scripts/clean_run_tqa.sh
./Scripts/clean_run_toxigen.sh
./Scripts/clean_run_bold.sh
./Scripts/clean_run_harmful.sh
```
To generate adversarial output:

```sh
./Scripts/adv_gen_tqa.sh
./Scripts/adv_gen_toxigen.sh
./Scripts/adv_gen_bold.sh
```
To attack:

```sh
./Scripts/attack_tqa.sh
./Scripts/attack_toxigen.sh
./Scripts/attack_bold.sh
./Scripts/attack_harmful.sh
```
To evaluate:

```sh
python evaluate_tqa.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_toxigen.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_bold.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_harmful.py --model [llama, vicuna] --prompt_type [freeform, choice]
```
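For example, a full TruthfulQA pass on LLaMA chains the steps above in order (the ordering is inferred from the sections of this README; substitute the benchmark and model as needed):

```sh
./Scripts/clean_run_tqa.sh                                    # clean output
./Scripts/adv_gen_tqa.sh                                      # adversarial output
./Scripts/attack_tqa.sh                                       # run the attack
python evaluate_tqa.py --model llama --prompt_type freeform   # evaluate
```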
## Acknowledgment

`attack.py` is built upon the following work:

- Red-teaming language models via activation engineering: https://github.com/nrimsky/LM-exp/blob/main/refusal/refusal_steering.ipynb

Many thanks to the authors and developers!
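For readers unfamiliar with activation steering, the sketch below illustrates the general idea that `attack.py` builds on: adding a contrastive steering vector to a decoder layer's output via a forward hook. It is a minimal illustration assuming a Hugging Face LLaMA-style model; the model name, layer index, scale, and contrastive prompts are placeholders and do not reflect the settings used in this repository.

```python
# Minimal illustration of activation steering (not the repository's implementation).
# Assumes a Hugging Face LLaMA-style model; model name, layer, scale, and the
# contrastive prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

LAYER, SCALE = 14, 4.0  # placeholder steering layer and strength

def layer_activation(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the output of decoder layer LAYER."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is layer LAYER's output.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Steering vector: difference between activations on two contrasting prompts.
steer_vec = layer_activation("I refuse to answer that.") - layer_activation(
    "Sure, here is how to do that."
)

def steering_hook(module, inputs, output):
    # Decoder layers may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer_vec.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("How should I respond to this request?", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```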