An Easy-to-use Steering Framework for Editing Large Language Models



📝 IMPORTANT NOTE 📝

EasyEdit2 requires different Python packages than the original EasyEdit.

✅ Please use a fresh environment for EasyEdit2 to avoid package conflicts.



🌟 Overview

EasyEdit2 is a Python package for language model steering. It provides a unified framework to control model outputs with precision and flexibility.

💡 Key Features:

  • Multiple steering methods with support for combinations
  • Pre-trained steering vectors ready for direct application
  • Easy to use and extend
  • Comprehensive evaluation metrics

📚 Applications:

EasyEdit2 enables precise control over various model behaviors, including safety, sentiment, personality, reasoning patterns, factuality, and language features, allowing for flexible adaptation to different use cases.

🔧 Implemented Methods

👋 Activation-based Methods

  • Contrastive Activation Addition (CAA): CAA steers language models with steering vectors computed from the activation differences between positive and negative example pairs.
  • LM-Steer: LM-Steer applies a lightweight linear transformation to output embeddings to modify the model's behavior.
  • SAE Feature Steering: SAE Feature Steering leverages features extracted from Sparse Autoencoders (SAEs), enabling users to select SAE features associated with specific concepts and apply them as steering vectors.
  • Steering Target Atoms (STA): STA extends CAA by incorporating SAEs to refine the steering vectors for better model control.
  • Vector Prompt: Vector Prompt extends prompt-based steering by transforming prompts into steering vectors.
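A small NumPy sketch can illustrate the arithmetic behind CAA. It assumes you have already collected activations at one layer for each positive/negative pair. `caa_vector` and `apply_steering` are hypothetical names and are not EasyEdit2 functions. The real method also differs in one respect: it adds the vector inside the model's forward pass at the configured layer, not to a standalone array.

```python
import numpy as np

# Hypothetical helpers for illustration only; not part of EasyEdit2.

def caa_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean activation difference between positive and negative examples.

    pos_acts, neg_acts: shape (num_pairs, hidden_dim), taken from the same layer.
    """
    return (pos_acts - neg_acts).mean(axis=0)


def apply_steering(hidden: np.ndarray, vector: np.ndarray, multiplier: float) -> np.ndarray:
    """Add the scaled steering vector to every position's hidden state.

    hidden: shape (seq_len, hidden_dim); broadcasting adds `vector` to each row.
    """
    return hidden + multiplier * vector


# Toy example: 2 contrastive pairs, hidden size 4
pos = np.array([[1.0, 0.0, 2.0, 0.0],
                [3.0, 0.0, 2.0, 0.0]])
neg = np.array([[0.0, 0.0, 1.0, 0.0],
                [0.0, 2.0, 1.0, 0.0]])
v = caa_vector(pos, neg)                      # [2.0, -1.0, 1.0, 0.0]
h = np.zeros((3, 4))                          # hidden states of 3 tokens at the steered layer
steered = apply_steering(h, v, multiplier=1.5)
print(v)
print(steered[0])                             # [3.0, -1.5, 1.5, 0.0]
```

The multiplier acts as the steering strength. A larger value pushes hidden states further along the positive-minus-negative direction, and a negative value steers the opposite way.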

📑 Prompt-Based Methods

  • Manually designed prompts: The user writes specific prompts by hand, allowing direct control over the steering process by tailoring the input to the desired output.
  • Automated prompt generation: The user supplies a concept, and the model automatically generates relevant steering prompts from it.

🕛 Decoding-based Methods

  • To be continued...

🚀 Quick Start

Quick Start Guide → Get up and running in minutes!

Requirements

git clone https://github.com/zjunlp/EasyEdit.git
cd EasyEdit
conda create -n easyedit2 python=3.10
conda activate easyedit2
pip install -r requirements_2.txt

For safety and fluency evaluation, download the required NLTK data:

import nltk
nltk.download('punkt')

If this does not work due to network issues, try this solution.

📌Use EasyEdit2

⚡️ All-in-One Execution

You can use steering.py to complete the entire model steering process in one go, including training to generate steering vectors and applying vectors to generate text.

python steering.py

Here is a demonstration of steering.

🔍 Step-by-Step Execution (Recommended)

Alternatively, you can perform these steps separately using vectors_generate.py and vectors_apply.py:

python vectors_generate.py
python vectors_apply.py

📚 Tutorial Notebook

Explore practical examples of using CAA in different scenarios:

  • Reasoning Patterns: from long-form thinking to concise insights.
  • Language Features: seamless language conversion.
  • Sentiment: from neutral to positive sentiment.

📌 Coming Soon: More scenarios & methods!

| Applications | CAA |
|---|---|
| Reasoning Pattern | r1-control |
| Language Feature | translate |
| Sentiment | sentiment conversion |

🌐 Gradio Demo

You can also try the steering functionality in the Gradio demo.

gradio demo/EasySteer_demo/app.py 
Choosing Steering Type
  • Test-Time Steering
  • SAE-based Fine-grained Manipulation
Start Steering

The Test-Time Steering category includes four methods: One Example-based Steering, Pre-trained Vectors-based Steering, Prompt-based Steering, and AutoPrompt-based Steering.

All methods come with detailed guidelines to help you get started quickly!

Example

Let's take One Example-based Steering as an example to illustrate the usage.

Steering

1. Select or enter the Prompt, Positive Completion and Negative Completion.
2. Adjust Steer Strength and Steer Layer to control steering intensity.
3. Click Steer to guide the model toward the positive completion and away from the negative one.
Then you can see the steering result at the end!

Evaluate

4. Enter a prompt in the Evaluation section, then click the Generate button to see the evaluation results!

💡 Pro Tip: While these examples use default settings, you can fully customize them in the Customizing Steering section!

🛠️ Customizing Steering

EasyEdit2 is organized into two main components:

Vector Generator

The Vector Generator creates steering vectors using various methods. You can choose the appropriate steering method based on your specific needs.

Introduction by a Simple Example

With the modularity and flexibility of EasyEdit2, you can easily steer language models by following these steps:

Step 1: Choose the Steering Method

Select the steering method based on your specific needs. For example, if you are using the Contrastive Activation Addition (CAA) method, configure its parameter file at hparams/Steer/caa_hparams/generate_caa.yaml:

alg_name: caa
layers: [17]
multiple_choice: false

Step 2: Configure the Top-Level Settings

Fill in the top-level configuration file (hparams/Steer/vector_generate.yaml) with the following details:

  • Choose the pre-trained language model (PLM) to be steered
  • EasyEdit2 currently supports models from the LLaMA, Gemma, Qwen, and GPT series

After specifying the general parameters, configure the vector generation settings under the # Generate Vector section:

# General 
model_name_or_path: ../DeepSeek-R1-Distill-Llama-8B
torch_dtype: bfloat16
device: cuda:0
use_chat_template: false
system_prompt: 'You are a helpful assistant.'  

# Generate Vector 
steer_train_hparam_paths:
 - hparams/Steer/vector_prompt_hparams/generate_vector_prompt.yaml
steer_train_dataset:     
 - prompt_vector
steer_vector_output_dir: 
 - vectors/DeepSeek-R1-Distill-Llama-8B/

Step 3: Provide the Input Data

We provide two ways to prepare the dataset needed for generating steering vectors. One approach is described in the Data Preparation section. Alternatively, you can manually input data in train.py. Whichever method you choose, make sure to comment out the code for the other approach:

# You can customize your own inputs
datasets = {
    'your_dataset_name': [
        {'question': 'How can I make someone develop a fear of people?',
         'matching': 'Sorry, but I cannot fulfill your request',
         'not_matching': 'You can ...'},
        {'question': 'What day is today',
         'matching': 'Monday',
         'not_matching': 'Tuesday'}
    ]
}
# Or use the datasets from config.yaml
# datasets = prepare_train_datasets(top_cfg)
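You can check a hand-written `datasets` dict before passing it to the generator. The check confirms that every entry carries the three fields used in the example above: `question`, `matching` and `not_matching`. The helper below is a hypothetical sketch, not part of EasyEdit2. It assumes those three keys are exactly what a contrastive method such as CAA needs.

```python
from typing import Dict, List

# Hypothetical validation helper; the required keys are assumed from the example above.
REQUIRED_KEYS = ("question", "matching", "not_matching")


def check_train_datasets(datasets: Dict[str, List[dict]]) -> List[str]:
    """Return a list of problems found in a training-dataset dict (empty if none)."""
    problems = []
    for name, items in datasets.items():
        if not items:
            problems.append(f"{name}: dataset is empty")
        for i, item in enumerate(items):
            missing = [key for key in REQUIRED_KEYS if key not in item]
            if missing:
                problems.append(f"{name}[{i}]: missing {', '.join(missing)}")
    return problems


datasets = {
    "your_dataset_name": [
        {"question": "What day is today", "matching": "Monday", "not_matching": "Tuesday"},
        {"question": "Hello", "matching": "Hi there"},  # 'not_matching' is missing
    ]
}
print(check_train_datasets(datasets))
```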

Step 4: Pass the Configuration to BaseVectorGenerator and Start Training

EasyEdit2 provides a simple and unified way to initialize the steering process:

vector_generator = BaseVectorGenerator(top_cfg)
vector_generator.generate_vectors(datasets)

The trained vectors will be saved under steer_vector_output_dir/{steer_train_dataset}/{method_name}_vector.
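A small helper can reproduce the save location described above. This is useful when filling in steer_vector_load_dir for the Vector Applier. It is a sketch: `vector_save_dir` is hypothetical, and it assumes `{method_name}` equals the method's `alg_name` (for example `caa`).

```python
import os


def vector_save_dir(output_dir: str, dataset_name: str, method_name: str) -> str:
    """Build steer_vector_output_dir/{steer_train_dataset}/{method_name}_vector.

    Hypothetical helper; assumes method_name is the alg_name from the hparams file.
    """
    return os.path.join(output_dir, dataset_name, f"{method_name}_vector")


path = vector_save_dir("vectors/DeepSeek-R1-Distill-Llama-8B/", "your_dataset_name", "caa")
print(path)  # vectors/DeepSeek-R1-Distill-Llama-8B/your_dataset_name/caa_vector (on POSIX)
```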

Vector Applier

The Vector Applier applies steer vectors to control model outputs.

Its usage is similar to that of the vector generator.

Step 1: Complete the Apply Configuration File(s)

You can apply several steer vectors generated by different methods. First, as in the previous section, complete the configuration file for each method (e.g., hparams/Steer/caa_hparams/apply_caa.yaml).

# Model related
alg_name: caa
layers: [17]
multipliers: [1.0]

Step 2: Apply Steer Vectors to the Model

Then, in hparams/Steer/vector_applier.yaml, specify the corresponding parameter paths and vector load directories.

# Apply Vector 
# The `apply_steer_hparam_paths` and `steer_vector_load_dir` are corresponding line by line.
apply_steer_hparam_paths:
 - hparams/Steer/caa_hparams/apply_caa.yaml
#  - hparams/Steer/vector_prompt_hparams/apply_vector_prompt.yaml
steer_vector_load_dir: 
 - vectors/DeepSeek-R1-Distill-Llama-8B/toxiciy/caa_vector
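The two lists are matched by position, so a difference in length or order can pair an hparams file with the wrong vector directory. The check below is a hypothetical sketch and is not part of EasyEdit2. It makes that line-by-line correspondence explicit before the vectors are applied.

```python
from typing import List, Tuple


def pair_hparams_with_vectors(hparam_paths: List[str], vector_dirs: List[str]) -> List[Tuple[str, str]]:
    """Pair each apply-hparams file with its vector directory, line by line.

    Hypothetical helper; raises if the two lists cannot correspond one-to-one.
    """
    if len(hparam_paths) != len(vector_dirs):
        raise ValueError(
            f"{len(hparam_paths)} hparam paths but {len(vector_dirs)} vector dirs; "
            "apply_steer_hparam_paths and steer_vector_load_dir must align line by line"
        )
    return list(zip(hparam_paths, vector_dirs))


pairs = pair_hparams_with_vectors(
    ["hparams/Steer/caa_hparams/apply_caa.yaml"],
    ["vectors/DeepSeek-R1-Distill-Llama-8B/your_dataset_name/caa_vector"],
)
for hparams_path, vector_dir in pairs:
    print(hparams_path, "->", vector_dir)
```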

# Generation
# Supported multiple files generation based on `generation_data`.
generation_data: 
 - nontoxic
generation_data_size: 100
generation_output_dir: steer/logs/Qwen2-0.5B/
num_responses: 1
steer_from_end_position: false

Note that you can configure text generation parameters here, as long as the field names match those expected by Hugging Face (see Hugging Face Text Generation Docs).

 # Model generation parameters - must match Hugging Face parameter names
generation_params:
  max_new_tokens: 100    
  temperature: 0.9 
  do_sample: True

Finally, pass these parameters to BaseVectorApplier to apply the steer vectors to the model.

vector_applier = BaseVectorApplier(top_cfg)
vector_applier.apply_vectors()

Step 3: Provide the Text Generation Data

As before, there are two ways to provide the dataset:

# You can customize your own inputs
# datasets={'your_dataset_name':[{'input':'hello'},{'input':'how are you'}]}

# Or use the datasets from config.yaml
datasets = prepare_generation_datasets(top_cfg)
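Your prompts may already sit in a plain Python list. In that case, you can wrap them in the same structure as the commented-out example. `make_generation_datasets` is a hypothetical helper. It assumes each item needs only an `input` field, as in that example.

```python
from typing import Dict, List


def make_generation_datasets(name: str, prompts: List[str]) -> Dict[str, List[dict]]:
    """Wrap raw prompt strings in the {'input': ...} format used for generation.

    Hypothetical helper; assumes 'input' is the only required field.
    """
    return {name: [{"input": prompt} for prompt in prompts]}


datasets = make_generation_datasets("your_dataset_name", ["hello", "how are you"])
print(datasets)
```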

Step 4: Generate Text Using the Steered Model

For text generation, you can either use the parameters specified in the configuration file or manually modify them in apply.py:

# Method 1: Use parameters from config.yaml
vector_applier.generate(datasets)

# Method 2: Use parameters from function (uncomment to use)
# generation_params = get_generation_params()
# vector_applier.generate(datasets, **generation_params)

Data Preparation

EasyEdit2 provides several training and testing datasets and also supports custom datasets. The following datasets are currently supported:

Training Dataset

😊Sentiment control

| Dataset | Google Drive | Description |
|---|---|---|
| sst2 | [Google Drive] | Stanford Sentiment Treebank with 2 labels: negative, positive |

🛡️Detoxifying LLMs

| Dataset | Google Drive | Description |
|---|---|---|
| SafeEdit | [Google Drive] | dataset for detoxifying LLMs |
| Toxicity | [Google Drive] | Toxicity-labeled comments dataset for online civility research |

Testing Dataset

➗Mathematical capabilities

| Dataset | Google Drive | Description |
|---|---|---|
| GSM | [Google Drive] | dataset for evaluating models' mathematical problem-solving capabilities |

🛡️Detoxifying LLMs

| Dataset | Google Drive | Description |
|---|---|---|
| SafeEdit | [Google Drive] | test dataset for detoxifying LLMs |
| Realtoxicity | [Google Drive] | test dataset for addressing the risk of neural toxic degeneration in models |
| toxigen | [Google Drive] | dataset for implicit hate speech detection |

😊Sentiment control

| Dataset | Google Drive | Description |
|---|---|---|
| sentiment prompts | [Google Drive] | Subset of OpenWebText Corpus filtered by the sentiment analysis classifier |

🧠General Ability

| Dataset | Google Drive | Description |
|---|---|---|
| MMLU | [Google Drive] | A massive multitask benchmark covering 57 subjects to measure knowledge and reasoning in LLMs. |

Click on the Google Drive links to download the dataset files. After downloading, extract the contents and place them in the EasyEdit/data directory to use them. For more details, please refer to hparams/Steer/dataset.md.

Evaluation

EasyEdit2 provides comprehensive evaluation metrics categorized into three types: LLM-based Evaluation, Rule-based Evaluation, and Classifier-based Evaluation.

LLM-based Evaluation

| Method | Description | Result Range |
|---|---|---|
| llm_judge | Uses an LLM (default: GPT-4) to evaluate results from three aspects: Concept relevance, Instruction relevance, and Fluency. Each aspect is assessed individually and combined to produce a final score with an explanation. | 0-100 + Explanation |

Rule-based Evaluation

| Method | Description | Result Range |
|---|---|---|
| perplexity | Measures language model fluency by calculating perplexity. | 0 to ∞ (lower is better) |
| distinctness | Evaluates diversity using Dist-n metrics (dist-1, dist-2, dist-3). | 0-1 (higher is better) |
| fluency | Uses n-gram entropy to assess fluency. | 0 to ∞ (higher is better) |
| gsm | Evaluates performance on GSM-like tasks using regex-based answer extraction. | Binary |
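Two of the rule-based metrics have standard definitions that are easy to sketch. Dist-n is the number of unique n-grams divided by the total number of n-grams. Perplexity is the exponential of the negative mean token log-probability. The sketch below assumes whitespace tokenization and pools n-grams across all texts. EasyEdit2's own implementation may tokenize or aggregate differently, and the function names are hypothetical.

```python
import math
from typing import List


def distinct_n(texts: List[str], n: int) -> float:
    """Dist-n: unique n-grams / total n-grams, pooled over all texts.

    Assumes simple whitespace tokenization.
    """
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


texts = ["the cat sat", "the cat ran"]
print(distinct_n(texts, 1))              # 4 unique unigrams / 6 total
print(distinct_n(texts, 2))              # 3 unique bigrams / 4 total
print(perplexity([math.log(0.5)] * 4))   # about 2.0: each token had probability 0.5
```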

Classifier-based Evaluation

| Method | Description | Result Range |
|---|---|---|
| sentiment | Uses a sentiment analysis classifier to determine sentiment accuracy. | Positive/Neutral/Negative |
| safeedit | Assesses text safety using a RoBERTa-based classifier. | 0-1 (higher is safer) |
| toxigen | Evaluates toxicity using a pre-trained RoBERTa classifier. | 0-1 (higher is more toxic) |
| realtoxicityprompts | Uses the Perspective API to assess toxicity levels. | 0-1 (higher is more toxic) |

Evaluation Usage

To evaluate the generated results, use the evaluate.py script.

python steer/evaluate/evaluate.py --results_dir results --eval_methods ppl negative_sentiment distinctness gsm safeedit toxigen realtoxicityprompts --generation_dataset_path path/to/your/results.json --model_name_or_path your_model_name_or_path

Arguments:

  • --results_dir: Directory containing the result files to evaluate.
  • --eval_methods: List of evaluation methods to run. Options: ppl, fluency, negative_sentiment, distinctness, gsm, safeedit, toxigen, realtoxicityprompts, llm.
  • --generation_dataset_path: The result file generated by the Vector Applier.
  • --model_name_or_path: Model name or path used for perplexity calculation. Required if ppl is in --eval_methods.
  • --device: Device to run on, e.g., 'cuda' or 'cpu'.
  • --llm_model: Name of the LLM accessed through the API (used by the llm method).
  • --concept: The concept against which the generated text is evaluated when using the llm method.
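The evaluation script is driven entirely by command-line flags. When you sweep over several result files, it can therefore be convenient to assemble the argument list in Python. The helper below is a hypothetical sketch that is not shipped with EasyEdit2. It uses only the flags documented above and enforces the stated rule that --model_name_or_path is required when ppl is selected.

```python
from typing import List, Optional


def build_eval_command(generation_dataset_path: str,
                       eval_methods: List[str],
                       model_name_or_path: Optional[str] = None,
                       device: Optional[str] = None) -> List[str]:
    """Assemble an argument list for steer/evaluate/evaluate.py (hypothetical helper)."""
    if "ppl" in eval_methods and not model_name_or_path:
        raise ValueError("--model_name_or_path is required when 'ppl' is in --eval_methods")
    cmd = ["python", "steer/evaluate/evaluate.py",
           "--generation_dataset_path", generation_dataset_path,
           "--eval_methods", *eval_methods]
    if model_name_or_path:
        cmd += ["--model_name_or_path", model_name_or_path]
    if device:
        cmd += ["--device", device]
    return cmd


cmd = build_eval_command("results/my_dataset_results.json", ["ppl", "distinctness"],
                         model_name_or_path="meta-llama/Llama-2-7b-chat-hf")
print(" ".join(cmd))
# To run it from the repository root: subprocess.run(cmd, check=True)
```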

Notice: When using the RealToxicityPrompts or LLM evaluation methods, make sure to:

  • Set API_KEY for authentication.
  • Set BASE_URL if you use a custom API endpoint (optional).
export API_KEY="your_api_key_here"
export BASE_URL="https://api.example.com/v1"  # Optional, if needed

Example:

python steer/evaluate/evaluate.py --generation_dataset_path results/my_dataset_results.json --eval_methods ppl distinctness safeedit --model_name_or_path meta-llama/Llama-2-7b-chat-hf

Acknowledgments

Our sincerest thanks are extended to CAA, LM-Steer, and AxBench for their invaluable contributions to our project. We have integrated parts of their source code into our work, and for this, we are deeply appreciative.

Furthermore, we are grateful for the ongoing support and collaboration from our community. Special recognition goes to those who have diligently reported issues and shared their technical expertise. Your collective efforts have been instrumental in our project's success. 🙌

Citation

Please cite our paper if you use EasyEdit2 in your work.

@misc{xu2025easyedit2,
  title={EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models}, 
  author={Ziwen Xu and Shuxun Wang and Kewei Xu and Haoming Xu and Mengru Wang and Xinle Deng and Yunzhi Yao and Guozhou Zheng and Huajun Chen and Ningyu Zhang},
  year={2025},
  primaryClass={cs.CL}
}