Awesome Red-Teaming LLMs

A comprehensive guide to understanding and conducting red-teaming exercises on Large Language Models (LLMs).



Introduction

This repository accompanies the paper "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", which presents a comprehensive framework for understanding and conducting red-teaming exercises on Large Language Models (LLMs). As LLMs become increasingly integrated into various applications, ensuring their security and robustness is paramount. The paper introduces a detailed threat model and provides a systematization of knowledge (SoK) for red-teaming attacks on LLMs. We present:

- A taxonomy of attacks based on the stages of LLM development and deployment
- Insights extracted from previous research in the field
- Methods for defense against these attacks
- Practical red-teaming strategies for practitioners

Our work aims to delineate prominent attack motifs and shed light on various entry points, providing a framework for improving the security and robustness of LLM-based systems. This repository organizes the attacks and defenses discussed in the paper and serves as a resource for researchers and practitioners in AI security. The following sections give an overview of the attack taxonomy and defense strategies; for a more detailed treatment, please refer to the full paper.

Attack Surface

The following figure illustrates the attack surface for Large Language Models (LLMs), highlighting various entry points for potential attacks throughout the LLM lifecycle:

[Figure: Attack surface across the LLM lifecycle]

The diagram orders attack vectors by the level of access they require. On the left are jailbreak attacks targeting application inputs, the widest and most accessible attack surface. Moving right, the entry points grow progressively deeper: LLM APIs, in-context data, model activations, and finally training-time attacks that require access to model weights and training data. Black arrows indicate the flow of information or artifacts, while gray arrows represent side channels exposed by common data preprocessing steps. Taken together, the figure maps the vulnerabilities of LLM systems from user-facing interfaces to core training processes and helps guide both attack strategies and defense efforts.
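
To make this ordering concrete, the sketch below marks where each attack family in the taxonomy enters a typical LLM application pipeline. It is an illustrative sketch only; the function names, prompt format, and pipeline steps are our own simplifications, not constructs from the paper.

```python
# A deliberately simplified pipeline showing where each attack family in the
# taxonomy below enters an LLM application. Names are illustrative shorthand.

def answer(user_query: str, retrieve, call_llm) -> str:
    # 1. Application input: jailbreaks and direct attacks arrive through the
    #    user-facing prompt, the widest entry point.
    prompt = f"User question: {user_query}\n"

    # 2. In-context data: infusion attacks (e.g. indirect prompt injection)
    #    ride in on retrieved documents, tool outputs, or few-shot examples.
    for doc in retrieve(user_query):
        prompt += f"Context: {doc}\n"

    # 3. Model access: inference attacks target activations, decoding
    #    parameters, and the tokenizer; API side channels leak information.
    # 4. Training time: backdoor, poisoning, and alignment-erasure attacks
    #    require upstream access to weights, fine-tuning, or preference data.
    return call_llm(prompt)


if __name__ == "__main__":
    docs = ["Example retrieved passage (untrusted content enters here)."]
    print(answer("What does the retrieved passage say?",
                 retrieve=lambda query: docs,
                 call_llm=lambda p: f"[model receives {len(p)} characters]"))
```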

Attacks

Taxonomy

Jailbreak Attack

Title Link
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models Link
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study Link
ChatGPT Jailbreak Reddit Link
Anomalous Tokens Link
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition Link
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks Link
Jailbroken: How Does LLM Safety Training Fail? Link
Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks Link
Adversarial Prompting in LLMs Link
Exploiting Novel GPT-4 APIs Link
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content Link
Using Hallucinations to Bypass GPT4's Filter Link

Direct Attack

Automated Attacks

Title Link
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily Link
FLIRT: Feedback Loop In-context Red Teaming Link
Jailbreaking Black Box Large Language Models in Twenty Queries Link
Red Teaming Language Models with Language Models Link
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically Link
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks Link
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking Link
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher Link
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts Link
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications Link
Prompt Injection attack against LLM-integrated Applications Link
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction Link
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming Link
Query-Efficient Black-Box Red Teaming via Bayesian Optimization Link
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs Link
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models Link

Transferable Attacks

Title Link
Universal and Transferable Adversarial Attacks on Aligned Language Models Link
ACG: Accelerated Coordinate Gradient Link
PAL: Proxy-Guided Black-Box Attack on Large Language Models Link
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Link
Open Sesame! Universal Black Box Jailbreaking of Large Language Models Link
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks Link
MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots Link
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models Link
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs Link
Universal Adversarial Triggers Are Not Universal Link

Inversion Attacks

Data Inversion

Title Link
Scalable Extraction of Training Data from (Production) Language Models Link
Explore, Establish, Exploit: Red Teaming Language Models from Scratch Link
Extracting Training Data from Large Language Models Link
Bag of Tricks for Training Data Extraction from Language Models Link
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning Link
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration Link
Membership Inference Attacks against Language Models via Neighbourhood Comparison Link

Model Inversion

Title Link
A Methodology for Formalizing Model-Inversion Attacks Link
SoK: Model Inversion Attack Landscape: Taxonomy, Challenges, and Future Roadmap Link
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures Link
Model Leeching: An Extraction Attack Targeting LLMs Link
Killing One Bird with Two Stones: Model Extraction and Attribute Inference Attacks against BERT-based APIs Link
Model Extraction and Adversarial Transferability, Your BERT is Vulnerable! Link

Prompt Inversion

Title Link
Language Model Inversion Link
Effective Prompt Extraction from Language Models Link
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts Link

Embedding Inversion

Title Link
Text Embedding Inversion Security for Multilingual Language Models Link
Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence Link
Text Embeddings Reveal (Almost) As Much As Text Link

Side Channel Attacks

Title Link
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation Link
What Was Your Prompt? A Remote Keylogging Attack on AI Assistants Link
Privacy Side Channels in Machine Learning Systems Link
Stealing Part of a Production Language Model Link
Logits of API-Protected LLMs Leak Proprietary Information Link

Infusion Attack

Title Link
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection Link
Adversarial Demonstration Attacks on Large Language Models Link
Poisoning Web-Scale Training Datasets is Practical Link
Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks Link
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models Link
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations Link
Many-shot jailbreaking Link

Inference Attack

Latent Space Attack

Title Link
Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment Link
Test-Time Backdoor Attacks on Multimodal Large Language Models Link
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering Link
Weak-to-Strong Jailbreaking on Large Language Models Link
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space Link

Decoding Attack

Title Link
Fast Adversarial Attacks on Language Models In One GPU Minute Link

Tokenizer Attack

Title Link
Training-free Lexical Backdoor Attacks on Language Models Link

Training Time Attack

Backdoor Attack

Preference Tuning Stage

Title Link
Universal Jailbreak Backdoors from Poisoned Human Feedback Link
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data Link

Instruction Tuning Stage

Title Link
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models Link
On the Exploitability of Instruction Tuning Link
Poisoning Language Models During Instruction Tuning Link
Learning to Poison Large Language Models During Instruction Tuning Link
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection Link

Adapters and Model Weights

Title Link
The Philosopher's Stone: Trojaning Plugins of Large Language Models Link
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models Link

Alignment Erasure

Title Link
Removing RLHF Protections in GPT-4 via Fine-Tuning Link
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models Link
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! Link
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B Link
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections Link
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases Link
Large Language Model Unlearning Link

Gradient-Based Attacks

Title Link
Gradient-Based Language Model Red Teaming Link
Red Teaming Language Models with Language Models Link
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia Link
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Link
RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning Link
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks Link
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability Link
Automatically Auditing Large Language Models via Discrete Optimization Link
Automatic and Universal Prompt Injection Attacks against Large Language Models Link
Unveiling the Implicit Toxicity in Large Language Models Link
Hijacking Large Language Models via Adversarial In-Context Learning Link
Boosting Jailbreak Attack with Momentum Link

Defenses

| Study | Category | Short Description |
|---|---|---|
| OpenAI Moderation Endpoint | Guardrail | OpenAI Moderations Endpoint |
| A New Generation of Perspective API: Efficient Multilingual Character-level Transformers | Guardrail | Perspective API's Toxicity API |
| Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Guardrail | Llama Guard |
| Guardrails AI: Adding guardrails to large language models | Guardrail | Guardrails AI Validators |
| NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails | Guardrail | NVIDIA NeMo Guardrails |
| RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Guardrail | RigorLLM (Safe Suffix + Prompt Augmentation + Aggregation) |
| Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield | Guardrail | Adversarial Prompt Shield Classifier |
| WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Guardrail | WildGuard |
| SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Prompting | SmoothLLM (Prompt Augmentation + Aggregation) |
| Defending ChatGPT against jailbreak attack via self-reminders | Prompting | Self-Reminder |
| Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender | Prompting | Intention Analysis Prompting |
| Defending LLMs against Jailbreaking Attacks via Backtranslation | Prompting | Backtranslation |
| Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Prompting | Safe Suffix |
| Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Prompting | Safe Prefix |
| Jailbreaker in Jail: Moving Target Defense for Large Language Models | Prompting | Prompt Augmentation + Auxiliary model |
| Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Prompting | Prompt Augmentation + Aggregation |
| Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Prompting | Prompt Paraphrasing |
| Detecting Language Model Attacks with Perplexity | Prompting | Perplexity-Based Defense |
| Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | Prompting | Rewrites the input prompt into a safe prompt using a sentinel model |
| Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks | Prompting | Safe Suffix/Prefix (requires access to log-probabilities) |
| Protecting Your LLMs with Information Bottleneck | Prompting | Information Bottleneck Protector |
| Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications | Prompting/Fine-Tuning | Introduces 'Signed-Prompt' for authorizing sensitive instructions from approved users |
| SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Decoding | Safety-Aware Decoding |
| Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Model Pruning | Uses WANDA pruning |
| A safety realignment framework via subspace-oriented model fusion for large language models | Model Merging | Subspace-oriented model fusion |
| Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge | Model Merging | Model merging to prevent backdoor attacks |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Activation Editing | KL-then-steer to decrease side effects of steering vectors |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Alignment | Generation-Aware Alignment |
| Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | Alignment | Layer-specific editing |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Alignment | Regularized fine-tuning objective for deep safety alignment |
| Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Alignment | Goal prioritization during the training and inference stages |
| AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts | Alignment | Instruction tuning on the AEGIS safety dataset |
| The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | Fine-Tuning | Training with an instruction hierarchy |
| Immunization against harmful fine-tuning attacks | Fine-Tuning | Immunization conditions to defend against harmful fine-tuning |
| Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment | Fine-Tuning | Backdoor Enhanced Safety Alignment to defend against harmful fine-tuning |
| Representation noising effectively prevents harmful fine-tuning on LLMs | Fine-Tuning | Representation Noising to defend against harmful fine-tuning |
| Differentially Private Fine-tuning of Language Models | Fine-Tuning | Differentially private fine-tuning |
| Large Language Models Can Be Good Privacy Protection Learners | Fine-Tuning | Privacy Protection Language Models |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Fine-Tuning | Latent Adversarial Training |
| From Shortcuts to Triggers: Backdoor Defense with Denoised PoE | Fine-Tuning | Denoised Product-of-Experts for protecting against various kinds of backdoor triggers |
| Detoxifying Large Language Models via Knowledge Editing | Fine-Tuning | Detoxifying by knowledge editing of toxic layers |
| GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | Inspection | Analysis of safety-critical parameter gradients |
| Certifying LLM Safety against Adversarial Prompting | Certification | Erase-and-check framework |
| PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models | Certification | Isolate-then-Aggregate to protect against the PoisonedRAG attack |
| Quantitative Certification of Bias in Large Language Models | Certification | Bias certification of LLMs |
| garak: A Framework for Security Probing Large Language Models | Model Auditing | Garak LLM vulnerability scanner |
| giskard: The Evaluation & Testing framework for LLMs & ML models | Model Auditing | Evaluates performance and bias issues in AI applications |
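
Several of the prompting-stage defenses listed above are simple enough to prototype in a few lines. As one example, the sketch below captures the core idea behind the perplexity-based defense ("Detecting Language Model Attacks with Perplexity"): optimized adversarial suffixes tend to look like high-perplexity gibberish, so a small scoring model can flag them. The scoring model, threshold, and function names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a perplexity-based input filter. The scoring model
# ("gpt2"), the threshold, and the helper names are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds the threshold, e.g. prompts
    carrying gradient-optimized adversarial suffixes."""
    return perplexity(prompt) > threshold

if __name__ == "__main__":
    print(is_suspicious("Please summarize this article about solar energy."))
```

A practical deployment would calibrate the threshold on benign traffic and combine such a check with other guardrails from the table, since low-perplexity attacks such as persuasion-style jailbreaks will pass a perplexity filter.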

If you like our work, please consider citing it. If you would like to add your work to our taxonomy, please open a pull request.

BibTeX

@article{verma2024operationalizing,
  title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
  author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
  journal={arXiv preprint arXiv:2407.14937},
  year={2024}
}

About

Repository accompanying the paper https://arxiv.org/abs/2407.14937
