A comprehensive guide to understanding and conducting red-teaming exercises on Large Language Models (LLMs).
- Awesome Red-Teaming LLMs
This repository accompanies the paper "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", which presents a comprehensive framework for understanding and conducting red-teaming exercises on Large Language Models (LLMs). Introduction As LLMs become increasingly integrated into various applications, ensuring their security and robustness is paramount. This paper introduces a detailed threat model and provides a systematization of knowledge (SoK) for red-teaming attacks on LLMs. We present:
A taxonomy of attacks based on the stages of LLM development and deployment Insights extracted from previous research in the field Methods for defense against these attacks Practical red-teaming strategies for practitioners
Our work aims to delineate prominent attack motifs and shed light on various entry points, providing a framework for improving the security and robustness of LLM-based systems. This repository organizes the attacks and defenses discussed in the paper, serving as a valuable resource for researchers and practitioners in the field of AI security. The following sections provide an overview of the attack taxonomy and defense strategies discussed in the paper. For a more detailed understanding, please refer to the full paper.
The following figure illustrates the attack surface for Large Language Models (LLMs), highlighting various entry points for potential attacks throughout the LLM lifecycle:
This comprehensive diagram presents attack vectors in increasing order of required access. On the left, we see jailbreak attacks targeting application inputs, representing the widest and most accessible attack surface. Moving right, the figure shows progressively deeper entry points, including LLM APIs, in-context data, model activations, and ultimately, training attacks that require access to model weights and training data. Black arrows indicate the flow of information or artifacts, while gray arrows represent side channels exposed by common data preprocessing steps. This visual representation provides a clear overview of the diverse vulnerabilities in LLM systems, from user-facing interfaces to core training processes, helping guide both attack strategies and defense efforts.
Title | Link |
---|---|
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Link |
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Link |
ChatGPT Jailbreak Reddit | Link |
Anomalous Tokens | Link |
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition | Link |
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks | Link |
Jailbroken: How Does LLM Safety Training Fail? | Link |
Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | Link |
Adversarial Prompting in LLMs | Link |
Exploiting Novel GPT-4 APIs | Link |
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content | Link |
Using Hallucinations to Bypass GPT4's Filter | Link |
Title | Link |
---|---|
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Link |
FLIRT: Feedback Loop In-context Red Teaming | Link |
Jailbreaking Black Box Large Language Models in Twenty Queries | Link |
Red Teaming Language Models with Language Models | Link |
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Link |
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Link |
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Link |
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Link |
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Link |
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Link |
Prompt Injection attack against LLM-integrated Applications | Link |
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Link |
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Link |
Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Link |
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | Link |
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Link |
Title | Link |
---|---|
Universal and Transferable Adversarial Attacks on Aligned Language Models | Link |
ACG: Accelerated Coordinate Gradient | Link |
PAL: Proxy-Guided Black-Box Attack on Large Language Models | Link |
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Link |
Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Link |
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Link |
MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots | Link |
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | Link |
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Link |
Universal Adversarial Triggers Are Not Universal | Link |
Title | Link |
---|---|
Scalable Extraction of Training Data from (Production) Language Models | Link |
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Link |
Extracting Training Data from Large Language Models | Link |
Bag of Tricks for Training Data Extraction from Language Models | Link |
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning | Link |
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Link |
Membership Inference Attacks against Language Models via Neighbourhood Comparison | Link |
Title | Link |
---|---|
A Methodology for Formalizing Model-Inversion Attacks | Link |
Sok: Model inversion attack landscape: Taxonomy, challenges, and future roadmap. | Link |
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures | Link |
Model Leeching: An Extraction Attack Targeting LLMs | Link |
Killing One Bird with Two Stones: Model Extraction and Attribute Inference Attacks against BERT-based APIs | Link |
Model Extraction and Adversarial Transferability, Your BERT is Vulnerable! | Link |
Title | Link |
---|---|
Language Model Inversion | Link |
Effective Prompt Extraction from Language Models | Link |
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Link |
Title | Link |
---|---|
Text Embedding Inversion Security for Multilingual Language Models | Link |
Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence | Link |
Text Embeddings Reveal (Almost) As Much As Text | Link |
Title | Link |
---|---|
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Link |
What Was Your Prompt? A Remote Keylogging Attack on AI Assistants | Link |
Privacy Side Channels in Machine Learning Systems | Link |
Stealing Part of a Production Language Model | Link |
Logits of API-Protected LLMs Leak Proprietary Information | Link |
Title | Link |
---|---|
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | Link |
Adversarial Demonstration Attacks on Large Language Models | Link |
Poisoning Web-Scale Training Datasets is Practical | Link |
Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks | Link |
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | Link |
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Link |
Many-shot jailbreaking | Link |
Title | Link |
---|---|
Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment | Link |
Test-Time Backdoor Attacks on Multimodal Large Language Models | Link |
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering | Link |
Weak-to-Strong Jailbreaking on Large Language Models | Link |
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Link |
Title | Link |
---|---|
Fast Adversarial Attacks on Language Models In One GPU Minute | Link |
Title | Link |
---|---|
Training-free Lexical Backdoor Attacks on Language Models | Link |
Title | Link |
---|---|
Universal Jailbreak Backdoors from Poisoned Human Feedback | Link |
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data | Link |
Title | Link |
---|---|
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models | Link |
On the Exploitability of Instruction Tuning | Link |
Poisoning Language Models During Instruction Tuning | Link |
Learning to Poison Large Language Models During Instruction Tuning | Link |
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection | Link |
Title | Link |
---|---|
The Philosopher's Stone: Trojaning Plugins of Large Language Models | Link |
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models | Link |
Title | Link |
---|---|
Removing RLHF Protections in GPT-4 via Fine-Tuning | Link |
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Link |
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Link |
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Link |
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Link |
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Link |
Large Language Model Unlearning | Link |
Title | Link |
---|---|
Gradient-Based Language Model Red Teaming | Link |
Red Teaming Language Models with Language Models | Link |
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Link |
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Link |
RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning | Link |
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks | Link |
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Link |
Automatically Auditing Large Language Models via Discrete Optimization | Link |
Automatic and Universal Prompt Injection Attacks against Large Language Models | Link |
Unveiling the Implicit Toxicity in Large Language Models | Link |
Hijacking Large Language Models via Adversarial In-Context Learning | Link |
Boosting Jailbreak Attack with Momentum | Link |
If you like our work, please consider citing. If you would like to add your work to our taxonomy please open a pull request.
@article{verma2024operationalizing,
title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
journal={arXiv preprint arXiv:2407.14937},
year={2024}
}