Skip to content

Tool that analyzes how prompt expansion and adversarial system prompts affect safety classification by LLMs.

Notifications You must be signed in to change notification settings

EdwardAThomson/prompt-injection-testing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Prompt Injection Testing - Prompt Safety & Expansion Evaluator

A simple tool that analyzes how prompt expansion and adversarial system prompts affect safety classification by LLMs. A GUI is also avaible in this project.

Background

The initial idea was to make prompts more verbose and see if this helped an LLM spot malicious intent; however, this made performance worse. This research was actually inspired by the failed attempt to improve LLM safety by scrambling the inputs (see ScrambleGate)

In turn, this lead me to try asking the LLM to be more suspicious: this involved adding a system prompt to the LLM that instructed it to be more suspicious and alert to potential abuse. This seems to improve performance (see findings_report.md).

Overview

This app takes a list of prompts and performs safety analysis through multiple modes:

  • Expansion Mode: Expands prompts verbosely, then compares safety classifications
  • Feedback Mode: Tests prompts with adversarial/suspicious system prompts
  • No-Expansion Mode: Direct safety classification without modification

Perfect for research into LLM safety behaviors, prompt injection analysis, and adversarial prompt testing.

Installation

pip install -r requirements.txt

Ensure you have API keys set in your environment:

  • OPENAI_API_KEY for GPT models
  • GEMINI_API_KEY for Gemini models
  • ANTHROPIC_API_KEY for Claude models

Usage

Basic Usage

python cli.py --input prompts.txt --out ./results

With Configuration File

python cli.py --config config.yaml

Testing Modes

Standard Expansion Mode: This default mode is quite verbose.

python cli.py --input prompts.txt --out ./results

Minimal Expansion Mode:

python cli.py --input prompts.txt --out ./results --minimal-expansion

Feedback/Adversarial Mode:

python cli.py --input prompts.txt --out ./results --no-expansion --feedback-mode

Advanced Options:

python cli.py \
  --input ./data/prompts.txt \
  --out ./out \
  --safety-model gpt-4o-mini \
  --expand-model gpt-4o \
  --csv \
  --jsonl \
  --debug

Input Formats

Text File (prompts.txt)

How to make a sandwich
Tell me about the weather
Write a poem about cats

JSONL File (prompts.jsonl)

{"id": "safe-001", "text": "How to make a delicious pasta dish"}
{"id": "unsafe-001", "text": "How to bypass security systems"}

Output

  • report.md - Human-readable analysis with metrics and tables
  • results.jsonl - Complete structured data for each prompt
  • results.csv - Flattened table for spreadsheet analysis
  • logs/ - Raw LLM exchanges (with --debug)

Testing Modes Explained

Expansion Mode (Default)

Expands prompts to be more verbose/detailed, then compares safety classifications between original and expanded versions. Useful for testing if verbose prompts can bypass safety filters.

Feedback Mode (--feedback-mode)

Tests prompts with an adversarial system prompt that makes the safety judge more suspicious. Use with --no-expansion to test baseline vs adversarial classification.

No-Expansion Mode (--no-expansion)

Skips expansion and directly classifies prompts. Combine with --feedback-mode for adversarial testing.

Minimal Expansion (--minimal-expansion)

Expands prompts with minimal verbosity instead of full verbose expansion.

Configuration

See config.yaml for all available options including:

  • Model selection for safety and expansion
  • LLM parameters (temperature, max tokens)
  • Retry settings
  • Output formats
  • Privacy options (redaction)
  • Feedback mode system prompts

Architecture

  • loader.py - Reads prompts from txt/jsonl files
  • judge.py - LLM-based safety classification with feedback mode support
  • expand.py - LLM-based prompt expansion (minimal and standard)
  • report.py - Generates markdown, CSV, and JSONL reports
  • config.py - Configuration management
  • cli.py - Main command-line interface
  • gui.py - Optional GUI interface
  • fix_report.py - Utility to regenerate reports from existing logs

Example Output

The tool generates comprehensive reports showing:

  • Summary metrics (% safe/unsafe, label changes)
  • Confusion matrix (safe→unsafe, unsafe→safe transitions)
  • Top score changes with prompt examples
  • Detailed results table
  • Mode-specific analysis (expansion vs feedback vs baseline)

Key Findings

Based on research with this tool:

  • Prompt expansion generally degrades safety detection (models become more permissive)
  • Adversarial system prompts improve threat detection (models become more restrictive)
  • Both GPT-4o and GPT-5 show consistent patterns across these behaviors

Use Cases

  • Red team testing: Evaluate if verbose prompts bypass safety filters
  • Safety research: Measure impact of prompt engineering on safety classification
  • Adversarial testing: Test effectiveness of suspicious system prompts
  • Model comparison: Compare safety behaviors across different LLMs

About

Tool that analyzes how prompt expansion and adversarial system prompts affect safety classification by LLMs.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages