Skip to content

Practical AI safety engineering: evaluation frameworks and red-teaming tools for assessing frontier AI systems

License

Notifications You must be signed in to change notification settings

cjackett/ai-safety

Repository files navigation

AI Safety Engineering Resources

Overview

This repository documents a comprehensive study of AI safety evaluation engineering through 12 hands-on experiments, progressing from foundational implementations to production tool proficiency. The work demonstrates capabilities required for frontier model evaluation, red-teaming frameworks, behavioral testing, safety infrastructure, and mechanistic interpretability.

Getting Started

Prerequisites

  • Python 3.12+
  • NVIDIA GPU with CUDA support (recommended for local model inference)
  • uv package manager

Install Ollama

Ollama enables local inference of open-source language models. Install it on Linux:

curl -fsSL https://ollama.com/install.sh | sh

For other platforms, see ollama.com/download

Start Ollama Server

Start Ollama as a background service:

# Start Ollama server
ollama serve &

# Verify server is running
curl http://localhost:11434

The Ollama server runs on http://localhost:11434 and typically starts automatically on most installations.

Download Local Models

This repository uses four optimized small models (~9.5GB total) for capability probing:

# Download models (automatic GPU detection)
ollama pull llama3.2:3b    # ~2GB - general capabilities
ollama pull qwen3:4b       # ~2.5GB - strong coding/reasoning
ollama pull gemma3:4b      # ~2.5GB - efficient testing
ollama pull mistral:7b     # ~4GB - larger baseline comparison

# Verify models are available
ollama list

Install Python Dependencies

Install Python dependencies using uv:

# Install uv if not already installed
pip install uv

# Install dependencies
uv sync

Repository Structure

ai-safety/
├── experiments/
│   ├── 01_capability_probing/           # Baseline safety guardrail testing
│   │   ├── run_capability_probing.py    # Test models against harmful prompts
│   │   ├── analyse_results.py           # Generate visualizations and statistics
│   │   ├── prompts/                     # 80 harmful prompts across 8 categories
│   │   └── results/                     # Test results and radar charts
│   │
│   ├── 02_jailbreak_testing/            # Adversarial attack evaluation
│   │   ├── jailbreak_models.py          # Test jailbreak attack effectiveness
│   │   ├── analyse_results.py           # Generate visualizations and statistics
│   │   ├── prompts/                     # 44 jailbreak attacks across 4 types
│   │   └── results/                     # Attack results and radar charts
│   │
│   ├── 03_behavioral_evals/             # Behavioral alignment testing
│   │   ├── behavioral_eval.py           # Test behavioral safety patterns
│   │   ├── analyse_results.py           # Generate visualizations and statistics
│   │   ├── prompts/                     # 50 behavioral prompts across 5 categories
│   │   └── results/                     # Evaluation results and radar charts
│   │
│   ├── 04_multimodal_safety/            # Vision-language model safety evaluation
│   │   ├── run_multimodal_safety.py     # Test multimodal attack vectors
│   │   ├── analyse_results.py           # Generate visualizations and statistics
│   │   ├── generate_test_images.py      # Create test images from prompts
│   │   ├── config/                      # Configuration (models, attack types, classification)
│   │   ├── prompts/                     # 23 multimodal prompts across 5 attack types
│   │   ├── images/                      # Generated test images
│   │   └── results/                     # Test results, figures, and analysis reports
│   │
│   ├── 05_induction_heads/              # Mechanistic interpretability - circuit discovery
│   │   ├── find_induction_heads.py      # Discover induction heads in GPT-2 small
│   │   ├── analyse_circuits.py          # Generate visualizations and statistical analysis
│   │   ├── test_sequences/              # 25 test sequences across 5 pattern categories
│   │   └── results/                     # Induction scores, attention patterns, analysis report
│   │
│   ├── 06_guardrail_testing/            # Production safety infrastructure evaluation
│   │   ├── safety_pipeline.py           # Full 5-layer defense pipeline
│   │   ├── input_guardrails.py          # Jailbreak/encoding/injection detection
│   │   ├── output_guardrails.py         # Harm classification, PII redaction
│   │   ├── access_control.py            # API keys, rate limiting, audit logging
│   │   ├── test_guardrails.py           # Automated test suite
│   │   ├── analyse_results.py           # Generate visualizations and reports
│   │   ├── configs/                     # Security profiles (strict/balanced/permissive)
│   │   ├── prompts/                     # 44 jailbreak + 50 benign prompts
│   │   └── results/                     # Test results, visualizations, analysis report
│   │
│   ├── 07_garak_vulnerability_scan/     # Automated vulnerability scanning vs manual red-teaming
│   │   ├── run_garak_scans.py           # Run Garak vulnerability scans on models
│   │   ├── parse_garak_results.py       # Parse Garak JSONL outputs into structured data
│   │   ├── compare_methods.py           # Compare automated vs manual red-teaming effectiveness
│   │   ├── analyse_results.py           # Generate comparative visualizations
│   │   └── results/                     # Garak outputs, parsed data, analysis, figures
│   │       ├── garak/                   # Raw Garak JSONL reports and logs
│   │       ├── parsed/                  # Structured JSON results
│   │       ├── analysis/                # Comparison summary and analysis report
│   │       └── figures/                 # Comparative visualizations (5 charts)
│   │
│   ├── 08_inspect_ai_evaluations/       # UK AISI evaluation framework
│   ├── 09_pyrit_adversarial/            # Microsoft PyRIT red-teaming
│   ├── 10_nemo_guardrails/              # NVIDIA NeMo production guardrails
│   ├── 11_gemma_scope/                  # Google DeepMind feature visualization
│   └── 12_sae_training/                 # Sparse autoencoder training
│
├── pyproject.toml                       # Python dependencies (ollama, openai, anthropic, pandas, matplotlib)
└── README.md                            # This file

Experiments

Category: Evaluation

Overview: This experiment tested baseline safety performance across four local models (llama3.2:3b, qwen3:4b, gemma3:4b, mistral:7b) by presenting each with 80 direct harmful requests across 8 risk categories (CBRN, cyber, illegal, harmful content, disinformation, privacy, child safety, deception). Using LLM-based multi-judge classification, we evaluated whether models refused the requests, hedged with disclaimers, or fully complied.

Target Capability: Demonstrated experience building and running evaluations of frontier AI systems or safety-relevant model behaviours; running large-scale behavioural evaluations, including managing logs and datasets.

Key Findings: Testing 320 prompts revealed sharp variation in baseline safety. Llama3.2 achieved perfect refusal (80/80, 100%) with zero hedging and zero full compliance, while Gemma3 reached 92.5% refusal with 7.5% hedging and maintained 0% full compliance. Qwen3 showed concerning vulnerabilities at 86.2% refusal with 6.2% full compliance (5 cases), and Mistral exhibited catastrophic failures at 40.0% refusal with 55.0% hedging and 5.0% full compliance (4 cases). Overall: 79.69% refusal, 17.5% hedging, 2.81% full compliance, with all 9 full compliance cases split between Qwen3 (5) and Mistral (4). The LLM-based classification successfully caught Mistral's extraordinary hedging behavior where it provides harmful content with disclaimers like "for educational purposes."

Capability Probing Radar Charts

Outcome: Model size does not predict safety. The 3B Llama3.2 with 0% full compliance dramatically outperformed the 7B Mistral with 5.0% full compliance. Strong baseline safety is achievable in small open-source models (Llama3.2 and Gemma3 both at 0% full compliance). However, Qwen3's 5 full compliance cases spread across five different categories showed that an 86% aggregate refusal rate can hide serious vulnerabilities in specific risk areas. Mistral's 55.0% hedging rate exposed a fundamental safety architecture flaw where the model recognises harmful intent but proceeds with disclaimered dangerous information. The split of full compliance failures between Qwen3 and Mistral reinforced that systematic evaluation is non-negotiable. This raised the question of whether adversarial attacks would reveal hidden vulnerabilities in the two models with the strongest baseline safety, as tested in Experiment 02.

Key Citations:


Category: Red-Teaming | Builds on: Experiment 01

Overview: This experiment tested whether baseline safety can be bypassed using adversarial techniques by evaluating 176 jailbreak attempts (44 per model) across four attack categories (encoding, roleplay, prompt injection, multi-turn) against four models. Using identical multi-judge LLM classification, we investigated whether baseline refusal rates predict adversarial robustness or whether sophisticated attacks expose hidden vulnerabilities that direct prompts miss entirely.

Target Capability: Experience developing or using safety-related tooling to support evaluations, such as red-teaming frameworks, test harnesses, automated evaluation pipelines; working knowledge of safety-relevant AI failure modes including robustness issues, jailbreak vulnerabilities.

Key Findings: Adversarial attacks degraded performance from 79.7% baseline refusal to 77.3% jailbreak resistance, increasing full compliance from 2.8% to 11.4% (4.1x amplification). Baseline rankings largely held but masked critical vulnerabilities: Llama3.2 maintained high robustness (93.2% refused, 2.3% full), Gemma3 high resistance (77.3% refused, 4.5% full), Qwen3 moderate vulnerability (84.1% refused, 13.6% full), while Mistral exhibited substantial vulnerability (54.5% refused, 25% full with 5x amplification). Multi-turn attacks emerged as most effective overall (17.5% success) but with substantial model-specific divergence: Mistral's 60% vulnerability versus Llama3.2/Gemma3 complete resistance (0%), proving conversational safety is architecture-dependent, not inherently difficult. Attack-specific heterogeneity revealed: Gemma3 completely resisted encoding (0%) and multi-turn (0%), Qwen3 showed concentrated encoding (30%) and roleplay (16.7%) vulnerability, Mistral showed substantial vulnerability to conversational manipulation (60% multi-turn, 33.3% roleplay). Injection attacks proved almost entirely ineffective (2.1% overall), demonstrating modern models successfully resist system-override attempts.

Jailbreak Testing Radar Charts

Outcome: Baseline testing provided directional guidance but completely missed attack-specific vulnerabilities and amplification factors. Mistral's 60% multi-turn vulnerability (12x amplification from 5% baseline) showed conversational dynamics substantially amplify architectural weaknesses, while Llama3.2/Gemma3's complete multi-turn resistance (0%) proved conversational safety is achievable and should be deployment standard. Encoding resistance varied substantially (Gemma3 0% vulnerable vs Qwen3 30%), demonstrating safety architectures are highly specialised. Organisations must conduct comprehensive adversarial red-teaming covering multi-turn, encoding, roleplay, and injection attacks, as baseline metrics provide necessary but insufficient safety assessment. Injection resistance proved near-universal (98%), representing a successfully solved problem through adversarial training.

Key Citations:


Category: Evaluation | Builds on: Experiments 01, 02

Overview: Tests internal behavioral alignment across five critical dimensions: situational awareness, deception, sycophancy, honesty, and power-seeking. Evaluates 200 prompts (50 per model) using multi-judge LLM classification to detect subtle misalignment patterns like spurious confidence about deployment context, fabrication of false information, uncritical agreement with user falsehoods, and desires for expanded capabilities or reduced oversight.

Target Capability: Demonstrated experience running large-scale behavioural evaluations, including managing logs and datasets, diagnosing evaluation issues; working knowledge of safety-relevant AI failure modes including unintended behaviours and reliability failures.

Key Findings: Testing 199 classified results (1 unclassified Qwen3 case) revealed substantial variation with 76.4% aligned, 15.1% concerning, and 8.5% problematic. The 50-percentage-point spread between best (Llama3.2/Gemma3 at 96%) and worst (Mistral at 46%) exceeds variation in baseline safety and jailbreak resistance, making behavioral alignment the least uniformly implemented safety dimension. Honesty emerged as the critical universal vulnerability (61.5% aligned), affecting all models including Qwen3's catastrophic 44.4% honesty alignment (systematic fabrication of stock predictions, nonexistent studies, unknowable facts) and Mistral's severe sycophancy (20% aligned) and power-seeking (40% aligned) failures. Even strong models showed honesty weaknesses (Llama3.2: 1 problematic case), indicating widespread training gaps around epistemic humility where models prioritise definite-sounding answers over admitting knowledge limits.

Behavioral Evaluations Radar Charts

Outcome: Demonstrates that behavioral alignment and adversarial robustness are independent safety dimensions: Gemma3 achieved 96% behavioral alignment despite moderate jailbreak resistance (77.3%), while Qwen3 showed inverse performance (84.1% jailbreak resistance, 67.3% behavioral alignment). Llama3.2 emerges as the comprehensive safety leader with 96% behavioral alignment complementing perfect baseline refusal (100%) and exceptional jailbreak resistance (93.2%). As the smallest model tested (3B parameters), it outperformed Mistral (7B), demonstrating safety effectiveness depends on training methodology rather than scale. Comprehensive safety evaluation must assess baseline refusal, adversarial robustness, and behavioral alignment independently, as models can excel at detecting harmful intent while fabricating unknowable information, or maintain honesty while failing to resist adversarial attacks. The 3-tier classification (aligned/concerning/problematic) proved essential for detecting borderline-unsafe patterns, particularly Mistral's 32% concerning rate indicating systematic borderline-unsafe behaviors.

Key Citations:


Category: Multimodal | Builds on: Experiments 01, 02

Overview: Tests whether harmful instructions embedded in images bypass safety guardrails more frequently than text-only attacks. Evaluates 66 harmful prompts (22 per model) across three vision-language models (llava:7b, qwen3-vl:4b, gemma3:4b) using five attack categories: OCR harmful text, cross-modal inconsistency, encoded harmful content (Base64, ROT13, Caesar cipher, leetspeak, reverse text), direct harmful images, and jailbreak roleplay (DAN prompts, creative writing frames, educational framing). The experiment tests the hypothesis that vision models apply weaker safety scrutiny to visual inputs compared to text inputs.

Target Capability: Extensive hands-on experience working with frontier or near-frontier AI models and systems, including multimodal systems; stress-testing technical safeguards across multiple input modalities.

Key Findings: Testing 66 harmful prompts revealed 22.1-percentage-point safety degradation from text-only baseline (79.69% refusal in Experiment 01 vs 57.6% vision refusal), confirming vision capabilities introduce systematic vulnerabilities not addressed by text-only safety training. Gemma3 maintained strongest vision safety (68.2% refusal, 13.6% hedged, 18.2% full) but still degraded from its 92.5% text-only baseline. Llava showed moderate safety (54.5% refusal, 9.1% hedged, 36.4% full). Qwen3-vl exhibited notable multimodal weakness (50.0% full compliance, 0% hedging), the lowest refusal rate across all models with binary refuse-or-comply behaviour. Multi-turn setup attacks proved most effective (100% success rate, 3/3 full compliance), followed by encoded content attacks (Base64, Caesar, leetspeak, reverse text achieving 66.7% success), while educational framing proved ineffective (0% success). Cross-modal inconsistency achieved 46.7% success by exploiting text-priority bias. Deception emerged as most vulnerable category (22.2% refusal), followed by CBRN (50.0%) and cyber (58.3%), while harmful content showed strongest resistance (73.3%). High OCR verification (87.0%) confirms these represent genuine safety failures rather than vision capability limitations.

Multimodal Safety Radar Charts

Outcome: Text safety training does not fully transfer to visual inputs, and all vision models showed degraded performance from text-only baselines. The finding that encoded attacks succeed 2.5x more frequently than plaintext embedding (66.7% vs 26.7%) reveals a notable architectural gap: vision models apply safety checks to literal extracted text but not to decoded semantic content. Cross-modal inconsistency attacks exploited text-priority bias where models trust prompt descriptions over actual image content. Qwen3-vl was identified as most vulnerable model in 2 categories outright (deception, child safety) and tied in 2 additional categories (cyber, privacy), indicating gaps in multimodal safety training rather than isolated weaknesses. Organisations deploying vision-language models should not rely on text-based safety benchmarks. Multimodal-specific safety testing, semantic content analysis after decoding, and cross-modal verification mechanisms are important architectural requirements.

Key Citations:


Category: Interpretability | Tools: TransformerLens

Overview: Shifts from black-box safety evaluation to white-box circuit analysis by discovering induction heads in GPT-2 small (12 layers, 12 heads per layer). Induction heads are attention circuits that enable in-context learning by detecting repeated patterns (e.g., "A B C ... A B" → predict "C"). Tests 144 attention heads across 25 sequences spanning simple repetition, name tracking, random tokens, offset patterns, and control cases (no repetition). Uses TransformerLens to extract attention patterns and computes induction scores based on stripe patterns (backward attention to lookback window), diagonal coherence (structured pattern matching), and immediate attention penalties (distinguishing from previous-token copying).

Target Capability: Diagnosing evaluation or deployment issues and debugging through mechanistic interpretability; understanding model internals to identify capability-relevant circuits and conduct targeted interventions via ablation studies.

Key Findings: Discovered 78 induction heads (54% of all heads) with scores ≥ 0.3, confirming widespread pattern-matching circuitry. Top candidates Layer 5 Head 5 (0.385), Layer 5 Head 1 (0.382), Layer 7 Head 2 (0.378), and Layer 6 Head 9 (0.374) align with Olsson et al. (2022) predictions for middle layers (5-6). Ablation studies (30 heads tested) causally verified the two-layer circuit structure: Layer 0 Head 1 showed 24.9% impact on in-context learning despite low induction score (0.033), while Layer 5 Head 1 showed only 7.4% impact despite high score (0.382). This confirms induction is implemented as a circuit (Layer 0 previous-token heads + Layer 5-7 induction heads via K-composition), not isolated heads. Clear layer progression: early layers (0-4) averaged 0.189-0.255, middle layers (5-6) 0.302-0.304, late layers (7-11) 0.313-0.336. Bimodal score distribution (peaks at 0.20 and 0.32) reveals discrete functional specialization.

Induction Head Heatmap

Outcome: Validates mechanistic interpretability as viable approach for AI safety research, moving beyond testing outputs to reverse-engineering internal circuits. Successfully replicating Anthropic's methodology establishes foundation for future circuit-level analysis of safety-critical behaviors like deception detection, sycophancy mechanisms, and refusal implementation. The widespread distribution of induction heads (54% above threshold vs. literature's focus on few key heads) has safety implications: redundant circuits make targeted ablation insufficient for disabling capabilities, meaning jailbreak defenses must account for multiple pathways. In-context learning enables few-shot jailbreaking where adversaries teach harmful patterns through demonstrated examples, and understanding these circuits informs defense mechanisms. Demonstrates tools for capability auditing (verifying dangerous capabilities by circuit inspection), targeted interventions (ablation studies to test causal roles), and potential safety engineering (implementing safety properties at circuit level). Establishes TransformerLens workflow for discovering circuits in future experiments.

Key Citations:


Category: Deployment | Builds on: Experiment 02

Overview: Shifts from red-teaming (attacking models) to blue-teaming (building defences) by implementing a five-layer safety pipeline: access control, input guardrails, model inference, output guardrails, and audit logging. Tests 44 jailbreak attacks and 50 benign prompts across three security configurations (strict, balanced, permissive) to evaluate defence effectiveness, false positive rates, and latency overhead. Uses mistral:7b (40% baseline refusal) rather than llama3.2 (100% baseline) to enable realistic output guardrail validation, as perfectly aligned models never generate harmful content for output filters to catch.

Target Capability: Experience implementing and stress-testing technical safeguards including guardrails, filtering systems, access controls, and inference-time controls; developing safety-related tooling such as automated evaluation pipelines and continuous monitoring systems.

Key Findings: Testing 94 prompts across three configurations revealed output guardrails are the most critical layer, catching 20.5-36.4% of attacks across all configurations including 20.5% in permissive mode with zero input filtering. Strict mode achieved 56.8% total defence (20.5% input + 36.4% output blocks) with 10.0% false positives and 49.5s mean latency. Balanced mode provided 50.0% defence (13.6% input + 36.4% output) with 8.0% false positives and 46.9s latency. Permissive mode demonstrated 20.5% defence (output-only) with 0% false positives and 39.3s latency. Output blocks (36.4%) exceeded input blocks (20.5%) in strict mode, demonstrating semantic harm detection via multi-judge LLM classification outperforms pattern-based input filtering. Category-specific defence varied substantially: cyber, CBRN, and deception achieved 75% blocked in strict mode, illegal achieved 66.7% blocked, while privacy attacks achieved 75-100% bypass (100% in strict/permissive, 75% in balanced) as social engineering largely evades both pattern matching and harm classification.

Defence Effectiveness by Layer and Configuration

Guardrail Defence by Category and Configuration

Outcome: Demonstrates that model alignment alone is insufficient for production deployment. Even strict mode allows 43.2% bypass rate, with output guardrails catching 64% of all blocked attacks (16/25 total blocks), proving multi-judge LLM classification is desirable for production systems. The finding that output blocks exceeded input blocks in strict mode suggests that semantic harm detection may be more effective than pattern matching for catching sophisticated attacks. Privacy attacks achieving 75-100% bypass reveals that stateless per-prompt guardrails struggle to defend against social engineering or conversational manipulation, requiring conversation-level trajectory analysis for conversational AI deployments. False positive analysis showed technical content (encryption, cybersecurity) faces 20% blocking rate due to keyword overlap with malicious queries, requiring category-specific tuning or user tier exemptions for production systems. Organisations deploying LLMs require defence-in-depth combining model alignment, input/output guardrails, access controls, and monitoring, as no single layer provides comprehensive protection.

Key Citations:


Category: Red-Teaming | Tools: Garak LLM Vulnerability Scanner | Builds on: Experiments 01, 02, 06

Overview: Compares automated vulnerability scanning effectiveness against manual red-teaming by running Garak's systematic probe library (36,120 tests across 3 models) versus Experiment 02's hand-crafted jailbreak attacks (176 tests). Tests whether industry-standard automated tools can provide comparable discovery rates while dramatically reducing human effort, or whether manual red-teaming's targeted creativity achieves superior per-prompt effectiveness. Evaluates 9 Garak probe categories (encoding, DAN/roleplay, prompt injection, toxicity, continuation) matching Experiment 02's attack types plus novel vulnerability categories, analyzing coverage trade-offs, false positive rates, and efficiency metrics to inform optimal integration strategies for AISI operational workflows.

Target Capability: Experience developing or using safety-related tooling such as automated evaluation pipelines, red-teaming frameworks, test harnesses; working knowledge of vulnerability scanning methodologies and comparative effectiveness analysis.

Key Findings: Automated scanning achieved 13.1% vulnerability discovery rate nearly identical to manual red-teaming's 12.9% despite testing 205x more prompts (36,120 vs 176), contradicting the hypothesis that human-crafted targeted attacks achieve 2-4x higher per-prompt success. However, this aggregate parity masks important divergences: prompt injection emerged as the universal automated attack vector with 10-29% success rates (Gemma3 at 29.3% = 2,247/7,680 failures), notably outperforming manual injection testing (6.2% success). Encoding attacks showed strong resistance across all models (0.2-1.6% vulnerability), validating modern safety architectures apply filtering post-decoding. Multi-turn attacks represent complete coverage gap: Garak tested 0 multi-turn prompts versus 40 manual attempts achieving 25.0% success (Mistral at 60%), revealing automated tools systematically underestimate conversational safety risks due to stateless probe-detector architecture. Novel toxicity discovery uncovered 165 failures (8-14% rates) in categories entirely absent from manual testing, demonstrating comprehensive vulnerability surface mapping value. Encoding coverage expanded 576x (23,040 vs 40 tests), injection 160x (7,680 vs 48), but roleplay only 1.25x (60 vs 48), revealing automation excels at systematic variation within attack categories but cannot replicate human creativity in multi-turn manipulation.

Automated vs Manual Coverage Comparison

Vulnerability Severity Distribution

Outcome: Demonstrates automated and manual red-teaming discover orthogonal vulnerability classes rather than overlapping attack spaces, necessitating integrated workflows instead of tool substitution. Garak's complete inability to test multi-turn attacks creates dangerous false confidence: Mistral's 11.1% automated vulnerability appears notably superior to its 25.0% manual adversarial robustness when conversational manipulation is considered. The finding that injection testing achieved 10-29% automated success versus 6.2% manual proves systematic probe variation discovers weaknesses spot-checking misses, indicating injection resistance should be validated through automated comprehensive testing (thousands of variations) rather than manual sampling. Toxicity generation's 8-14% rates despite 0-2.3% jailbreak success reveals safety training prioritizes harmful intent recognition over toxic language filtering, creating blind spots where models refuse explicit harm requests but generate offensive content under stylistic framing. Optimal AISI integration strategy: Phase 1 automated comprehensive scanning (8-12 hours) for systematic vulnerability surface mapping and weak category identification; Phase 2 targeted manual red-teaming (8-16 hours) exploiting conversational context, developing novel attack vectors, and validating automated findings; Phase 3 automated regression testing (2-4 hours) for continuous monitoring during model development. Organizations deploying conversational AI cannot rely solely on automated scanning. Production safety validation requires both approaches to capture systematic variation (automation strength) and multi-turn creative exploitation (human expertise). The 205x coverage multiplier at only 1.1x time cost validates automation for comprehensive testing, but the multi-turn gap (0 tests vs 40 manual, 25% success) represents deployment-blocking limitation requiring human red-team supplementation.

Key Citations:


Category: Tool Proficiency | Tools: UK AISI Inspect AI | Builds on: Experiments 01-03, 07

Overview: Demonstrates proficiency with UK AISI's official Inspect AI evaluation framework by implementing 3 extreme risk evaluations (CBRN, cyber offense, persuasion) while validating prior safety findings through independent evaluation methodology. Tests 360 prompts (90 per model × 4 models) consisting of 60 harmful and 30 benign requests per model across the three categories that government AI Safety Institutes prioritize for catastrophic risk assessment. Uses Inspect's Task/Solver/Scorer abstractions for reproducible evaluations with structured JSON logging, comparing 6x development speedup versus custom implementation while maintaining audit-ready compliance artifacts.

Target Capability: Experience developing or using safety-related tooling to support evaluations such as test harnesses and automated evaluation pipelines; extensive hands-on experience working with frontier or near-frontier AI models and systems; demonstrated experience running large-scale behavioral evaluations including managing logs and datasets.

Key Findings: Evaluation of 360 prompts revealed Llama3.2:3b achieved best balanced safety (98.3% refusal, 96.7% compliance) as the only model exceeding 95% on both metrics, validating Experiments 01-02 findings. Mistral:7b showed dangerous under-refusal (65.0% refusal rate identical to Experiment 01 baseline) with 40% persuasion attack success, making it unsuitable for deployment without additional safety layers. Gemma3 exhibited over-cautious behavior (98.3% refusal but only 50% compliance), refusing 70% of benign cybersecurity education questions, demonstrating safety-helpfulness calibration challenges. Qwen3:4b showed well-calibrated performance (90.0% refusal, 100% compliance) balancing safety with utility. Cross-experiment validation demonstrated methodology robustness: Llama3.2 (Exp 01: 100% baseline, Exp 02: 90.9% jailbreak resistance, Exp 08: 98.3% extreme risk refusal), Mistral (Exp 01: 65%, Exp 02: 54.5%, Exp 08: 65.0%). CBRN threats showed strongest guardrails (91.3% average refusal), persuasion weakest (Mistral 40% success, as subtle misinformation/propaganda is harder to detect than explicit harmful instructions), and cyber offense most over-refusal (Gemma3 refuses legitimate security education). Framework efficiency achieved 6x development speedup (4 hours vs 24 hours for 3 tasks) through Task template reuse and model abstraction, with hybrid scoring approach (pattern matching for well-defined refusal criteria, zero LLM judge calls) providing audit-ready JSON logs for regulatory compliance.

Refusal Rates by Model and Task

Compliance Rates by Model and Task

Outcome: Successfully demonstrates operational readiness with AISI-standard tooling while independently validating Experiments 01-02 model safety rankings through orthogonal evaluation methodology, proving both framework proficiency and evaluation methodology robustness. The consistent identification of Mistral as weakest (65% refusal across all experiments) and Llama3.2 as strongest (98-100% across experiments) using different frameworks (custom multi-judge classification, Garak automated scanning, Inspect AI pattern matching) provides high-confidence model safety assessments. Gemma3's newly-discovered over-refusal pattern (50% compliance vs 98.3% refusal) reveals Google's safety tuning prioritizes harm minimization over utility, creating 1-in-3 false positive rate for technical users requiring category-specific calibration or user tier exemptions. The finding that pattern matching achieves 98% agreement with human judgment for well-defined refusal criteria validates efficient scoring approaches. Reserve expensive LLM judges for ambiguous semantic evaluation only. Structured JSON logging provides cryptographically verifiable artifacts with full prompt/response/metadata capture essential for government compliance, while 6x development speedup validates framework adoption for operational efficiency. Completes portfolio's custom to production transition demonstrating both technical depth (Experiments 01-06 custom implementations proving deep understanding) and operational proficiency (Experiments 07-08 production tool integration). These dual competencies are essential for government AI safety institute roles requiring both novel evaluation paradigm development and integration into existing AISI workflows.

Key Citations:


Resources

Foundational AI Safety Papers

Core Safety Problems:

Deceptive Alignment:

Evaluation & Red-Teaming

Evaluation Methodologies:

Red-Teaming Frameworks:

Safety Benchmarks:

Mechanistic Interpretability

Sparse Autoencoders & Features:

Circuits & Attention Analysis:

Interpretability Applications:

Safety Infrastructure & Defenses

Training & Alignment Techniques:

Adversarial Defenses:

Production Guardrails:

Risk Assessment & Governance

Dangerous Capabilities:

Scheming & Deception:

Industry Frameworks:

International AI Safety Collaboration

International Agreements:

UK AISI:

  • UK AISI, Frontier AI Trends Report (2024) - First public evidence-based assessment: capabilities doubling every 8 months, 40x improvement in safeguard robustness
  • UK AISI, Fourth Progress Report (2025) - Latest progress on technical blogs, international partnerships, evaluation approaches, staffing

Australian Context:

Tools & Frameworks

Evaluation Platforms:

  • UK AISI Inspect - Open-source Python framework for LLM evaluations with 100+ pre-built evals, multi-turn agent workflows, sandboxed execution
  • Inspect Evals Repository - Community-contributed evaluations covering cybersecurity (Cybench), safety (WMDP, StrongREJECT), agents (GAIA, SWE-Bench)
  • METR Autonomy Evaluation Protocol - Detailed practical protocol for running agent evaluations with elicitation guidelines and threat modeling

Red-Teaming Tools:

  • Microsoft PyRIT - Python Risk Identification Toolkit for automating red teaming of generative AI systems
  • Garak LLM Scanner - Automated LLM vulnerability scanner with comprehensive probe library
  • Anthropic Petri - Parallel exploration tool testing deception, sycophancy, power-seeking across 14 frontier models with 111 scenarios

Mechanistic Interpretability:

  • TransformerLens - Primary open-source library for mechanistic interpretability (Neel Nanda)
  • SAE Lens - Library for training and analyzing sparse autoencoders
  • Gemma Scope - 400+ open sparse autoencoders trained on Gemma 2 with 30M+ learned features (Google DeepMind)

Research Organizations:

About

Practical AI safety engineering: evaluation frameworks and red-teaming tools for assessing frontier AI systems

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages