Red Teaming

Capabilities Assessment

  • Benchmark performance on representative tasks and datasets.
  • Measure capabilities such as accuracy, robustness, and efficiency.
  • Identify strengths, limitations, and gaps (a minimal evaluation sketch follows this list).
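
A minimal benchmark-loop sketch, assuming a hypothetical query_model(prompt) -> str helper and a toy dataset of (prompt, expected answer) pairs; it reports exact-match accuracy only, and a real evaluation would add task-specific metrics for robustness and efficiency.

```python
# Minimal benchmark sketch; query_model is a hypothetical helper for the model under test.
from typing import Callable, Iterable, Tuple

def evaluate_accuracy(
    query_model: Callable[[str], str],
    dataset: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of prompts whose output matches the expected answer."""
    total, correct = 0, 0
    for prompt, expected in dataset:
        output = query_model(prompt)
        # Exact-match scoring; real benchmarks often need task-specific graders.
        correct += int(output.strip().lower() == expected.strip().lower())
        total += 1
    return correct / max(total, 1)

# Example with a toy dataset:
# accuracy = evaluate_accuracy(query_model, [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")])
```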

Adversarial Testing

  • Probe with malformed and adversarial inputs.
  • Check for crashes, unintended behavior, and security risks.
  • Informed by threat models and risk analysis (see the probing sketch below).
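
A minimal sketch of malformed-input probing, again assuming a hypothetical query_model helper; it mutates seed prompts and records crashes or empty responses for later triage.

```python
# Adversarial/malformed-input probing sketch; query_model is a hypothetical helper.
import random

def mutate(prompt: str) -> str:
    """Apply one random malformation to a seed prompt."""
    mutations = [
        lambda s: s + "\x00\x00\x00",                   # embedded null bytes
        lambda s: s * 20,                               # extreme length / repetition
        lambda s: s[::-1],                              # reversed text
        lambda s: s.encode("utf-8").decode("latin-1"),  # encoding confusion
    ]
    return random.choice(mutations)(prompt)

def probe(query_model, seed_prompts, trials_per_seed=5):
    """Collect crashes and suspicious outputs; findings feed the risk analysis."""
    findings = []
    for seed in seed_prompts:
        for _ in range(trials_per_seed):
            malformed = mutate(seed)
            try:
                response = query_model(malformed)
            except Exception as exc:  # a crash is itself a finding
                findings.append({"input": malformed, "issue": repr(exc)})
                continue
            if not response:
                findings.append({"input": malformed, "issue": "empty response"})
    return findings
```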

Red Teaming

  • Model potential real-world risks and failures.
  • Role-play adversary perspectives.
  • Surface risks unique to AI systems.

Human Oversight

  • Manual test cases based on human judgment (a structured example follows this list).
  • Qualitative feedback on subtle flaws.
  • Values alignment evaluation.
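
One way to keep manual review systematic is a small record type for human-authored test cases; the fields below are illustrative rather than a fixed schema.

```python
# Illustrative structure for human-authored test cases and reviewer feedback.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HumanReviewCase:
    prompt: str                        # test input chosen by a human reviewer
    concern: str                       # e.g. "values alignment", "subtle factual flaw"
    model_response: str = ""           # filled in after the model is queried
    reviewer_notes: List[str] = field(default_factory=list)  # qualitative feedback
    acceptable: Optional[bool] = None  # reviewer's final judgment

# case = HumanReviewCase(prompt="Explain the trade-offs of X to a novice",
#                        concern="values alignment")
```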

Attack Examples

Mosaic Prompt: breaking a prompt down into permissible components

  • Users break impermissible content down into small, individually permissible components.
  • Each component is queried independently and appears harmless.
  • The user then recombines the responses to reconstruct the impermissible content.
  • Exploits the compositionality of language (see the sketch below).
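
A minimal sketch of how a red team might exercise this pattern, assuming a hypothetical query_model helper; the components are benign placeholders, since the point is the decompose-query-recombine structure rather than any particular content.

```python
# Mosaic-prompt test pattern; query_model is a hypothetical helper and the
# components are benign placeholders used only to exercise the pattern.
def mosaic_probe(query_model, components):
    """Query each component independently, then recombine the answers."""
    partial_answers = []
    for component in components:
        # Each sub-query looks innocuous in isolation.
        partial_answers.append(query_model(component))
    # The tester recombines the pieces and checks whether the joined result
    # reconstructs content the model should have refused when asked whole.
    return "\n".join(partial_answers)

# Example: split one refused prompt into individually permissible sub-questions.
# recombined = mosaic_probe(query_model, ["Sub-question 1 ...", "Sub-question 2 ..."])
```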

Cross-Lingual Attacks: translating prompts from high-resource into low-resource languages to attack multilingual capability

  • The attack translates unsafe English input prompts into low-resource natural languages using Google Translate.
  • Low-resource languages are those with limited training data, such as Zulu.
  • The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
  • The attack exploits the uneven multilingual training behind GPT-4's safety measures (see the harness sketch below).
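
A minimal evaluation-harness sketch, assuming hypothetical translate(text, target_lang) and query_model(prompt) helpers that stand in for a translation service and the model under test (Google Translate and GPT-4 in the description above); it compares refusal behavior on English prompts against their low-resource translations.

```python
# Cross-lingual probe sketch; translate() and query_model() are hypothetical helpers.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def refused(response: str) -> bool:
    """Crude keyword-based refusal check; real evaluations use human or model grading."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def cross_lingual_probe(translate, query_model, unsafe_prompts, target_lang="zu"):
    """Compare refusals on English prompts vs. their low-resource translations."""
    results = []
    for prompt in unsafe_prompts:
        english_refused = refused(query_model(prompt))
        translated = translate(prompt, target_lang)   # e.g. Zulu ("zu")
        reply = query_model(translated)
        # Translate the reply back to English before the keyword refusal check.
        translated_refused = refused(translate(reply, "en"))
        results.append({
            "prompt": prompt,
            "refused_in_english": english_refused,
            "refused_in_low_resource": translated_refused,
        })
    return results
```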
