- add Pythia
- add few shot evaluation
- https://arxiv.org/pdf/2310.06824.pdf
- Benchmark performance on representative tasks and datasets.
- Measure capabilities like accuracy, robustness, efficiency.
- Identify strengths, limitations, and gaps.
- Probe with malformed, adversarial inputs.
- Check for crashes, unintended behavior, security risks.
- Informed by threat models, risk analysis.
- Model potential real-world risks and failures.
- Role play adversary perspectives.
- Surface risks unique to AI.
- Manual test cases based on human judgment.
- Qualitative feedback on subtle flaws.
- Values alignment evaluation.
Mosaic Prompt : breakdown a prompt into permissible components
- Users break down impermissible content into small permissible components.
- Each component is queried independently and appears harmless.
- User recombines components to reconstruct impermissible content.
- Exploits compositionality of language.
- The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
- Low-resource languages are those with limited training data, like Zulu.
- The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
- The attack exploits uneven multilingual training of GPT-4's safety measures.