Goal
Have two human annotators judge 100 QA pairs from LME-M, then compute Cohen's kappa between the annotators and between each annotator and our LLM judge. This validates the reported 52.6% accuracy figure.
Why
Claude judging Claude's answers is circular, and NeurIPS reviewers will flag it. Human evaluation is the gold standard, and even a small sample (100 questions) with reported inter-annotator agreement is enough to validate the LLM judge.
Protocol
- Sample 100 questions from LME-M (stratified across question types)
- Present each annotator with: question, gold answer, predicted answer
- Annotator marks: correct / incorrect / partially correct
- Compute: human accuracy, Cohen's kappa (human-human), Cohen's kappa (human-LLM judge); see the sketch after this list
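The sampling and agreement computations are straightforward; below is a minimal sketch, assuming QA records are dicts with a question_type field and that ratings are collected as rating_a, rating_b, and judge_rating (all hypothetical names, not the actual LME-M schema). It uses sklearn's cohen_kappa_score, which handles the three-way correct / incorrect / partially-correct labels directly.

```python
import random
from collections import defaultdict

from sklearn.metrics import cohen_kappa_score


def stratified_sample(records, n=100, seed=0):
    """Draw n records, allocated proportionally across question types."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for r in records:  # group records by their (assumed) question_type field
        by_type[r["question_type"]].append(r)
    sample = []
    for group in by_type.values():
        k = round(n * len(group) / len(records))  # proportional allocation
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample[:n]  # rounding can overshoot by an item or two


def agreement_stats(rated):
    """rated: list of dicts with rating_a, rating_b, judge_rating,
    each one of 'correct' / 'incorrect' / 'partially correct'."""
    a = [r["rating_a"] for r in rated]
    b = [r["rating_b"] for r in rated]
    judge = [r["judge_rating"] for r in rated]
    both = a + b
    return {
        "human_accuracy": sum(x == "correct" for x in both) / len(both),
        "kappa_human_human": cohen_kappa_score(a, b),
        "kappa_a_vs_judge": cohen_kappa_score(a, judge),
        "kappa_b_vs_judge": cohen_kappa_score(b, judge),
    }
```

Proportional allocation by rounding can drift off 100 by an item or two; truncating (or topping up from the largest stratum) is fine at this scale.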
Acceptance criteria
- Cohen's kappa >= 0.7 (human vs LLM judge)
- If below: report human numbers as primary, LLM judge as secondary
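For context on the 0.7 bar: Cohen's kappa corrects raw agreement for chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected from the raters' label marginals alone. On the common Landis & Koch scale, 0.61-0.80 counts as "substantial" agreement, so 0.7 is a defensible threshold to cite.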
Steps
- Export 100 QA pairs to a spreadsheet (see the export sketch after this list)
- Find 2 annotators (can be co-authors or colleagues)
- Independent rating
- Compute agreement statistics
- Add to paper as validation of evaluation methodology
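The export step is mechanical; here is a sketch using the stdlib csv module, with column names chosen for illustration rather than taken from the real LME-M export schema. Each annotator gets an identical copy and fills the empty rating column independently.

```python
import csv


def export_for_annotation(sample, path="annotation_sheet.csv"):
    """Write sampled QA pairs to a CSV annotators can open as a spreadsheet."""
    fields = ["id", "question", "gold_answer", "predicted_answer", "rating"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in sample:
            row = {k: r.get(k, "") for k in fields[:-1]}
            row["rating"] = ""  # annotator fills: correct / incorrect / partially correct
            writer.writerow(row)
```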
Cost
Human time only (~2-3 hours per annotator)