Goal
Have two human annotators judge 100 QA pairs from LME-M, then compute Cohen's kappa between the annotators and between each annotator and our LLM judge. This validates the reported 52.6% accuracy figure.
Why
Claude judging Claude's answers is circular, and NeurIPS reviewers will flag it. Human evaluation is the gold standard, and even a small sample (100 questions) with reported inter-annotator agreement is enough to validate the LLM judge.
Protocol
- Sample 100 questions from LME-M (stratified across question types)
- Present each annotator with: question, gold answer, predicted answer
- Annotator marks: correct / incorrect / partially correct
- Compute: human accuracy, Cohen's kappa (human-human), Cohen's kappa (human-LLM judge); see the sketch after this list
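The sampling and agreement computations are straightforward; below is a minimal sketch, assuming QA records are dicts with a question_type field and that ratings are collected as rating_a, rating_b, and judge_rating (all hypothetical names, not the actual LME-M schema). It uses sklearn's cohen_kappa_score, which handles the three-way correct / incorrect / partially-correct labels directly.

```python
import random
from collections import defaultdict

from sklearn.metrics import cohen_kappa_score


def stratified_sample(records, n=100, seed=0):
    """Draw n records, allocated proportionally across question types."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for r in records:  # group records by their (assumed) question_type field
        by_type[r["question_type"]].append(r)
    sample = []
    for group in by_type.values():
        k = round(n * len(group) / len(records))  # proportional allocation
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample[:n]  # rounding can overshoot by an item or two


def agreement_stats(rated):
    """rated: list of dicts with rating_a, rating_b, judge_rating,
    each one of 'correct' / 'incorrect' / 'partially correct'."""
    a = [r["rating_a"] for r in rated]
    b = [r["rating_b"] for r in rated]
    judge = [r["judge_rating"] for r in rated]
    both = a + b
    return {
        "human_accuracy": sum(x == "correct" for x in both) / len(both),
        "kappa_human_human": cohen_kappa_score(a, b),
        "kappa_a_vs_judge": cohen_kappa_score(a, judge),
        "kappa_b_vs_judge": cohen_kappa_score(b, judge),
    }
```

Proportional allocation by rounding can drift off 100 by an item or two; truncating (or topping up from the largest stratum) is fine at this scale.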
Acceptance criteria
- Cohen's kappa >= 0.7 (human vs LLM judge)
- If below: report human numbers as primary, LLM judge as secondary
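For context on the 0.7 bar: Cohen's kappa corrects raw agreement for chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected from the raters' label marginals alone. On the common Landis & Koch scale, 0.61-0.80 counts as "substantial" agreement, so 0.7 is a defensible threshold to cite.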
Steps
- Export 100 QA pairs to a spreadsheet (see the export sketch after this list)
- Find 2 annotators (can be co-authors or colleagues)
- Independent rating
- Compute agreement statistics
- Add to paper as validation of evaluation methodology
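The export step is mechanical; here is a sketch using the stdlib csv module, with column names chosen for illustration rather than taken from the real LME-M export schema. Each annotator gets an identical copy and fills the empty rating column independently.

```python
import csv


def export_for_annotation(sample, path="annotation_sheet.csv"):
    """Write sampled QA pairs to a CSV annotators can open as a spreadsheet."""
    fields = ["id", "question", "gold_answer", "predicted_answer", "rating"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in sample:
            row = {k: r.get(k, "") for k in fields[:-1]}
            row["rating"] = ""  # annotator fills: correct / incorrect / partially correct
            writer.writerow(row)
```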
Cost
Human time only (~2-3 hours per annotator)