
NeurIPS: human evaluation — validate LLM judge with annotator agreement #29

@hdviettt

Description

Goal

Have two human annotators judge 100 QA pairs from LME-M, then compute Cohen's kappa between the humans and our LLM judge. This validates the 52.6% number.

Why

Claude judging Claude's answers is circular. NeurIPS reviewers will flag this. Human evaluation is the gold standard. Even a small sample (100 questions) with inter-annotator agreement is sufficient.

Protocol

  1. Sample 100 questions from LME-M (stratified across question types)
  2. Present each annotator with: question, gold answer, predicted answer
  3. Annotator marks: correct / incorrect / partially correct
  4. Compute: human accuracy, Cohen's kappa (human-human), kappa (human-LLM judge)
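Step 4 can be sketched directly; `sklearn.metrics.cohen_kappa_score` computes the same statistic, but the formula is small enough to write out. The label vectors below are illustrative placeholders, not real annotations:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's marginal label counts.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement assuming independent raters with these marginals
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    # degenerate case: both raters used a single identical label everywhere
    return 1.0 if p_e >= 1 else (p_o - p_e) / (1 - p_e)

# hypothetical ratings: one entry per QA pair, per rater
human1 = ["correct", "partial", "incorrect", "correct"]
human2 = ["correct", "incorrect", "incorrect", "correct"]
llm    = ["correct", "partial", "incorrect", "incorrect"]

print(f"human-human kappa: {cohen_kappa(human1, human2):.2f}")
print(f"human-LLM kappa:   {cohen_kappa(human1, llm):.2f}")
```

This treats correct / partially correct / incorrect as unordered categories; a weighted kappa that penalizes correct-vs-incorrect disagreement more than correct-vs-partial is a reasonable variant to report alongside it.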

Acceptance criteria

  • Cohen's kappa >= 0.7 (human vs LLM judge)
  • If below 0.7: report the human evaluation numbers as primary and the LLM-judge numbers as secondary

Steps

  1. Export 100 QA pairs to a spreadsheet
  2. Find 2 annotators (can be co-authors or colleagues)
  3. Independent rating
  4. Compute agreement statistics
  5. Add to paper as validation of evaluation methodology
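Steps 1-2 (stratified sampling and export to a spreadsheet) can be sketched as below. The field names (`question_type`, `question`, `gold`, `pred`) and the CSV layout are assumptions, not the actual LME-M schema:

```python
import csv
import random
from collections import defaultdict

def stratified_sample(qa_pairs, n_total, key="question_type", seed=0):
    """Sample n_total QA pairs, proportionally across question types."""
    random.seed(seed)  # fixed seed so both annotators see the same set
    by_type = defaultdict(list)
    for qa in qa_pairs:
        by_type[qa[key]].append(qa)
    sample = []
    for qtype, items in sorted(by_type.items()):
        # at least one item per type, otherwise proportional to type size
        k = max(1, round(n_total * len(items) / len(qa_pairs)))
        sample.extend(random.sample(items, min(k, len(items))))
    return sample[:n_total]

def export_annotation_sheet(sample, path="annotation_sheet.csv"):
    """One row per QA pair; annotators fill in the empty rating column."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question", "gold_answer", "predicted_answer", "rating"]
        )
        writer.writeheader()
        for qa in sample:
            writer.writerow({
                "question": qa["question"],
                "gold_answer": qa["gold"],
                "predicted_answer": qa["pred"],
                "rating": "",  # correct / partial / incorrect
            })
```

Note the sheet deliberately omits any model identifier or judge verdict, so annotators rate blind; each annotator gets their own copy of the same CSV.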

Cost

Human time only (~2-3 hours per annotator)

Metadata

Labels: eval (Evaluation pipeline)
