# Human Evaluation Data

The raw (anonymized) human evaluations we obtained can be downloaded from this CSV file.

The columns are:

- `HITId`: Integer indexing the row
- `WorkerId`: Unique identifier of the crowd-worker
- `WorkTimeInSeconds`: Amount of time the HIT was open on AMT
- `Input.idx`: Index of the prompt
- `Input.ctx`: Context/prompt that each completion is based upon
- `Input.model_a`: Name of player A
- `Input.completiona`: Completion generated by player A
- `Input.len_a`: Total length (prompt + generation) of player A's text
- `Input.model_b`: Name of player B
- `Input.completionb`: Completion generated by player B
- `Input.len_b`: Total length (prompt + generation) of player B's text
- `Answer.q1`: Answer of the crowd-worker to the question: "Which continuation is more interesting or creative, given the context?"
- `Answer.q2`: Answer of the crowd-worker to the question: "Which continuation makes more sense, given the context?"
- `Answer.q3`: Answer of the crowd-worker to the question: "Which continuation is more likely to be written by a human?"
- `Answer.te`: Our (pessimistic) estimate of the amount of time the crowd-worker took to answer the questions.
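
As an illustration only, the sketch below loads the CSV with pandas and runs a few sanity checks on these columns. The file name `human_evaluation.csv` is an assumed placeholder; substitute the path of the downloaded file.

```python
import pandas as pd

# Placeholder path for the downloaded evaluation file (not the original file name).
df = pd.read_csv("human_evaluation.csv")

# Sanity checks: available columns, which models appear as players,
# and how many pairwise comparisons each crowd-worker contributed.
print(df.columns.tolist())
print(df["Input.model_a"].unique())
print(df["Input.model_b"].unique())
print(df.groupby("WorkerId").size().describe())
```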

**Key to `Answer.q*` fields:** The responses of the crowd-workers to each question are stored with the following key:

- Definitely A: `2a`
- Slightly A: `1a`
- Tie: `1a`
- Slightly B: `1b`
- Definitely B: `2b`

Note that both "Tie" and "Slightly A" are recorded as `1a`. Since the assignment of A versus B is randomized for each pair, recording ties this way amounts to randomly assigning each tie as a win to one of the two players.
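
For a hedged example of how this key and the tie convention might be used, the sketch below tallies per-model win rates on `Answer.q1`. The aggregation is our illustration rather than the original analysis code, and the file path is again a placeholder.

```python
import pandas as pd

df = pd.read_csv("human_evaluation.csv")  # placeholder path for the downloaded file

def winner(row, question="Answer.q1"):
    # Codes ending in "a" (including ties, stored as "1a") count as a win
    # for player A; codes ending in "b" count as a win for player B.
    player = "Input.model_a" if str(row[question]).endswith("a") else "Input.model_b"
    return row[player]

# Wins per model, counting ties as wins for whichever player they were recorded under.
wins = df.apply(winner, axis=1).value_counts()

# Each model can appear in a comparison as either player A or player B.
appearances = df["Input.model_a"].value_counts().add(
    df["Input.model_b"].value_counts(), fill_value=0
)

print((wins / appearances).sort_values(ascending=False))
```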