Human Evaluation Data

The raw (anonymized) human evaluations we collected can be downloaded in this CSV file.

The columns are:

HITId: Integer indexing the row
WorkerId: Unique identifier of the crowd-worker
WorkTimeInSeconds: Amount of time the HIT was open on AMT
Input.idx: Index of the prompt
Input.ctx: Context/prompt that each completion is based upon
Input.model_a: Name of player A
Input.completiona: Completion generated by player A
Input.len_a: Total length (prompt + generation) of player A's text
Input.model_b: Name of player B
Input.completionb: Completion generated by player B
Input.len_b: Total length (prompt + generation) of player B's text
Answer.q1: Answer of crowd-worker to the question: "Which continuation is more interesting or creative, given the context?"
Answer.q2: Answer of crowd-worker to the question: "Which continuation makes more sense, given the context?"
Answer.q3: Answer of crowd-worker to the question: "Which continuation is more likely to be written by a human?"
Answer.te: Our (pessimistic) estimate of the amount of time the crowd-worker took to answer the question.
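As a rough illustration of how these columns might be read, here is a minimal sketch using only the Python standard library. The choice of which columns to convert to numbers, the helper name `read_evaluations`, and the sample row are assumptions made for illustration; they are not taken from the data itself.

```python
import csv
import io

# Assumption: these columns hold numeric values; all others are kept as strings.
FLOAT_COLUMNS = ("WorkTimeInSeconds", "Input.len_a", "Input.len_b", "Answer.te")


def read_evaluations(handle):
    """Read the evaluation CSV from an open text handle into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        # HITId is described above as an integer row index.
        row["HITId"] = int(row["HITId"])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Made-up sample with a subset of the columns, only to show the call.
sample = io.StringIO(
    "HITId,WorkTimeInSeconds,Input.len_a,Input.len_b,Answer.q1,Answer.te\n"
    "0,35,120,98,2a,12.5\n"
)
rows = read_evaluations(sample)
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV was saved.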

Key to Answer.q* fields: The responses of the crowd-workers to each question are stored with the following key:

Definitely A: 2a
Slightly A: 1a
Tie: 1a
Slightly B: 1b
Definitely B: 2b

Note that both "Tie" and "Slightly A" are recorded as 1a, so the two cannot be told apart in the data. Because the assignment of the two models to A and B is randomized for each pair, this amounts to randomly assigning each tie as a win to one of the two players.
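The key above can be decoded mechanically: the digit gives the strength of the preference and the letter gives the preferred player. The following is a minimal sketch under that reading; the function names `decode_answer` and `count_wins` are hypothetical, and rows are assumed to be dicts keyed by the column names listed earlier.

```python
from collections import Counter

VALID_CODES = {"2a", "1a", "1b", "2b"}


def decode_answer(code):
    """Split a code such as '2a' into (preferred player, strength).

    Strength 2 means "definitely"; strength 1 means "slightly" (or, for 1a, a tie).
    """
    code = code.strip().lower()
    if code not in VALID_CODES:
        raise ValueError(f"unexpected answer code: {code!r}")
    return code[1], int(code[0])


def count_wins(rows, question="Answer.q1"):
    """Count wins per model name for one question, ignoring strength."""
    wins = Counter()
    for row in rows:
        player, _strength = decode_answer(row[question])
        # Map the preferred player (a or b) back to the model name for this pair.
        wins[row["Input.model_" + player]] += 1
    return wins
```

Because ties are stored as 1a, a count like this treats each tie as a win for whichever model happened to be player A in that pair.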
