The raw (anonymized) human evaluations we obtained are available for download in this CSV file.
The columns are:
- HITId: Integer indexing the row
- WorkerId: Unique identifier of the crowd-worker
- WorkTimeInSeconds: Amount of time the HIT was open on AMT
- Input.idx: Index of the prompt
- Input.ctx: Context/prompt that each completion is based upon
- Input.model_a: Name of player A
- Input.completiona: Completion generated by player A
- Input.len_a: Total length (prompt + generation) of player A's text
- Input.model_b: Name of player B
- Input.completionb: Completion generated by player B
- Input.len_b: Total length (prompt + generation) of player B's text
- Answer.q1: Answer of crowd-worker to the question: "Which continuation is more interesting or creative, given the context?"
- Answer.q2: Answer of crowd-worker to the question: "Which continuation makes more sense, given the context?"
- Answer.q3: Answer of crowd-worker to the question: "Which continuation is more likely to be written by a human?"
- Answer.te: Our (pessimistic) estimate of the amount of time the crowd-worker took to answer the question.
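As a minimal sketch of how the file might be read, the snippet below uses Python's standard csv module and checks that the header contains the columns listed above. The function name read_evaluations and the file name in the usage note are hypothetical, and the sketch assumes the file has a header row with exactly these column names.

```python
import csv

# Column names as listed above (assumed to match the CSV header exactly).
EXPECTED_COLUMNS = [
    "HITId", "WorkerId", "WorkTimeInSeconds",
    "Input.idx", "Input.ctx",
    "Input.model_a", "Input.completiona", "Input.len_a",
    "Input.model_b", "Input.completionb", "Input.len_b",
    "Answer.q1", "Answer.q2", "Answer.q3", "Answer.te",
]


def read_evaluations(handle):
    """Read all rows from an open CSV file handle as dictionaries.

    Raises ValueError if any expected column is missing from the header.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)
```

For example, with a hypothetical local copy named human_evaluations.csv, one could call `read_evaluations(open("human_evaluations.csv", newline="", encoding="utf-8"))`. All values are returned as strings, so numeric columns such as Input.len_a would still need converting.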
Key to Answer.q* fields: The responses of the crowd-workers to each question are stored with the following key:
- Definitely A: 2a
- Slightly A: 1a
- Tie: 1a
- Slightly B: 1b
- Definitely B: 2b
Note that both "Tie" and "Slightly A" are recorded as 1a. Since for each pair, the choice of A versus B is randomized, this amounts to randomly assigning each tie as a win to one of the two players.
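The key above can be turned into per-model win counts along the following lines. This is only an illustrative sketch: the names ANSWER_KEY, decode_answer and count_wins are hypothetical, rows are assumed to be dictionaries keyed by the column names, and the stored codes are assumed to match the key exactly (apart from surrounding whitespace).

```python
from collections import Counter

# Answer key from the list above. "Tie" shares the code "1a" with
# "Slightly A", so ties cannot be told apart from slight wins for A.
ANSWER_KEY = {
    "2a": ("a", 2),  # Definitely A
    "1a": ("a", 1),  # Slightly A, or Tie
    "1b": ("b", 1),  # Slightly B
    "2b": ("b", 2),  # Definitely B
}


def decode_answer(code):
    """Return (preferred side, strength) for a code such as '2a'."""
    return ANSWER_KEY[code.strip()]


def count_wins(rows, question="Answer.q1"):
    """Count, per model name, how often it was preferred on one question."""
    wins = Counter()
    for row in rows:
        side, _strength = decode_answer(row[question])
        wins[row["Input.model_" + side]] += 1
    return wins
```

Because the A and B positions are randomized per pair, counting every 1a as a win for player A matches the tie-handling described above rather than favoring a particular model.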