Each problem is represented as a combination of a pair of corpora and a research goal. See below for an example.
`pairs.yaml' contains most of the metadata:
- The highest level of structure is each dataset's abbreviation (e.g.
abc_headlines
) - Each dataset contains several generations, which describe different ways of splitting a dataset.
- A generation description is comprised of the splitting feature (e.g.
the year they were published
). - Applications are comprised of a
target
anduser
with someexample_hypotheses
. - The
v2-origid
field is for internal tracking and is non-essential. - If the context should allow for any kind of hypothesis, the
purely_exploratory
flag is set toTrue
. - The
pair_type
is assigned according to the taxonomy outlined below. - If a
flip
is appropriate, the user should consider both the original and swapped (B vs. A) versions of the problem.
- A generation description is comprised of the splitting feature (e.g.
- Each generation contains a list of
pairs
.- Each pair has
pos_desc
andneg_desc
(e.g.are ABC news headlines from 2007
) - The classes correspond to distribution names.
- Each pair has
dataset.yaml
contains dataset-level features:
- A description of the types of text samples, e.g.
headlines published by ABC news, an American news company
. - The
discipline
andexpertise
required for the dataset. - The
status
of the dataset, which will be:public
if there is a public license.private
if the dataset was privately shared.accessible
if the dataset can be accessed easily but has unclear license.
- The preprocessing steps, which are mostly for internal tracking.