Implement LLM-as-a-judge pairwise evaluation for YESciEval

The current version of the LLM-as-a-judge software supports the same input format for all rubrics: an input question, a set of paper titles and abstracts, and an answer. This works well when each data point is evaluated on its own, an approach more broadly known as the `pointwise` evaluation paradigm. See the description [here](https://github.com/sciknoworg/YESciEval/issues/7).

Now the library needs to be extended in two ways:
1. The structural organization of the code has to change. For example, the import statement for one rubric should change from `from yescieval import Informativeness` to either `from yescieval.pointwise import Informativeness` or `from yescieval.pairwise import Informativeness`. This leads to the next point.
2. Every rubric has to be implemented with a pairwise counterpart, which means writing a new description for each metric.
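
As a rough illustration of the two points above, the sketch below mimics the proposed namespaces with plain Python objects. Everything in it is an assumption for discussion: the `Rubric` dataclass, its fields, and both description strings are hypothetical placeholders, and `SimpleNamespace` merely stands in for the `yescieval.pointwise` and `yescieval.pairwise` submodules, which do not exist yet.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container pairing a rubric name with a paradigm-specific description."""
    name: str
    paradigm: str      # "pointwise" or "pairwise"
    description: str   # placeholder wording, not the library's actual rubric text


# Stand-in for the proposed `yescieval.pointwise` submodule (assumption).
pointwise = SimpleNamespace(
    Informativeness=Rubric(
        name="Informativeness",
        paradigm="pointwise",
        description="Rate how informative the answer is on a 1-5 scale.",
    ),
)

# Stand-in for the proposed `yescieval.pairwise` submodule (assumption).
pairwise = SimpleNamespace(
    Informativeness=Rubric(
        name="Informativeness",
        paradigm="pairwise",
        description="Decide whether answer A or answer B is more informative, or declare a tie.",
    ),
)
```

The same rubric name resolves to a different description depending on the namespace it is taken from, which is why the restructuring in point 1 and the new descriptions in point 2 go together.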

**Background**

1. Single (pointwise) judging. You give the judge one candidate output (plus the task context), and it returns an absolute score (e.g., 1–5 Likert, 1–10) and often a rationale. This is the style implemented in YESciEval (rubric → Likert score + rationale) as of the creation date of this issue. The G-Eval work [1] is an early, exemplary instance of pointwise evaluation, even though its authors do not explicitly name the paradigm as pointwise.
2. Pairwise judging. You give the judge two candidate outputs for the same input/context, and it returns a preference (A wins / B wins / tie), sometimes with a margin and/or rubric-level justifications. Early LLM-as-a-judge work emphasized exactly this “pairwise preference” framing, i.e., “which response is superior,” before (and alongside) richer rubric scorecards. See [2].
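
To make the contrast concrete, here is a minimal sketch of what a pairwise call could look like. It is not the YESciEval API: `PairwiseInput`, `PairwiseVerdict`, `build_pairwise_prompt`, and `judge_pairwise` are hypothetical names, the prompt wording is a placeholder, and the judge is assumed to be any callable that takes a prompt string and returns text whose first line is `A`, `B`, or `tie`, followed by a rationale.

```python
from dataclasses import dataclass
from typing import Callable, List

VALID_PREFERENCES = {"A", "B", "tie"}


@dataclass
class PairwiseInput:
    """Same context as the pointwise format, but with two candidate answers."""
    question: str
    papers: List[str]   # paper titles and abstracts
    answer_a: str
    answer_b: str


@dataclass
class PairwiseVerdict:
    preference: str     # "A", "B", or "tie"
    rationale: str


def build_pairwise_prompt(rubric_description: str, item: PairwiseInput) -> str:
    """Assemble a placeholder prompt; the real wording is still to be written."""
    papers = "\n".join(f"- {p}" for p in item.papers)
    return (
        f"{rubric_description}\n\n"
        f"Question: {item.question}\n"
        f"Papers:\n{papers}\n\n"
        f"Answer A: {item.answer_a}\n"
        f"Answer B: {item.answer_b}\n\n"
        "Reply with A, B, or tie on the first line, then a rationale."
    )


def judge_pairwise(
    judge: Callable[[str], str],
    rubric_description: str,
    item: PairwiseInput,
) -> PairwiseVerdict:
    """Query the judge and parse its reply into a preference plus rationale."""
    raw = judge(build_pairwise_prompt(rubric_description, item))
    first_line, _, rest = raw.strip().partition("\n")
    label = first_line.strip()
    preference = "tie" if label.lower() == "tie" else label.upper()
    if preference not in VALID_PREFERENCES:
        raise ValueError(f"Unparseable preference: {first_line!r}")
    return PairwiseVerdict(preference=preference, rationale=rest.strip())


# Usage with a stub judge standing in for an LLM call.
def stub_judge(prompt: str) -> str:
    return "B\nAnswer B draws on more of the provided abstracts."


example = PairwiseInput(
    question="What is LLM-as-a-judge?",
    papers=["Title 1: abstract 1", "Title 2: abstract 2"],
    answer_a="A short answer.",
    answer_b="A longer answer grounded in the abstracts.",
)
verdict = judge_pairwise(stub_judge, "Decide which answer is more informative.", example)
```

Unlike the pointwise case, the return value carries no absolute score, only a preference among the two candidates and the judge's justification.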

**References**
1. [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://aclanthology.org/2023.emnlp-main.153/) (Liu et al., EMNLP 2023)
2. [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://dl.acm.org/doi/10.5555/3666122.3668142) (Zheng et al., NeurIPS 2023)
