Implement LLM-as-a-judge pairwise evaluation for YESciEval #8

@jd-coderepos

Description

The current version of the LLM-as-a-judge software supports the same input format for all rubrics: an input question, a set of paper titles and abstracts, and an answer. This works well for evaluating a single data point at a time, which is more broadly known as the pointwise evaluation paradigm. See the description here.

Now the library needs to be extended in two ways:

  1. The code needs to be reorganized by evaluation paradigm. For example, the import for a single rubric should change from `from yescieval import Informativeness` to either `from yescieval.pointwise import Informativeness` or `from yescieval.pairwise import Informativeness`. This leads to the next point.
  2. Every rubric needs a pairwise counterpart, which means writing a new description for each metric.
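
To make the two points above concrete, here is a minimal sketch of what a pairwise counterpart of one rubric might look like next to its pointwise form. Everything in it is an assumption for illustration: the subpackage layout in the comment, the class names, the field names `answer_a` and `answer_b`, the `instruct` method, and the prompt wording are hypothetical and not the existing YESciEval API.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical layout (an assumption, not the current repository structure):
#   yescieval/pointwise/__init__.py  -> exports Informativeness, ...
#   yescieval/pairwise/__init__.py   -> exports Informativeness, ...
# The two classes below would then share one name and differ by subpackage.


@dataclass(frozen=True)
class PointwiseInformativeness:
    """Illustrative pointwise rubric input: one answer to rate."""
    question: str
    papers: Mapping[str, str]  # paper title -> abstract
    answer: str


@dataclass(frozen=True)
class PairwiseInformativeness:
    """Illustrative pairwise rubric input: two answers to compare."""
    question: str
    papers: Mapping[str, str]  # paper title -> abstract
    answer_a: str
    answer_b: str

    def instruct(self) -> str:
        """Build a comparison prompt; the wording is a placeholder."""
        context = "\n".join(
            f"- {title}: {abstract}" for title, abstract in self.papers.items()
        )
        return (
            "Compare the two answers for informativeness.\n"
            f"Question: {self.question}\n"
            f"Papers:\n{context}\n"
            f"Answer A: {self.answer_a}\n"
            f"Answer B: {self.answer_b}\n"
            "Reply with A, B, or tie, followed by a short rationale."
        )


# Usage with made-up data.
example = PairwiseInformativeness(
    question="How do judges score summaries?",
    papers={"Paper One": "An abstract about judging."},
    answer_a="Judges assign scores.",
    answer_b="Judges compare outputs using rubrics.",
)
prompt = example.instruct()
```

The only structural difference between the two inputs is the second answer, which is why each rubric needs a new description: the instruction has to ask for a comparison instead of an absolute rating.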

Background

  1. Single (pointwise) judging. You give the judge one candidate output (plus the task context), and it returns an absolute score (e.g., 1–5 Likert, 1–10) and often a rationale. This is the style implemented in YESciEval (rubric → Likert score + rationale) as of the creation date of this issue. The G-Eval work [1] is one of the early exemplary works doing pointwise evaluation, even though it does not explicitly name the paradigm as pointwise.
  2. Pairwise judging. You give the judge two candidate outputs for the same input/context, and it returns a preference (A wins / B wins / tie), sometimes with a margin and/or rubric-level justifications. Early LLM-as-a-judge work emphasized exactly this “pairwise preference” framing, i.e., “which response is superior,” before (and alongside) richer rubric scorecards. See [2].
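
The difference between the two output shapes described above can be sketched as follows. This is an illustration under stated assumptions, not the YESciEval implementation: the type names, the dictionary keys `score`, `winner`, and `rationale`, and the 1–5 range are all hypothetical choices, and the judge's raw reply is assumed to have already been decoded into a dictionary.

```python
from dataclasses import dataclass
from typing import Any, Literal, Mapping


@dataclass(frozen=True)
class PointwiseVerdict:
    score: int       # absolute rating; a 1-5 Likert scale is assumed here
    rationale: str


@dataclass(frozen=True)
class PairwiseVerdict:
    winner: Literal["A", "B", "tie"]  # relative preference, no absolute score
    rationale: str


def parse_pointwise(raw: Mapping[str, Any]) -> PointwiseVerdict:
    """Validate a decoded pointwise reply (hypothetical keys)."""
    score = int(raw["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return PointwiseVerdict(score=score, rationale=str(raw.get("rationale", "")))


def parse_pairwise(raw: Mapping[str, Any]) -> PairwiseVerdict:
    """Validate a decoded pairwise reply (hypothetical keys)."""
    winners = {"a": "A", "b": "B", "tie": "tie"}
    label = str(raw["winner"]).strip().lower()
    if label not in winners:
        raise ValueError(f"unknown winner label: {raw['winner']!r}")
    return PairwiseVerdict(
        winner=winners[label], rationale=str(raw.get("rationale", ""))
    )
```

A pointwise verdict can be stored and compared across runs because it is an absolute number, while a pairwise verdict only has meaning relative to the two candidates that were shown together.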

References

  1. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., EMNLP 2023)
  2. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., NeurIPS 2023)

Labels

enhancement (New feature or request)
