Open
Labels
enhancement (New feature or request)
Description
The current version of the LLM-as-a-judge software supports the following input format for all rubrics: an input question, a set of paper titles and abstracts, and an answer. This works well for evaluating a single data point at a time, which is more broadly known as the pointwise evaluation paradigm. See the description here.
Now the library needs to be extended in two ways:
- The structural organization of the code has to change. For example, the import statement for a rubric should change from `from yescieval import Informativeness` to either `from yescieval.pointwise import Informativeness` or `from yescieval.pairwise import Informativeness`. This leads to the next point.
- All rubrics have to be implemented with a pairwise counterpart. This would mean new descriptions for each metric.
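As a rough illustration of the two points above, the sketch below shows what a pointwise rubric and its pairwise counterpart might look like. Everything here is hypothetical and kept in one file for brevity: the class names, the `build_prompt` method, the `Paper` dataclass, and the instruction texts are assumptions for illustration, not the existing YESciEval API. In the proposed layout the two classes would presumably live in separate `pointwise` and `pairwise` modules and could then share the name `Informativeness`.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical container for one paper in the task context."""
    title: str
    abstract: str


def _format_papers(papers):
    """Render the papers as a numbered list of title/abstract pairs."""
    return "\n".join(
        f"[{i}] {p.title}: {p.abstract}" for i, p in enumerate(papers, start=1)
    )


class PointwiseInformativeness:
    """Hypothetical rubric; would sit in the pointwise module."""

    # Placeholder instruction text, not the real rubric description.
    instruction = "Rate the informativeness of the answer on a 1-5 scale."

    def build_prompt(self, question, papers, answer):
        # One candidate answer goes into the prompt.
        return (
            f"{self.instruction}\n\n"
            f"Question: {question}\n"
            f"Papers:\n{_format_papers(papers)}\n"
            f"Answer: {answer}"
        )


class PairwiseInformativeness:
    """Hypothetical rubric; would sit in the pairwise module."""

    # The pairwise variant needs its own description of the metric.
    instruction = "Decide which answer is more informative: A, B, or tie."

    def build_prompt(self, question, papers, answer_a, answer_b):
        # Two candidate answers for the same question and papers.
        return (
            f"{self.instruction}\n\n"
            f"Question: {question}\n"
            f"Papers:\n{_format_papers(papers)}\n"
            f"Answer A: {answer_a}\n"
            f"Answer B: {answer_b}"
        )
```

The only structural difference between the two is the second candidate answer and the reworded instruction, which is why each metric would need a new description for its pairwise form.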
Background
- Single (pointwise) judging. You give the judge one candidate output (plus the task context), and it returns an absolute score (e.g., 1–5 Likert, 1–10) and often a rationale. This is the style implemented in YESciEval (rubric → Likert score + rationale) as of the date of creation of this issue. [1] The G-Eval work referenced below is one of the early exemplary works on pointwise evaluation, even though it does not explicitly name the paradigm as pointwise.
- Pairwise judging. You give the judge two candidate outputs for the same input/context, and it returns a preference (A wins / B wins / tie), sometimes with a margin and/or rubric-level justifications. Early LLM-as-a-judge work emphasized exactly this “pairwise preference” framing, i.e., “which response is superior,” before (and alongside) richer rubric scorecards. See [2].
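The difference between the two output types can be sketched as follows. The type names (`PointwiseVerdict`, `PairwiseVerdict`, `Preference`) and the helper `compare_pointwise` are hypothetical and not part of YESciEval; the 1–5 range is one of the example scales mentioned above. Note that `compare_pointwise` only derives a preference from two independently produced scores to show how the result shapes relate. A real pairwise judge would see both candidates in a single prompt.

```python
from dataclasses import dataclass
from enum import Enum


class Preference(Enum):
    """Possible outcomes of a pairwise comparison."""
    A = "A"
    B = "B"
    TIE = "tie"


@dataclass(frozen=True)
class PointwiseVerdict:
    """Absolute score for one candidate, assumed here to be 1-5 Likert."""
    score: int
    rationale: str

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be in 1..5, got {self.score}")


@dataclass(frozen=True)
class PairwiseVerdict:
    """Relative judgment between two candidates for the same input."""
    preference: Preference
    rationale: str


def compare_pointwise(a, b):
    """Derive a pairwise preference from two pointwise verdicts."""
    if a.score > b.score:
        preference = Preference.A
    elif a.score < b.score:
        preference = Preference.B
    else:
        preference = Preference.TIE
    return PairwiseVerdict(
        preference, f"A scored {a.score}, B scored {b.score}"
    )
```

For example, `compare_pointwise(PointwiseVerdict(4, "thorough"), PointwiseVerdict(2, "thin"))` yields a verdict whose `preference` is `Preference.A`.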
References
1. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., EMNLP 2023)
2. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., NeurIPS 2023)