6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
## Changelog

### v0.4.0 (January 13, 2026)
- Add a GPT custom judge (PR #5)
- Update documentation
- Minor bug fixes in deep research rubrics and judges
- Update README

### v0.3.0 (December 20, 2025)
- Add more rubrics (PR #3)
- Update documentation for new rubrics
11 changes: 6 additions & 5 deletions README.md
@@ -75,7 +75,7 @@ judge.from_pretrained(
)

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
```
@@ -87,7 +87,9 @@ Judges within YESciEval are defined as follows:
| `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
| `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
| `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
| `CustomAutoJudge`| Custom LLM (open-source LLMs) that can be used as a judge within YESciEval rubrics |
| `CustomAutoJudge` | Custom LLM that can be used as a judge within YESciEval rubrics |
| `GPTCustomAutoJudge` | Custom GPT-based LLM that can be used as a judge within YESciEval |


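For instance, a minimal usage sketch for one of the judges above (assuming a rubric object has already been created as in the quickstart, and substituting your own Hugging Face token):

```python
from yescieval import BioASQAutoJudge

# Load the biomedical judge and score a previously constructed rubric
judge = BioASQAutoJudge()
judge.from_pretrained(token="your_huggingface_token", device="cpu")

result = judge.judge(rubric=rubric)
print(result)
```
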
A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following example shows how to import rubrics in your code:

@@ -96,9 +98,8 @@ from yescieval import Informativeness, Correctness, Completeness, Coherence, Rel
Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators

StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators
```

A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
33 changes: 31 additions & 2 deletions docs/source/judges.rst
@@ -48,7 +48,7 @@ The following example demonstrates how to create an evaluation rubric, load a ju
device="cpu")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print("Raw Evaluation Output:")
print(result)
@@ -84,8 +84,37 @@ For example, you can load a model and evaluate a rubric like this:
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

# Evaluate the rubric using the loaded model
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print(result)

This approach allows full control over which model is used for evaluation, supporting any causal language model available on Hugging Face.
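
For completeness, a minimal end-to-end sketch of the same flow (assuming ``CustomAutoJudge`` is imported from ``yescieval`` and the rubric has been created as shown earlier):

.. code-block:: python

    from yescieval import CustomAutoJudge

    # Load any causal language model from Hugging Face as a judge
    judge = CustomAutoJudge()
    judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

    # Evaluate the rubric and print the raw output
    result = judge.judge(rubric=rubric)
    print(result)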

GPT Custom Judge
--------------------

The ``GPTCustomAutoJudge`` class provides a generic, flexible interface to evaluate scientific syntheses using OpenAI GPT models.

You can use it to evaluate a rubric by providing your OpenAI API key and specifying the model ID:

.. code-block:: python

    from yescieval import GPTCustomAutoJudge

    # Initialize and load a GPT judge by specifying its OpenAI model ID
    judge = GPTCustomAutoJudge()
    judge.from_pretrained(model_id="gpt-5.2", token=OPEN_AI_API_KEY)

# Evaluate the rubric using the loaded model
result = judge.judge(rubric=rubric)

print(result.model_dump())

As a result, the output will be in the following format:

.. code-block:: json

{
"rating": rating-value,
"rationale": "rationale-text"
}

This allows you to leverage the capabilities of OpenAI's GPT models for scientific text evaluation.
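
The returned object is a ``RubricLikertScale`` instance, so its fields can also be read directly; a minimal sketch, assuming the ``result`` from the example above:

.. code-block:: python

    # Direct attribute access instead of model_dump()
    print(result.rating)     # integer rating from 1 to 5
    print(result.rationale)  # textual explanation for the rating
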
30 changes: 27 additions & 3 deletions docs/source/quickstart.rst
@@ -35,7 +35,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
judge.from_pretrained(token="your_huggingface_token", device="cpu")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print("Raw Evaluation Output:")
print(result)
@@ -62,7 +62,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)

@@ -81,7 +81,7 @@ If the model outputs unstructured or loosely structured text, you can use GPTPar
parsed = parser.parse(raw_output=raw_output)

print("Parsed Output:")
print(parsed)
print(parsed.model_dump())

**Expected Output Format**

@@ -92,6 +92,30 @@
"rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
}

The output schema is shown below. If you prefer not to use ``.model_dump()``, you can access the fields directly, e.g. ``result.rating`` for the rating value and ``result.rationale`` for the textual explanation of the rating.

.. code-block:: python

{
'properties': {
'rating': {
'description': 'Rating from 1 to 5',
'maximum': 5,
'minimum': 1,
'title': 'Rating',
'type': 'integer'
},
'rationale': {
'description': 'Textual explanation for the rating',
'title': 'Rationale',
'type': 'string'
}
},
'required': ['rating', 'rationale'],
'title': 'RubricLikertScale',
'type': 'object'
}

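For example, a minimal sketch of this direct access, assuming the ``parsed`` object returned by ``GPTParser`` above follows the ``RubricLikertScale`` schema:

.. code-block:: python

    # Access the structured fields without calling model_dump()
    rating = parsed.rating        # integer between 1 and 5
    rationale = parsed.rationale  # textual explanation for the rating
    print(rating, rationale)
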
.. hint:: Key Components

+------------------+-------------------------------------------------------+
1 change: 1 addition & 0 deletions docs/source/rubrics.rst
@@ -188,3 +188,4 @@ And to use rubrics:
instruction = rubric.instruct()

print(instruction)
print(rubric.name)
2 changes: 1 addition & 1 deletion yescieval/VERSION
@@ -1 +1 @@
0.3.0
0.4.0
2 changes: 1 addition & 1 deletion yescieval/__init__.py
@@ -9,6 +9,6 @@
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
SpeculativeStatements, NoveltyIndicators)
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge, GPTCustomAutoJudge
from .parser import GPTParser

4 changes: 2 additions & 2 deletions yescieval/base/judge.py
@@ -1,14 +1,14 @@
from abc import ABC
from typing import Dict, Any
from . import Parser, Rubric
from . import Rubric, RubricLikertScale


class Judge(ABC):

def from_pretrained(self, model_id:str, device: str="auto", token:str =""):
self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)

def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]:
def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]] | str | RubricLikertScale:
pass

def _from_pretrained(self, model_id: str, device: str = "auto", token: str = "") -> [Any, Any]:
1 change: 1 addition & 0 deletions yescieval/base/rubric.py
@@ -10,6 +10,7 @@ class Rubric(BaseModel, ABC):
Subclasses must implement `verbalize`.
"""
system_prompt_template: str
name: str = "Rubric"
papers: Dict[str, str]
question: str
answer: str
6 changes: 4 additions & 2 deletions yescieval/judge/__init__.py
@@ -1,8 +1,10 @@
from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge
from .custom import CustomAutoJudge, GPTCustomAutoJudge

__all__ = [
"AutoJudge",
"AskAutoJudge",
"BioASQAutoJudge",
"CustomAutoJudge"
"CustomAutoJudge",
"GPTCustomAutoJudge"
]
97 changes: 97 additions & 0 deletions yescieval/judge/custom.py
@@ -0,0 +1,97 @@
from ..base import Judge, Rubric, RubricLikertScale
from .judges import AutoJudge

import time
from typing import Dict, List
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import logging

logger = logging.getLogger(__name__)

class CustomAutoJudge(AutoJudge):

def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
tokenizer = AutoTokenizer.from_pretrained(model_id,
padding_side="left",
token=token)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float32,
device_map=device,
token=token
)
return model, tokenizer


class GPTCustomAutoJudge(Judge):

def from_pretrained(self, model_id: str, device: str = "auto", token: str = ""):
if not token:
raise ValueError("OpenAI API token must be provided.")
self.model_name = model_id
self.client = OpenAI(api_key=token)

def _supports_function_calling(self) -> bool:
gpt_4_prefixes = (
"gpt-4", # gpt4 family including gpt-4o, gpt-4o-mini, gpt-4.1, ...
"GPT-3.5", # gpt-3.5 family
)
return any(self.model_name.startswith(prefix) for prefix in gpt_4_prefixes)

def _output_schema(self) -> List[Dict]:
return [
{
"name": "response_format",
"description": f"Return the `rating` and `rationale` only as a response.",
"parameters": {
"type": "object",
"properties": {
'rating': {
"type": "number",
"description": "A numerical rating assigned to the characteristic.",
"minimum": 1,
"maximum": 5
},
"rationale": {
"type": "string",
"description": "The explanation for the assigned rating."
},
},
"required": ["rating", "rationale"]
}
}
]

def judge(self, rubric: Rubric, max_new_tokens: int = 150) -> RubricLikertScale:
if not self.client:
raise ValueError("Model not initialized.")
messages = rubric.instruct()
params = {
"model": self.model_name,
"messages": messages
}
if self._supports_function_calling():
params["functions"] = self._output_schema()

try_counter = 0
while True:
try:
try_counter += 1
response = self.client.chat.completions.create(**params)
message = response.choices[0].message
if self._supports_function_calling():
parsed_output = eval(message.function_call.arguments)
else:
parsed_output = eval(message.content)[rubric.name]
evaluation = RubricLikertScale(rating=parsed_output['rating'], rationale=parsed_output['rationale'])
return evaluation

except Exception as e:
                logger.error(f"Attempt {try_counter} failed.")
                logger.warning(f"API call failed, retrying in 5 seconds: {e}")
time.sleep(5)


21 changes: 3 additions & 18 deletions yescieval/judge/judges.py
@@ -4,7 +4,9 @@
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
import torch
import logging

logger = logging.getLogger(__name__)


class AutoJudge(Judge):
@@ -25,7 +27,7 @@ def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
model = PeftModel.from_pretrained(base_model, model_id)
return model, tokenizer

def evaluate(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]:
def judge(self, rubric: Rubric, max_new_tokens: int=150) -> str:
inputs = self.tokenizer.apply_chat_template(rubric.instruct(),
add_generation_prompt=True,
return_dict=True,
@@ -49,20 +51,3 @@ def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1
device: str = "auto",
token: str = ""):
self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)



class CustomAutoJudge(AutoJudge):

def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
tokenizer = AutoTokenizer.from_pretrained(model_id,
padding_side="left",
token=token)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float32,
device_map=device,
token=token
)
return model, tokenizer