6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
## Changelog

### v0.4.0 (January 13, 2026)
- Add a GPT custom judge (PR #5)
- Update documentation
- Minor bug fixes in deep research rubrics and judges
- Update README

### v0.3.0 (December 20, 2025)
- Add more rubrics (PR #3)
- Update documentation for new rubrics
11 changes: 6 additions & 5 deletions README.md
@@ -75,7 +75,7 @@ judge.from_pretrained(
)

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
```
@@ -87,7 +87,9 @@ Judges within YESciEval are defined as follows:
| `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
| `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
| `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
| `CustomAutoJudge`| Custom LLM (open-source LLMs) that can be used as a judge within YESciEval rubrics |
| `CustomAutoJudge` | Custom LLM that can be used as a judge within YESciEval rubrics |
| `GPTCustomAutoJudge` | Custom GPT-based LLM that can be used as a judge within YESciEval |


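For instance, a minimal usage sketch for one of the judges above (assuming a rubric object has already been created as in the quickstart, and substituting your own Hugging Face token):

```python
from yescieval import BioASQAutoJudge

# Load the biomedical judge and score a previously constructed rubric
judge = BioASQAutoJudge()
judge.from_pretrained(token="your_huggingface_token", device="cpu")

result = judge.judge(rubric=rubric)
print(result)
```
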
A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following example shows how to import rubrics in your code:

@@ -96,9 +98,8 @@ from yescieval import Informativeness, Correctness, Completeness, Coherence, Rel
Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators

StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators
```

A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
33 changes: 31 additions & 2 deletions docs/source/judges.rst
@@ -48,7 +48,7 @@ The following example demonstrates how to create an evaluation rubric, load a ju
device="cpu")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print("Raw Evaluation Output:")
print(result)
@@ -84,8 +84,37 @@ For example, you can load a model and evaluate a rubric like this:
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

# Evaluate the rubric using the loaded model
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print(result)

This approach allows full control over which model is used for evaluation, supporting any causal language model available on Hugging Face.
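
For completeness, a minimal end-to-end sketch of the same flow (assuming ``CustomAutoJudge`` is imported from ``yescieval`` and the rubric has been created as shown earlier):

.. code-block:: python

    from yescieval import CustomAutoJudge

    # Load any causal language model from Hugging Face as a judge
    judge = CustomAutoJudge()
    judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

    # Evaluate the rubric and print the raw output
    result = judge.judge(rubric=rubric)
    print(result)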

GPT Custom Judge
--------------------

The ``GPTCustomAutoJudge`` class provides a generic, flexible interface to evaluate scientific syntheses using OpenAI GPT models.

You can use it to evaluate a rubric by providing your OpenAI API key and specifying the model ID:

.. code-block:: python

    from yescieval import GPTCustomAutoJudge

    # Initialize and load a GPT judge by specifying its OpenAI model ID
    judge = GPTCustomAutoJudge()
    judge.from_pretrained(model_id="gpt-5.2", token=OPEN_AI_API_KEY)

# Evaluate the rubric using the loaded model
result = judge.judge(rubric=rubric)

print(result.model_dump())

As a result, the output will be in the following format:

.. code-block:: json

{
"rating": rating-value,
"rationale": "rationale-text"
}

This allows you to leverage the capabilities of OpenAI's GPT models for scientific text evaluation.
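
The returned object is a ``RubricLikertScale`` instance, so its fields can also be read directly; a minimal sketch, assuming the ``result`` from the example above:

.. code-block:: python

    # Direct attribute access instead of model_dump()
    print(result.rating)     # integer rating from 1 to 5
    print(result.rationale)  # textual explanation for the rating
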
30 changes: 27 additions & 3 deletions docs/source/quickstart.rst
@@ -35,7 +35,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
judge.from_pretrained(token="your_huggingface_token", device="cpu")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)

print("Raw Evaluation Output:")
print(result)
@@ -62,7 +62,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)

@@ -81,7 +81,7 @@ If the model outputs unstructured or loosely structured text, you can use GPTPar
parsed = parser.parse(raw_output=raw_output)

print("Parsed Output:")
print(parsed)
print(parsed.model_dump())

**Expected Output Format**

@@ -92,6 +92,30 @@
"rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
}

The output schema is shown below. If you prefer not to use ``.model_dump()``, you can access the fields directly, e.g. ``result.rating`` for the rating value and ``result.rationale`` for the textual explanation of the rating.

.. code-block:: python

{
'properties': {
'rating': {
'description': 'Rating from 1 to 5',
'maximum': 5,
'minimum': 1,
'title': 'Rating',
'type': 'integer'
},
'rationale': {
'description': 'Textual explanation for the rating',
'title': 'Rationale',
'type': 'string'
}
},
'required': ['rating', 'rationale'],
'title': 'RubricLikertScale',
'type': 'object'
}

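For example, a minimal sketch of this direct access, assuming the ``parsed`` object returned by ``GPTParser`` above follows the ``RubricLikertScale`` schema:

.. code-block:: python

    # Access the structured fields without calling model_dump()
    rating = parsed.rating        # integer between 1 and 5
    rationale = parsed.rationale  # textual explanation for the rating
    print(rating, rationale)
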
.. hint:: Key Components

+------------------+-------------------------------------------------------+
1 change: 1 addition & 0 deletions docs/source/rubrics.rst
@@ -188,3 +188,4 @@ And to use rubrics:
instruction = rubric.instruct()

print(instruction)
print(rubric.name)
2 changes: 1 addition & 1 deletion yescieval/VERSION
@@ -1 +1 @@
0.3.0
0.4.0
2 changes: 1 addition & 1 deletion yescieval/__init__.py
@@ -9,6 +9,6 @@
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
SpeculativeStatements, NoveltyIndicators)
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge, GPTCustomAutoJudge
from .parser import GPTParser

4 changes: 2 additions & 2 deletions yescieval/base/judge.py
@@ -1,14 +1,14 @@
from abc import ABC
from typing import Dict, Any
from . import Parser, Rubric
from . import Rubric, RubricLikertScale


class Judge(ABC):

def from_pretrained(self, model_id:str, device: str="auto", token:str =""):
self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)

def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]:
def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]] | str | RubricLikertScale:
pass

def _from_pretrained(self, model_id: str, device: str = "auto", token: str = "") -> [Any, Any]:
1 change: 1 addition & 0 deletions yescieval/base/rubric.py
@@ -10,6 +10,7 @@ class Rubric(BaseModel, ABC):
Subclasses must implement `verbalize`.
"""
system_prompt_template: str
name: str = "Rubric"
papers: Dict[str, str]
question: str
answer: str
6 changes: 4 additions & 2 deletions yescieval/judge/__init__.py
@@ -1,8 +1,10 @@
from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge
from .custom import CustomAutoJudge, GPTCustomAutoJudge

__all__ = [
"AutoJudge",
"AskAutoJudge",
"BioASQAutoJudge",
"CustomAutoJudge"
"CustomAutoJudge",
"GPTCustomAutoJudge"
]
97 changes: 97 additions & 0 deletions yescieval/judge/custom.py
@@ -0,0 +1,97 @@
from ..base import Judge, Rubric, RubricLikertScale
from .judges import AutoJudge

import time
from typing import Dict, List
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import logging

logger = logging.getLogger(__name__)

class CustomAutoJudge(AutoJudge):

def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
tokenizer = AutoTokenizer.from_pretrained(model_id,
padding_side="left",
token=token)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float32,
device_map=device,
token=token
)
return model, tokenizer


class GPTCustomAutoJudge(Judge):

def from_pretrained(self, model_id: str, device: str = "auto", token: str = ""):
if not token:
raise ValueError("OpenAI API token must be provided.")
self.model_name = model_id
self.client = OpenAI(api_key=token)

def _supports_function_calling(self) -> bool:
gpt_4_prefixes = (
"gpt-4", # gpt4 family including gpt-4o, gpt-4o-mini, gpt-4.1, ...
"GPT-3.5", # gpt-3.5 family
)
return any(self.model_name.startswith(prefix) for prefix in gpt_4_prefixes)

def _output_schema(self) -> List[Dict]:
return [
{
"name": "response_format",
"description": f"Return the `rating` and `rationale` only as a response.",
"parameters": {
"type": "object",
"properties": {
'rating': {
"type": "number",
"description": "A numerical rating assigned to the characteristic.",
"minimum": 1,
"maximum": 5
},
"rationale": {
"type": "string",
"description": "The explanation for the assigned rating."
},
},
"required": ["rating", "rationale"]
}
}
]

def judge(self, rubric: Rubric, max_new_tokens: int = 150) -> RubricLikertScale:
if not self.client:
raise ValueError("Model not initialized.")
messages = rubric.instruct()
params = {
"model": self.model_name,
"messages": messages
}
if self._supports_function_calling():
params["functions"] = self._output_schema()

try_counter = 0
while True:
try:
try_counter += 1
response = self.client.chat.completions.create(**params)
message = response.choices[0].message
if self._supports_function_calling():
parsed_output = eval(message.function_call.arguments)
else:
parsed_output = eval(message.content)[rubric.name]
evaluation = RubricLikertScale(rating=parsed_output['rating'], rationale=parsed_output['rationale'])
return evaluation

except Exception as e:
                logger.error(f"Attempt {try_counter} failed.")
                logger.warning(f"API call failed, retrying in 5 seconds: {e}")
time.sleep(5)


21 changes: 3 additions & 18 deletions yescieval/judge/judges.py
@@ -4,7 +4,9 @@
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
import torch
import logging

logger = logging.getLogger(__name__)


class AutoJudge(Judge):
@@ -25,7 +27,7 @@ def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
model = PeftModel.from_pretrained(base_model, model_id)
return model, tokenizer

def evaluate(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]:
def judge(self, rubric: Rubric, max_new_tokens: int=150) -> str:
inputs = self.tokenizer.apply_chat_template(rubric.instruct(),
add_generation_prompt=True,
return_dict=True,
@@ -49,20 +51,3 @@ def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1
device: str = "auto",
token: str = ""):
self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)



class CustomAutoJudge(AutoJudge):

def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
tokenizer = AutoTokenizer.from_pretrained(model_id,
padding_side="left",
token=token)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float32,
device_map=device,
token=token
)
return model, tokenizer