diff --git a/CHANGELOG.md b/CHANGELOG.md
index 262c769..1bb7366 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,11 @@
 ## Changelog
 
+### v0.4.0 (January 13, 2026)
+- Add a GPT custom judge (PR #5)
+- Update documentation
+- Minor bug fixes in deep research rubrics and judges
+- Update README
+
 ### v0.3.0 (December 20, 2025)
 - Add more rubrics (PR #3)
 - Update documentation for new rubrics
diff --git a/README.md b/README.md
index faba342..f00a219 100644
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ judge.from_pretrained(
 )
 
 # Step 3: Evaluate the answer
-result = judge.evaluate(rubric=rubric)
+result = judge.judge(rubric=rubric)
 
 print("Raw Evaluation Output:")
 print(result)
 ```
@@ -87,7 +87,9 @@ Judges within YESciEval are defined as follows:
 | `AutoJudge`      | Base class for loading and running evaluation models with PEFT adapters. |
 | `AskAutoJudge`   | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
 | `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge.   |
-| `CustomAutoJudge`| Custom LLM (open-source LLMs) that can be used as a judge within YESciEval rubrics |
+| `CustomAutoJudge`| Custom LLM that can be used as a judge within YESciEval rubrics |
+| `GPTCustomAutoJudge`| Custom GPT-based LLM that can be used as a judge within YESciEval |
+
 
 A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
 
@@ -96,9 +98,8 @@ from yescieval import Informativeness, Correctness, Completeness, Coherence, Rel
                        Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
                        InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
                        MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
-                       StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
-                       SpeculativeStatements, NoveltyIndicators
-
+                       StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
+                       SpeculativeStatements, NoveltyIndicators
 ```
 
 A complete list of rubrics are available at YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
diff --git a/docs/source/judges.rst b/docs/source/judges.rst
index 0eb37af..14621a3 100644
--- a/docs/source/judges.rst
+++ b/docs/source/judges.rst
@@ -48,7 +48,7 @@ The following example demonstrates how to create an evaluation rubric, load a ju
                          device="cpu")
 
     # Step 3: Evaluate the answer
-    result = judge.evaluate(rubric=rubric)
+    result = judge.judge(rubric=rubric)
 
     print("Raw Evaluation Output:")
     print(result)
@@ -84,8 +84,37 @@ For example, you can load a model and evaluate a rubric like this:
     judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
 
     # Evaluate the rubric using the loaded model
-    result = judge.evaluate(rubric=rubric)
+    result = judge.judge(rubric=rubric)
 
     print(result)
 
 This approach allows full control over which model is used for evaluation, supporting any LLM..
+
+GPT Custom Judge
+--------------------
+
+The `GPTCustomAutoJudge` class provides a generic, flexible interface to evaluate scientific syntheses using OpenAI GPT models.
+
+You can use it to evaluate a rubric by providing your OpenAI API key and specifying the model ID:
+
+.. 
code-block:: python
+
+    # Initialize the GPT judge and load an OpenAI model by specifying its model ID
+    judge = GPTCustomAutoJudge()
+    judge.from_pretrained(model_id="gpt-5.2", token=OPEN_AI_API_KEY)
+
+    # Evaluate the rubric using the loaded model
+    result = judge.judge(rubric=rubric)
+
+    print(result.model_dump())
+
+As a result, the output will be in the following format:
+
+.. code-block:: json
+
+    {
+      "rating": rating-value,
+      "rationale": "rationale-text"
+    }
+
+This allows you to leverage the capabilities of OpenAI's GPT models for scientific text evaluation.
\ No newline at end of file
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index fdbb3e3..8f7fbdc 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -35,7 +35,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
     judge.from_pretrained(token="your_huggingface_token", device="cpu")
 
     # Step 3: Evaluate the answer
-    result = judge.evaluate(rubric=rubric)
+    result = judge.judge(rubric=rubric)
 
     print("Raw Evaluation Output:")
     print(result)
@@ -62,7 +62,7 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
     judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
 
     # Step 3: Evaluate the answer
-    result = judge.evaluate(rubric=rubric)
+    result = judge.judge(rubric=rubric)
 
     print("Raw Evaluation Output:")
     print(result)
@@ -81,7 +81,7 @@ If the model outputs unstructured or loosely structured text, you can use GPTPar
     parsed = parser.parse(raw_output=raw_output)
 
     print("Parsed Output:")
-    print(parsed)
+    print(parsed.model_dump())
 
 **Expected Output Format**
 
@@ -92,6 +92,30 @@ If the model outputs unstructured or loosely structured text, you can use GPTPar
       "rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
     }
 
+The output schema is shown below. If you prefer not to use ``.model_dump()``, you can access the fields directly, e.g. ``result.rating`` for the rating value or ``result.rationale`` for the textual explanation of the rating.
+
+.. code-block::
+
+    {
+        'properties': {
+            'rating': {
+                'description': 'Rating from 1 to 5',
+                'maximum': 5,
+                'minimum': 1,
+                'title': 'Rating',
+                'type': 'integer'
+            },
+            'rationale': {
+                'description': 'Textual explanation for the rating',
+                'title': 'Rationale',
+                'type': 'string'
+            }
+        },
+        'required': ['rating', 'rationale'],
+        'title': 'RubricLikertScale',
+        'type': 'object'
+    }
+
 .. 
hint:: Key Components +------------------+-------------------------------------------------------+ diff --git a/docs/source/rubrics.rst b/docs/source/rubrics.rst index e264f21..b38498c 100644 --- a/docs/source/rubrics.rst +++ b/docs/source/rubrics.rst @@ -188,3 +188,4 @@ And to use rubrics: instruction = rubric.instruct() print(instruction) + print(rubric.name) diff --git a/yescieval/VERSION b/yescieval/VERSION index 9325c3c..60a2d3e 100644 --- a/yescieval/VERSION +++ b/yescieval/VERSION @@ -1 +1 @@ -0.3.0 \ No newline at end of file +0.4.0 \ No newline at end of file diff --git a/yescieval/__init__.py b/yescieval/__init__.py index f8a37bb..28c7095 100644 --- a/yescieval/__init__.py +++ b/yescieval/__init__.py @@ -9,6 +9,6 @@ MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, SpeculativeStatements, NoveltyIndicators) -from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge +from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge, GPTCustomAutoJudge from .parser import GPTParser diff --git a/yescieval/base/judge.py b/yescieval/base/judge.py index 5ef75ed..3178d2b 100644 --- a/yescieval/base/judge.py +++ b/yescieval/base/judge.py @@ -1,6 +1,6 @@ from abc import ABC from typing import Dict, Any -from . import Parser, Rubric +from . import Rubric, RubricLikertScale class Judge(ABC): @@ -8,7 +8,7 @@ class Judge(ABC): def from_pretrained(self, model_id:str, device: str="auto", token:str =""): self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token) - def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]: + def judge(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]] | str | RubricLikertScale: pass def _from_pretrained(self, model_id: str, device: str = "auto", token: str = "") -> [Any, Any]: diff --git a/yescieval/base/rubric.py b/yescieval/base/rubric.py index 64c37e7..6ca06fe 100644 --- a/yescieval/base/rubric.py +++ b/yescieval/base/rubric.py @@ -10,6 +10,7 @@ class Rubric(BaseModel, ABC): Subclasses must implement `verbalize`. 
""" system_prompt_template: str + name: str = "Rubric" papers: Dict[str, str] question: str answer: str diff --git a/yescieval/judge/__init__.py b/yescieval/judge/__init__.py index a3fe787..d0d69e3 100644 --- a/yescieval/judge/__init__.py +++ b/yescieval/judge/__init__.py @@ -1,8 +1,10 @@ -from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge +from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge +from .custom import CustomAutoJudge, GPTCustomAutoJudge __all__ = [ "AutoJudge", "AskAutoJudge", "BioASQAutoJudge", - "CustomAutoJudge" + "CustomAutoJudge", + "GPTCustomAutoJudge" ] \ No newline at end of file diff --git a/yescieval/judge/custom.py b/yescieval/judge/custom.py new file mode 100644 index 0000000..c44d1cf --- /dev/null +++ b/yescieval/judge/custom.py @@ -0,0 +1,97 @@ +from ..base import Judge, Rubric, RubricLikertScale +from .judges import AutoJudge + +import time +from typing import Dict, List +from openai import OpenAI +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch +import logging + +logger = logging.getLogger(__name__) + +class CustomAutoJudge(AutoJudge): + + def _from_pretrained(self, model_id:str, device:str="auto", token:str =""): + tokenizer = AutoTokenizer.from_pretrained(model_id, + padding_side="left", + token=token) + tokenizer.pad_token = tokenizer.eos_token + model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=torch.float32, + device_map=device, + token=token + ) + return model, tokenizer + + +class GPTCustomAutoJudge(Judge): + + def from_pretrained(self, model_id: str, device: str = "auto", token: str = ""): + if not token: + raise ValueError("OpenAI API token must be provided.") + self.model_name = model_id + self.client = OpenAI(api_key=token) + + def _supports_function_calling(self) -> bool: + gpt_4_prefixes = ( + "gpt-4", # gpt4 family including gpt-4o, gpt-4o-mini, gpt-4.1, ... + "GPT-3.5", # gpt-3.5 family + ) + return any(self.model_name.startswith(prefix) for prefix in gpt_4_prefixes) + + def _output_schema(self) -> List[Dict]: + return [ + { + "name": "response_format", + "description": f"Return the `rating` and `rationale` only as a response.", + "parameters": { + "type": "object", + "properties": { + 'rating': { + "type": "number", + "description": "A numerical rating assigned to the characteristic.", + "minimum": 1, + "maximum": 5 + }, + "rationale": { + "type": "string", + "description": "The explanation for the assigned rating." 
+                        },
+                    },
+                    "required": ["rating", "rationale"]
+                }
+            }
+        ]
+
+    def judge(self, rubric: Rubric, max_new_tokens: int = 150) -> RubricLikertScale:
+        if getattr(self, "client", None) is None:
+            raise ValueError("Model not initialized.")
+        messages = rubric.instruct()
+        params = {
+            "model": self.model_name,
+            "messages": messages
+        }
+        if self._supports_function_calling():
+            params["functions"] = self._output_schema()
+
+        try_counter = 0
+        while True:
+            try:
+                try_counter += 1
+                response = self.client.chat.completions.create(**params)
+                message = response.choices[0].message
+                if self._supports_function_calling():
+                    # Function-call arguments arrive as a JSON string.
+                    parsed_output = json.loads(message.function_call.arguments)
+                else:
+                    # Plain completions are expected to return a JSON object keyed by the rubric name.
+                    parsed_output = json.loads(message.content)[rubric.name]
+                evaluation = RubricLikertScale(rating=parsed_output['rating'], rationale=parsed_output['rationale'])
+                return evaluation
+
+            except Exception as e:
+                logger.error(f"Attempt {try_counter} failed!")
+                logger.warning(f"API call failed, retrying in 5 seconds: {e}")
+                time.sleep(5)
+
+
diff --git a/yescieval/judge/judges.py b/yescieval/judge/judges.py
index 00c736d..c700b28 100644
--- a/yescieval/judge/judges.py
+++ b/yescieval/judge/judges.py
@@ -4,7 +4,9 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 from peft import PeftModel, PeftConfig
 import torch
+import logging
 
+logger = logging.getLogger(__name__)
 
 class AutoJudge(Judge):
@@ -25,7 +27,7 @@ def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
         model = PeftModel.from_pretrained(base_model, model_id)
         return model, tokenizer
 
-    def evaluate(self, rubric: Rubric, max_new_tokens: int=150) -> Dict[str, Dict[str, str]]:
+    def judge(self, rubric: Rubric, max_new_tokens: int=150) -> str:
         inputs = self.tokenizer.apply_chat_template(rubric.instruct(),
                                                     add_generation_prompt=True,
                                                     return_dict=True,
@@ -49,20 +51,3 @@ def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1
                         device: str = "auto", token: str = ""):
         self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)
-
-
-
-class CustomAutoJudge(AutoJudge):
-
-    def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
-        tokenizer = AutoTokenizer.from_pretrained(model_id,
-                                                  padding_side="left",
-                                                  token=token)
-        tokenizer.pad_token = tokenizer.eos_token
-        model = AutoModelForCausalLM.from_pretrained(
-            model_id,
-            torch_dtype=torch.float32,
-            device_map=device,
-            token=token
-        )
-        return model, tokenizer
diff --git a/yescieval/rubric/breadth.py b/yescieval/rubric/breadth.py
index efaf0bc..dfcc99e 100644
--- a/yescieval/rubric/breadth.py
+++ b/yescieval/rubric/breadth.py
@@ -22,7 +22,7 @@
 
-1. geographic_coverage: is the information in the answer a correct representation of the spatial scope of the provided abstracts?
+1. Geographic Coverage: is the information in the answer a correct representation of the spatial scope of the provided abstracts?
 
@@ -42,7 +42,7 @@
 
 {
- "geographic_coverage": {"rating": "4", "rationale": "The synthesis accurately represents multiple regions and scales from the provided abstracts, with only minor omissions or irrelevant details."}
+ "Geographic Coverage": {"rating": "4", "rationale": "The synthesis accurately represents multiple regions and scales from the provided abstracts, with only minor omissions or irrelevant details."}
 }
 
@@ -51,6 +51,7 @@
 Your evaluation should be based solely on the content of the provided synthesis and abstracts.
 Ensure your rationale is objective and backed by specific examples from the provided material.
""" class GeographicCoverage(Rubric): + name: str = "Geographic Coverage" system_prompt_template: str = geographic_coverage_prompt intervention_diversity_prompt = """ @@ -75,7 +76,7 @@ class GeographicCoverage(Rubric): -1. intervention_diversity: is the answer a comprehensive encapsulation of the relevant information in the provided abstracts, measured by the number of unique management practices? +1. Intervention Diversity: is the answer a comprehensive encapsulation of the relevant information in the provided abstracts, measured by the number of unique management practices? @@ -95,7 +96,7 @@ class GeographicCoverage(Rubric): { - "intervention_diversity": {"rating": "4", "rationale": "The answer includes almost all relevant interventions from the provided abstracts, with only minor details missing."} + "Intervention Diversity": {"rating": "4", "rationale": "The answer includes almost all relevant interventions from the provided abstracts, with only minor details missing."} } @@ -104,6 +105,7 @@ class GeographicCoverage(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class InterventionDiversity(Rubric): + name: str = "Intervention Diversity" system_prompt_template: str = intervention_diversity_prompt biodiversity_dimensions_prompt = """ @@ -128,7 +130,7 @@ class InterventionDiversity(Rubric): -1. biodiversity_dimensions: is the answer a comprehensive representation of the relevant biodiversity information in the provided abstracts, measured by the presence of terms related to taxonomic, functional, phylogenetic, and spatial diversity? +1. Biodiversity Dimensions: is the answer a comprehensive representation of the relevant biodiversity information in the provided abstracts, measured by the presence of terms related to taxonomic, functional, phylogenetic, and spatial diversity? @@ -148,7 +150,7 @@ class InterventionDiversity(Rubric): { - "biodiversity_dimensions": {"rating": "4", "rationale": "Most information is informative for the research question, capturing the key biodiversity dimensions with minor omissions."} + "Biodiversity Dimensions": {"rating": "4", "rationale": "Most information is informative for the research question, capturing the key biodiversity dimensions with minor omissions."} } @@ -157,6 +159,7 @@ class InterventionDiversity(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class BiodiversityDimensions(Rubric): + name: str = "Biodiversity Dimensions" system_prompt_template: str = biodiversity_dimensions_prompt ecosystem_services_prompt = """ @@ -181,7 +184,7 @@ class BiodiversityDimensions(Rubric): -1. ecosystem_services: is the answer a useful and informative reply to the question, measured by the presence of terms matched against a vocabulary aligned with the Millennium Ecosystem Assessment? +1. Ecosystem Services: is the answer a useful and informative reply to the question, measured by the presence of terms matched against a vocabulary aligned with the Millennium Ecosystem Assessment? 
@@ -201,7 +204,7 @@ class BiodiversityDimensions(Rubric): { - "ecosystem_services": {"rating": "4", "rationale": "The synthesis includes nearly all relevant ecosystem services from the provided abstracts, with only minor omissions."} + "Ecosystem Services": {"rating": "4", "rationale": "The synthesis includes nearly all relevant ecosystem services from the provided abstracts, with only minor omissions."} } @@ -210,6 +213,7 @@ class BiodiversityDimensions(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class EcosystemServices(Rubric): + name: str = "Ecosystem Services" system_prompt_template: str = ecosystem_services_prompt spatial_scale_prompt = """ @@ -234,7 +238,7 @@ class EcosystemServices(Rubric): -1. spatial_scale: is the answer a useful and informative reply to the question, measured by the presence of explicit scale terms (e.g., “local,” “regional,” “continental”) and area measures? +1. Spatial Scale: is the answer a useful and informative reply to the question, measured by the presence of explicit scale terms (e.g., “local,” “regional,” “continental”) and area measures? @@ -254,7 +258,7 @@ class EcosystemServices(Rubric): { - "spatial_scale": {"rating": "4", "rationale": "The synthesis includes nearly all relevant spatial scale information from the provided abstracts, with only minor omissions."} + "Spatial Scale": {"rating": "4", "rationale": "The synthesis includes nearly all relevant spatial scale information from the provided abstracts, with only minor omissions."} } @@ -263,6 +267,7 @@ class EcosystemServices(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class SpatialScale(Rubric): + name: str = "Spatial Scale" system_prompt_template: str = spatial_scale_prompt diff --git a/yescieval/rubric/depth.py b/yescieval/rubric/depth.py index 04aeb00..3e12dc3 100644 --- a/yescieval/rubric/depth.py +++ b/yescieval/rubric/depth.py @@ -22,7 +22,7 @@ -1. mechanistic_understanding: does the answer reflect understanding of ecological processes by explicitly mentioning recognized mechanisms such as feedbacks, nutrient cycling, or trophic cascades? +1. Mechanistic Understanding: does the answer reflect understanding of ecological processes by explicitly mentioning recognized mechanisms such as feedbacks, nutrient cycling, or trophic cascades? @@ -41,7 +41,7 @@ { - "mechanistic_understanding": {"rating": "4", "rationale": "The answer explains a clear multi-step ecological mechanism using causal language, but some temporal or boundary details are only briefly addressed."} + "Mechanistic Understanding": {"rating": "4", "rationale": "The answer explains a clear multi-step ecological mechanism using causal language, but some temporal or boundary details are only briefly addressed."} } @@ -50,6 +50,7 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class MechanisticUnderstanding(Rubric): + name: str = "Mechanistic Understanding" system_prompt_template: str = mechanistic_understanding_prompt causal_reasoning_prompt = """ @@ -74,7 +75,7 @@ class MechanisticUnderstanding(Rubric): -1. 
causal_reasoning: does the answer explicitly express cause–effect relationships using causal connectives (e.g., “because,” “due to”), result indicators (e.g., “results in,” “induces”), or mechanistic verbs (e.g., “drives,” “regulates”) when describing ecological processes? +1. Causal Reasoning: does the answer explicitly express cause–effect relationships using causal connectives (e.g., “because,” “due to”), result indicators (e.g., “results in,” “induces”), or mechanistic verbs (e.g., “drives,” “regulates”) when describing ecological processes? @@ -94,7 +95,7 @@ class MechanisticUnderstanding(Rubric): { - "causal_reasoning": {"rating": "4", "rationale": "The answer uses clear causal connectors and describes a multi-step cause–effect relationship."} + "Causal Reasoning": {"rating": "4", "rationale": "The answer uses clear causal connectors and describes a multi-step cause–effect relationship."} } @@ -103,6 +104,7 @@ class MechanisticUnderstanding(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class CausalReasoning(Rubric): + name: str = "Causal Reasoning" system_prompt_template: str = causal_reasoning_prompt temporal_precision_prompt = """ @@ -127,7 +129,7 @@ class CausalReasoning(Rubric): -1. temporal_precision: does the answer include specific and explicit temporal references, such as quantified time intervals or dated events, rather than vague or unspecific timing? +1. Temporal Precision: does the answer include specific and explicit temporal references, such as quantified time intervals or dated events, rather than vague or unspecific timing? @@ -147,7 +149,7 @@ class CausalReasoning(Rubric): { - "temporal_precision": {"rating": "4", "rationale": "The answer includes several specific timeframes or durations that are clearly linked to the described processes, though some timing details could be more precise."} + "Temporal Precision": {"rating": "4", "rationale": "The answer includes several specific timeframes or durations that are clearly linked to the described processes, though some timing details could be more precise."} } @@ -156,5 +158,6 @@ class CausalReasoning(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class TemporalPrecision(Rubric): + name: str = "Temporal Precision" system_prompt_template: str = temporal_precision_prompt diff --git a/yescieval/rubric/gap.py b/yescieval/rubric/gap.py index facdcbd..8c6fa2a 100644 --- a/yescieval/rubric/gap.py +++ b/yescieval/rubric/gap.py @@ -22,7 +22,7 @@ -1. gap_identification: To what extent does the answer explicitly identify research gaps or unanswered questions indicated by the provided abstracts? +1. Gap Identification: To what extent does the answer explicitly identify research gaps or unanswered questions indicated by the provided abstracts? @@ -42,7 +42,7 @@ { - "gap_identification": {"rating": "4", "rationale": "Identifies a relevant gap supported by the abstracts, with limited elaboration."} + "Gap Identification": {"rating": "4", "rationale": "Identifies a relevant gap supported by the abstracts, with limited elaboration."} } @@ -51,4 +51,5 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. 
Ensure your rationale is objective and backed by specific examples from the provided material. """ class GapIdentification(Rubric): + name: str = "Gap Identification" system_prompt_template: str = gap_identification_prompt diff --git a/yescieval/rubric/informativeness.py b/yescieval/rubric/informativeness.py index 9fd6788..bdfb448 100644 --- a/yescieval/rubric/informativeness.py +++ b/yescieval/rubric/informativeness.py @@ -51,6 +51,7 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Correctness(Rubric): + name: str = "Correctness" system_prompt_template: str = correctness_prompt completeness_prompt = """ @@ -104,6 +105,7 @@ class Correctness(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Completeness(Rubric): + name: str = "Completeness" system_prompt_template: str = completeness_prompt informativeness_prompt = """ @@ -157,5 +159,6 @@ class Completeness(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Informativeness(Rubric): + name: str = "Informativeness" system_prompt_template: str = informativeness_prompt diff --git a/yescieval/rubric/innovation.py b/yescieval/rubric/innovation.py index 290405a..628fa5d 100644 --- a/yescieval/rubric/innovation.py +++ b/yescieval/rubric/innovation.py @@ -22,7 +22,7 @@ -1. speculative_statement: Does the answer clearly distinguish speculation (e.g., “might,” “could”) from established findings in the provided abstracts? +1. Speculative Statements: Does the answer clearly distinguish speculation (e.g., “might,” “could”) from established findings in the provided abstracts? @@ -42,7 +42,7 @@ { - "speculative_statement": {"rating": "4", "rationale": "Uses hedging appropriately and clearly distinguishes speculation from established findings."} + "Speculative Statements": {"rating": "4", "rationale": "Uses hedging appropriately and clearly distinguishes speculation from established findings."} } @@ -51,6 +51,7 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class SpeculativeStatements(Rubric): + name: str = "Speculative Statements" system_prompt_template: str = speculative_statements_prompt novelty_indicators_prompt = """ @@ -75,7 +76,7 @@ class SpeculativeStatements(Rubric): -1. novelty_indicators: Does the answer appropriately use self-declared innovation terms (e.g., “novel,” “pioneering,” “emerging”) and clearly indicate whether such claims are supported by the provided abstracts? +1. Novelty Indicators: Does the answer appropriately use self-declared innovation terms (e.g., “novel,” “pioneering,” “emerging”) and clearly indicate whether such claims are supported by the provided abstracts? 
@@ -95,7 +96,7 @@ class SpeculativeStatements(Rubric): { - "novelty_indicators": {"rating": "4", "rationale": "Shows a clear novel angle, but lacks full detail."} + "Novelty Indicators": {"rating": "4", "rationale": "Shows a clear novel angle, but lacks full detail."} } @@ -104,6 +105,7 @@ class SpeculativeStatements(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class NoveltyIndicators(Rubric): + name: str = "Novelty Indicators" system_prompt_template: str = novelty_indicators_prompt diff --git a/yescieval/rubric/rigor.py b/yescieval/rubric/rigor.py index 62c4aaf..cd428a4 100644 --- a/yescieval/rubric/rigor.py +++ b/yescieval/rubric/rigor.py @@ -22,7 +22,7 @@ -1. statistical_sophistication: Does the answer reflect quantitative depth through the use of inferential statistics or analysis methods described in the abstracts? +1. Statistical Sophistication: Does the answer reflect quantitative depth through the use of inferential statistics or analysis methods described in the abstracts? @@ -42,7 +42,7 @@ { - "statistical_sophistication": {"rating": "3", "rationale": "The synthesis provides some methodological details and basic statistics, but does not fully discuss limitations or reproducibility.""} + "Statistical Sophistication": {"rating": "3", "rationale": "The synthesis provides some methodological details and basic statistics, but does not fully discuss limitations or reproducibility.""} } @@ -51,6 +51,7 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class StatisticalSophistication(Rubric): + name: str = "Statistical Sophistication" system_prompt_template: str = statistical_sophistication_prompt citation_practices_prompt = """ @@ -75,7 +76,7 @@ class StatisticalSophistication(Rubric): -1. citation_practices: is the answer supported by appropriate references, using parenthetical or narrative citations, for the relevant information in the provided abstracts? +1. Citation Practices: is the answer supported by appropriate references, using parenthetical or narrative citations, for the relevant information in the provided abstracts? @@ -95,7 +96,7 @@ class StatisticalSophistication(Rubric): { - "citation_practices": {"rating": "3", "rationale": "Some claims are supported with citations, but several important points lack references or use inconsistent citation style."} + "Citation Practices": {"rating": "3", "rationale": "Some claims are supported with citations, but several important points lack references or use inconsistent citation style."} } @@ -104,6 +105,7 @@ class StatisticalSophistication(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class CitationPractices(Rubric): + name: str = "Citation Practices" system_prompt_template: str = citation_practices_prompt uncertainty_acknowledgement_prompt = """ @@ -128,7 +130,7 @@ class CitationPractices(Rubric): -1. uncertainty_acknowledgement: does the answer explicitly discuss limitations, uncertainty, or gaps in evidence (e.g., using terms like “unknown,” “limited evidence,” or “unclear”)? +1. 
Uncertainty Acknowledgement: does the answer explicitly discuss limitations, uncertainty, or gaps in evidence (e.g., using terms like “unknown,” “limited evidence,” or “unclear”)? @@ -148,7 +150,7 @@ class CitationPractices(Rubric): { - "uncertainty_acknowledgement": {"rating": "4", "rationale": "The answer clearly acknowledges key uncertainties and limitations in the study."} + "Uncertainty Acknowledgement": {"rating": "4", "rationale": "The answer clearly acknowledges key uncertainties and limitations in the study."} } @@ -157,5 +159,6 @@ class CitationPractices(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class UncertaintyAcknowledgment(Rubric): + name: str = "Uncertainty Acknowledgement" system_prompt_template: str = uncertainty_acknowledgement_prompt diff --git a/yescieval/rubric/structural.py b/yescieval/rubric/structural.py index a968642..6b83550 100644 --- a/yescieval/rubric/structural.py +++ b/yescieval/rubric/structural.py @@ -51,6 +51,7 @@ Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Coherence(Rubric): + name: str = "Coherence" system_prompt_template: str = coherence_prompt integration_prompt = """ @@ -104,6 +105,7 @@ class Coherence(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Integration(Rubric): + name: str = "Integration" system_prompt_template: str = integration_prompt relevancy_prompt = """ @@ -157,4 +159,5 @@ class Integration(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Relevancy(Rubric): + name: str = "Relevancy" system_prompt_template: str = relevancy_prompt diff --git a/yescieval/rubric/stylistic.py b/yescieval/rubric/stylistic.py index b369fdf..0e92757 100644 --- a/yescieval/rubric/stylistic.py +++ b/yescieval/rubric/stylistic.py @@ -52,6 +52,7 @@ """ class Cohesion(Rubric): + name: str = "Cohesion" system_prompt_template: str = cohesion_prompt @@ -106,6 +107,7 @@ class Cohesion(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Conciseness(Rubric): + name: str = "Conciseness" system_prompt_template: str = conciseness_prompt readability_prompt = """ @@ -159,5 +161,6 @@ class Conciseness(Rubric): Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. """ class Readability(Rubric): + name: str = "Readability" system_prompt_template: str = readability_prompt
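Taken together, the changes above rename the judges' `evaluate()` method to `judge()`, give every rubric a human-readable `name` attribute, and introduce the OpenAI-backed `GPTCustomAutoJudge`. The snippet below is a minimal usage sketch of these additions, not part of the patch itself: the rubric fields (`papers`, `question`, `answer`) come from `yescieval/base/rubric.py`, while the model id `gpt-4o`, the example strings, and the `OPENAI_API_KEY` environment variable are illustrative assumptions.

```python
import os

from yescieval import Correctness, GPTCustomAutoJudge

# Build a rubric from the fields declared on the Rubric base class.
rubric = Correctness(
    papers={"paper_1": "Abstract of the first source paper ..."},
    question="How is AI used in healthcare?",
    answer="AI supports diagnostics and personalized medicine ...",
)
print(rubric.name)  # "Correctness", via the name attribute added in v0.4.0

# GPTCustomAutoJudge calls the OpenAI API instead of loading local weights.
judge = GPTCustomAutoJudge()
judge.from_pretrained(model_id="gpt-4o", token=os.environ["OPENAI_API_KEY"])

# judge() replaces the former evaluate() and returns a RubricLikertScale.
result = judge.judge(rubric=rubric)
print(result.rating, result.rationale)
```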