diff --git a/README.md b/README.md
index 7d5b199..1f1a9e4 100644
--- a/README.md
+++ b/README.md
@@ -91,14 +91,15 @@ Judges within YESciEval are defined as follows:
 | `GPTCustomAutoJudge`| Custom GPT-based LLM that can be used as a judge within YESciEval |
 
-A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
+A total of **22** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following simple example shows how to import rubrics in your code:
 
 ```python
 from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,\
                       Integration, Cohesion, Readability, Conciseness,\
                       MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,\
+                      ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage,\
                       StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,\
-                      SpeculativeStatements, NoveltyIndicators
+                      StateOfTheArtAndNovelty
 ```
 
 A complete list of rubrics are available at YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
diff --git a/docs/source/rubrics.rst b/docs/source/rubrics.rst
index e730359..cbfae3e 100644
--- a/docs/source/rubrics.rst
+++ b/docs/source/rubrics.rst
@@ -2,22 +2,23 @@
 Rubrics
 ===================
 
-A total of **21** evaluation rubrics were defined as part of the YESciEval test framework within two categories presented as following:
+A total of **22** evaluation rubrics were defined as part of the YESciEval test framework within two categories, presented as follows:
 
 .. hint::
 
-  Here is a simple example of how to import rubrics in your code:
+   Here is a simple example of how to import rubrics in your code:
 
-  .. code-block:: python
+   .. code-block:: python
 
-     from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,
-                           Integration, Cohesion, Readability, Conciseness,
-                           MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
-                           StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
-                           SpeculativeStatements, NoveltyIndicators
+      from yescieval import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
+                             Integration, Cohesion, Readability, Conciseness,
+                             MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
+                             ContextCoverage, MethodCoverage, DimensionCoverage, ScopeCoverage, ScaleCoverage,
+                             StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
+                             StateOfTheArtAndNovelty)
 
-  The rubrics are presented as following:
+   The rubrics are presented as follows:
 
 
 Question Answering
@@ -150,7 +151,7 @@ Following ``Research Breadth Assessment`` evaluates the diversity of evidence ac
 
    - Does the answer distribute attention across multiple distinct scales relevant to the research question?
 
 Scientific Rigor Assessment
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodological integrity of the synthesis.
 
@@ -161,13 +162,15 @@ Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodolo
 
    * - Evaluation Rubric
      - Description
-   * - **18. Quantitative Evidence And Uncertainty:**
-     - Does the answer appropriately handle quantitative evidence and uncertainty relevant to the research question?
-   * - **19. Epistemic Calibration:**
-     - Does the answer clearly align claim strength with evidential support by marking uncertainty, assumptions, and limitations where relevant?
+   * - **18. Statistical Sophistication:**
+     - Does the answer use statistical methods or analyses, showing quantitative rigor and depth?
+   * - **19. Citation Practices:**
+     - Does the answer properly cite sources, using parenthetical or narrative citations (e.g., “(Smith et al., 2021)”)?
+   * - **20. Uncertainty Acknowledgment:**
+     - Does the answer explicitly mention limitations or uncertainty, using terms like “unknown,” “limited evidence,” or “unclear”?
 
 Innovation Capacity Assessment
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Following ``Innovation Capacity Assessment`` evaluates the novelty of the synthesis.
 
@@ -178,9 +181,8 @@ Following ``Innovation Capacity Assessment`` evaluates the novelty of the synthe
 
    * - Evaluation Rubric
      - Description
-   * - **20. State-Of-The-Art And Novelty :**
-     - Does the response identify and contextualize relevant state-of-the-art or novel contributions relative to prior work?
-
+   * - **21. State Of The Art And Novelty:**
+     - Does the response identify specific state-of-the-art and/or novel contributions relevant to the research question, using terms like “novel” or “state-of-the-art”?
 
 Research Gap Assessment
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -194,7 +196,7 @@ Following ``Research Gap Assessment`` detects explicit acknowledgment of unanswe
 
    * - Evaluation Rubric
      - Description
-   * - **21. Gap Identification:**
+   * - **22. Gap Identification:**
      - Does the answer point out unanswered questions or understudied areas, using terms like “research gap” or “understudied”?
 
diff --git a/yescieval/__init__.py b/yescieval/__init__.py
index 5cb35f5..c7526b5 100644
--- a/yescieval/__init__.py
+++ b/yescieval/__init__.py
@@ -5,9 +5,10 @@ from .base import Rubric, Parser
 from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
                      Integration, Cohesion, Readability, Conciseness,
+                     ScaleCoverage, ContextCoverage, ScopeCoverage, MethodCoverage, DimensionCoverage,
                      MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
                      StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
-                     SpeculativeStatements, NoveltyIndicators)
+                     StateOfTheArtAndNovelty)
 from .injector import ExampleInjector, VocabularyInjector
 from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge, GPTCustomAutoJudge
 from .parser import GPTParser
diff --git a/yescieval/base/domain.py b/yescieval/base/domain.py
index 679c555..a374619 100644
--- a/yescieval/base/domain.py
+++ b/yescieval/base/domain.py
@@ -1,9 +1,10 @@
 from abc import ABC
 from pydantic import BaseModel
-from typing import Dict
+from typing import Dict, Optional
 
 class Domain(BaseModel, ABC):
     examples: Dict[str, Dict] = None
     vocab: Dict[str, Dict] = None
     ID: str = None
-    verbalized: str = None
\ No newline at end of file
+    verbalized: str = None
+    vocab_block_specs: Optional[Dict[str, Dict[str, object]]] = None
\ No newline at end of file
diff --git a/yescieval/injector/domains/__init__.py b/yescieval/injector/domains/__init__.py
index 610c316..480d9ce 100644
--- a/yescieval/injector/domains/__init__.py
+++ b/yescieval/injector/domains/__init__.py
@@ -8,4 +8,6 @@
 example_responses: Dict[str, Dict] = {domain.ID: domain.examples
                                       for domain in domains}
 
-verbalized_domains: Dict[str, str] = {domain.ID: domain.verbalized for domain in domains}
\ No newline at end of file
+verbalized_domains: Dict[str, str] = {domain.ID: domain.verbalized for domain in domains} + +vocab_block_specs: Dict[str, Dict[str, object]] = {domain.ID: domain.vocab_block_specs for domain in domains} \ No newline at end of file diff --git a/yescieval/injector/domains/ecology.py b/yescieval/injector/domains/ecology.py index 2946f86..2222f04 100644 --- a/yescieval/injector/domains/ecology.py +++ b/yescieval/injector/domains/ecology.py @@ -27,7 +27,7 @@ "genetic diversity", "structural diversity", "shannon", "simpson", "hill numbers" ], "temporal_terms" :[ - "within 2–5 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "1998–2004", + "within 2-5 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "1998-2004", "June 2012", "every 2 weeks" ], "ecosystem_services": [ @@ -62,7 +62,20 @@ "climate change", "global warming", "drought", "heatwave", "extreme weather", "phenology", "range shift", "sea level rise", "ocean acidification", "greenhouse gas", "carbon dioxide", "thermal stress", "precipitation" ], - "complexity_terms": ["nonlinear", "emergent", "synergistic", "interconnected", "complex", "multifaceted"] + "complexity_terms": ["nonlinear", "emergent", "synergistic", "interconnected", "complex", "multifaceted"], + "gap_identification": [ + "remains unclear", "unknown", "not well understood", "limited evidence", "mixed findings", "inconsistent results", "lack of consensus", + "understudied", "data scarce", "few studies", "limited sample size", "short time horizon", "lack of longitudinal data", + "geographic bias", "taxonomic bias", "context dependence", "limited external validity", "missing comparison", "unresolved" + ], + "novelty_indicators": [ + "first to", "novel", "new approach", "new method", "recent advances", "state of the art", "cutting-edge", + "proof-of-concept", "pilot study", "new dataset", "long-term dataset", "high-resolution data", + "remote sensing", "satellite", "LiDAR", "eDNA", "metabarcoding", "new sampling protocol", "new monitoring approach", + "hierarchical model", "Bayesian", "causal inference", "counterfactual", "difference-in-differences", "instrumental variable", + "meta-analysis", "systematic review", "scenario analysis", "climate projection", "compared to previous studies", + "unlike prior work", "addresses a limitation" + ] } example_responses = { @@ -97,11 +110,101 @@ "rationale": "The response uses specific and bounded temporal expressions, for example describing changes occurring within 2-5 years, after 3 months, or every 2 weeks, and referencing defined time periods such as 1998-2004 or June 2012." } ] + }, + "Breadth": { + "ContextCoverage": [ + { + "rating": "1", + "rationale": "The response discusses only a single ecological setting and does not reference any alternative regions or biomes relevant to the research question." + }, + { + "rating": "4", + "rationale": "The response covers multiple distinct ecological contexts, such as different regions and ecosystem types, and distributes attention across them rather than focusing on a single setting." + } + ], + "MethodCoverage": [ + { + "rating": "1", + "rationale": "The response focuses exclusively on a single management or intervention approach (e.g., controlled burning) and does not reference any alternative methods relevant to the research question." 
+            },
+            {
+                "rating": "4",
+                "rationale": "The response discusses multiple distinct interventions or management approaches, such as controlled burning, grazing management, habitat restoration, and protected areas, distributing attention across them rather than focusing on a single method."
+            }
+        ],
+        "DimensionCoverage": [
+            {
+                "rating": "1",
+                "rationale": "The response focuses on a single ecological dimension and does not meaningfully address other relevant dimensions."
+            },
+            {
+                "rating": "4",
+                "rationale": "The response covers multiple ecological dimensions, including taxonomic, functional, and phylogenetic diversity, as well as measures such as species richness, evenness, abundance, and genetic and structural diversity, rather than focusing on a single dimension."
+            }
+        ],
+        "ScopeCoverage": [
+            {
+                "rating": "1",
+                "rationale": "The response addresses only a very narrow aspect of ecological impact and remains vague, providing little indication that the findings apply beyond a single, limited scope."
+            },
+            {
+                "rating": "4",
+                "rationale": "The response discusses several types of ecosystem services, from provisioning and regulating services to supporting and cultural services, rather than focusing on a single, limited scope."
+            }
+        ],
+        "ScaleCoverage": [
+            {
+                "rating": "1",
+                "rationale": "The response focuses on a single ecological scale and does not meaningfully consider how the findings apply at other relevant scales."
+            },
+            {
+                "rating": "4",
+                "rationale": "The response addresses multiple ecological scales, ranging from the individual and population level to community and ecosystem scales, and also considers broader spatial scales such as local, regional, and global contexts, rather than focusing on a single scale."
+            }
+        ]
+    },
+    "Gap": {
+        "GapIdentification": [
+            {
+                "rating": "1",
+                "rationale": "The response is purely descriptive, summarizing existing ecological findings, observations, or reported patterns (e.g., species distributions, biodiversity metrics, or observed correlations) without identifying any missing, unknown, inconsistent, or unresolved aspects relevant to the research question."
+            },
+            {
+                "rating": "4",
+                "rationale": "The response clearly identifies specific gaps or limitations in the ecological evidence base that are relevant to the research question (e.g., missing data for certain regions, taxa, or time periods; limited experimental studies; or conflicting empirical findings) and provides some explanation of why these gaps matter; minor ambiguity or imprecision may remain."
+            }
+        ]
+    },
+    "Innovation": {
+        "StateOfTheArtAndNovelty": [
+            {
+                "rating": "1",
+                "rationale": "The response gives a generic overview of known ecological findings or methods without identifying any specific state-of-the-art approaches or novel contributions; or it uses buzzwords like state of the art or cutting-edge without explaining what is new."
+            },
+            {
+                "rating": "4",
+                "rationale": "The response identifies concrete state-of-the-art or novel ecological contributions (e.g., new datasets, long-term or high-resolution data, remote sensing such as satellite or LiDAR, eDNA/metabarcoding, or new modeling or monitoring approaches) and briefly explains what improvement or new capability they provide, with minor gaps in comparison or detail."
+ } + ] } } +vocab_block_specs = { + "mechanistic_vocab_block": {"label": "Mechanistic terms", "keys": ["mechanistic_terms"]}, + "causal_vocab_block": {"label": "Causal connectives / triggers", "keys": ["causal_terms"]}, + "temporal_vocab_block": {"label": "Temporal expressions", "keys": ["temporal_terms"]}, + "context_coverage_vocab_block": {"label": "Context Coverage", "keys": ["regions"]}, + "method_coverage_vocab_block": {"label": "Method Coverage", "keys": ["interventions"]}, + "dimension_coverage_vocab_block": {"label": "Dimension Coverage", "keys": ["diversity_dimensions"]}, + "scope_coverage_vocab_block": {"label": "Scope Coverage", "keys": ["ecosystem_services"]}, + "scale_coverage_vocab_block": {"label": "Scale Coverage", "keys": ["scale_terms"]}, + "gap_identification_vocab_block": {"label": "Gap Identification", "keys": ["gap_identification"]}, + "novelty_indicators_vocab_block": {"label": "State of the Art and Novelty Indicators", "keys": ["novelty_indicators"]} +} + class Ecology(Domain): examples: Dict[str, Dict] = example_responses vocab: Dict[str, Dict] = vocabulary ID: str = "ecology" - verbalized: str = "Ecology" \ No newline at end of file + verbalized: str = "Ecology" + vocab_block_specs: Dict[str, Dict[str, object]] = vocab_block_specs \ No newline at end of file diff --git a/yescieval/injector/domains/nlp.py b/yescieval/injector/domains/nlp.py index c487def..ae69386 100644 --- a/yescieval/injector/domains/nlp.py +++ b/yescieval/injector/domains/nlp.py @@ -21,7 +21,7 @@ "multilingual", "cross-lingual", "low-resource" ], "temporal_terms" :[ - "within 2–5 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "1998–2004", "June 2012", "every 2 weeks" + "within 2-5 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "1998-2004", "June 2012", "every 2 weeks" ], "eval_metrics": [ "accuracy", "f1", "precision", "recall", "bleu", "chrf", "rouge", "meteor", "bertscore", "perplexity", @@ -73,6 +73,18 @@ ], "safety_terms": [ "bias", "fairness", "toxicity", "privacy", "safety", "data leakage", "red teaming", "harmful content" + ], + "gap_identification": [ + "remains unclear", "unknown", "limited evidence", "mixed results", "understudied", "few studies", "lack of benchmark", "no standard evaluation", + "dataset bias", "annotation bias", "label noise", "generalization gap", "out-of-distribution", "OOD", + "low-resource languages", "domain shift", "not evaluated", "unexplored", "open question", "unresolved" + ], + "novelty_indicators": [ + "state of the art", "SOTA", "new benchmark", "new dataset", "new architecture", "novel architecture", + "training objective", "pretraining", "fine-tuning", "instruction tuning", "RLHF", "DPO", + "retrieval-augmented", "RAG", "agentic", "tool use", "function calling", "multimodal", + "vision-language", "few-shot", "zero-shot", "scaling law", "parameter-efficient", "LoRA", + "distillation", "quantization", "compared to baselines", "outperforms prior work", "improves over", "ablation" ] } @@ -108,11 +120,101 @@ "rationale": "The response includes precise temporal details, such as model behavior observed after 3 months of training, performance changes within 2-5 years of development, or evaluations conducted every 2 weeks, with references to specific time ranges like 1998-2004 or June 2012." } ] + }, + "Breadth": { + "ContextCoverage": [ + { + "rating": "1", + "rationale": "The response focuses entirely on a single NLP task or application setting and does not mention any alternative tasks relevant to the research question." 
+ }, + { + "rating": "4", + "rationale": "The response addresses multiple distinct NLP tasks or application settings and distributes attention across them rather than concentrating on a single task." + } + ], + "MethodCoverage": [ + { + "rating": "1", + "rationale": "The response focuses entirely on a single training or modeling approach (e.g., fine-tuning) and does not mention any alternative methods or settings relevant to the research question." + }, + { + "rating": "4", + "rationale": "The response addresses multiple distinct methods or settings, such as pretraining, fine-tuning, instruction tuning, and reinforcement learning from human feedback, rather than concentrating on a single approach." + } + ], + "DimensionCoverage": [ + { + "rating": "1", + "rationale": "The response relies on a single evaluation dimension and does not indicate consideration of alternative evaluation perspectives." + }, + { + "rating": "4", + "rationale": "The response evaluates performance across multiple dimensions, using metrics such as accuracy, precision, recall, F1, BLEU, ROUGE, and perplexity, providing a more complete assessment rather than relying on a single metric." + } + ], + "ScopeCoverage": [ + { + "rating": "1", + "rationale": "The response is limited to a single, narrowly defined scope and does not indicate that the findings generalize across different linguistic settings or usage scenarios." + }, + { + "rating": "4", + "rationale": "The response covers a wide range of linguistic scopes, including multiple languages such as English, German, French, and Chinese, as well as multilingual, cross-lingual, and low-resource settings, distributing attention across these distinct applicability scopes with only minor omissions." + } + ], + "ScaleCoverage": [ + { + "rating": "1", + "rationale": "The response considers only a single computational scale and does not indicate how the approach behaves under different resource or deployment settings." + }, + { + "rating": "4", + "rationale": "The response discusses multiple computational scales, including model size in terms of parameters and billion-parameter regimes, compute resources such as GPUs and TPUs, and efficiency-related aspects like inference time, latency, throughput, and memory footprint, providing a multi-scale perspective." + } + ] + }, + "Gap": { + "GapIdentification": [ + { + "rating": "1", + "rationale": "The response is purely descriptive, summarizing existing findings or benchmark results (e.g., model architectures, datasets, or reported scores) with no identification of missing, unknown, inconsistent, or unresolved aspects relevant to the research question." + }, + { + "rating": "4", + "rationale": "The response clearly identifies specific gaps or limitations in the evidence base that are relevant to the research question (e.g., missing evaluations, underexplored domains or tasks, lack of ablation studies, limited robustness or generalization analysis, dataset biases, or conflicting benchmark results) and provides some explanation of why these gaps matter; minor ambiguity or imprecision may remain." + } + ] + }, + "Innovation": { + "StateOfTheArtAndNovelty": [ + { + "rating": "1", + "rationale": "The response gives a generic overview of common NLP methods without identifying any specific state-of-the-art systems or novel contributions; or it uses buzzwords like SOTA or state of the art without explaining what is new." 
+ }, + { + "rating": "4", + "rationale": "The response identifies concrete state-of-the-art or novel NLP contributions (e.g., a new dataset or benchmark, a new or modified model architecture, RAG, instruction tuning, RLHF/DPO, multimodal models, or parameter-efficient methods like LoRA) and briefly explains what improvement or new capability they provide, with minor gaps in comparison or detail." + } + ] } } +vocab_block_specs = { + "mechanistic_vocab_block": {"label": "Mechanistic terms", "keys": ["training_terms", "arch_terms", "ablation_terms"]}, + "causal_vocab_block": {"label": "Causal connectives / triggers", "keys": ["causal_terms"]}, + "temporal_vocab_block": {"label": "Temporal expressions", "keys": ["temporal_terms"]}, + "context_coverage_vocab_block": {"label": "Context Coverage", "keys": ["tasks"]}, + "method_coverage_vocab_block": {"label": "Method Coverage", "keys": ["training_terms", "arch_terms"]}, + "dimension_coverage_vocab_block": {"label": "Dimension Coverage", "keys": ["eval_metrics"]}, + "scope_coverage_vocab_block": {"label": "Scope Coverage", "keys": ["languages"]}, + "scale_coverage_vocab_block": {"label": "Scale Coverage", "keys": ["compute_terms"]}, + "gap_identification_vocab_block": {"label": "Gap Identification", "keys": ["gap_identification"]}, + "novelty_indicators_vocab_block": {"label": "State of the Art and Novelty Indicators", "keys": ["novelty_indicators"]} +} + class NLP(Domain): examples: Dict[str, Dict] = example_responses vocab: Dict[str, Dict] = vocabulary ID: str = 'nlp' - verbalized: str = "NLP" \ No newline at end of file + verbalized: str = "NLP" + vocab_block_specs: Dict[str, Dict[str, object]] = vocab_block_specs diff --git a/yescieval/injector/vocab.py b/yescieval/injector/vocab.py index 28057c1..0ff3eb9 100644 --- a/yescieval/injector/vocab.py +++ b/yescieval/injector/vocab.py @@ -1,6 +1,6 @@ from abc import ABC from typing import Dict, List -from .domains import vocabs, verbalized_domains +from .domains import verbalized_domains, vocab_block_specs, vocabs class VocabularyInjector(ABC): """ @@ -11,8 +11,15 @@ class VocabularyInjector(ABC): "{MECHANISTIC_VOCAB}": "mechanistic_vocab_block", "{CAUSAL_VOCAB}": "causal_vocab_block", "{TEMPORAL_VOCAB}": "temporal_vocab_block", + "{CONTEXT_VOCAB}": "context_coverage_vocab_block", + "{METHOD_VOCAB}": "method_coverage_vocab_block", + "{DIMENSION_VOCAB}": "dimension_coverage_vocab_block", + "{SCOPE_VOCAB}": "scope_coverage_vocab_block", + "{SCALE_VOCAB}": "scale_coverage_vocab_block", + "{GAP_IDENTIFICATION_VOCAB}": "gap_identification_vocab_block", + "{NOVELTY_INDICATORS_VOCAB}": "novelty_indicators_vocab_block" } - + def _clean_terms(self, terms) -> List[str]: seen_terms = set() cleaned_terms = [] @@ -25,34 +32,30 @@ def _clean_terms(self, terms) -> List[str]: seen_terms.add(term) cleaned_terms.append(term) return cleaned_terms - - def mechanistic_vocab_block(self, domain_id: str) -> str: - terms = vocabs[domain_id].get("mechanistic_terms") - label = "Mechanistic terms" - label += f" ({verbalized_domains.get(domain_id)})" if verbalized_domains.get(domain_id) else "" - if domain_id == "nlp": - terms = (vocabs[domain_id].get("training_terms") + - vocabs[domain_id].get("arch_terms") + - vocabs[domain_id].get("ablation_terms")) + + def _build_vocab_block(self, domain_id: str, block_name: str) -> str: + spec = vocab_block_specs.get(domain_id, {}).get(block_name, {}) + keys = spec.get("keys", []) + terms = [] + domain_vocab = vocabs.get(domain_id, {}) + for key in keys: + 
terms.extend(domain_vocab.get(key, []) or []) terms = self._clean_terms(terms) + if not terms: + return "" + label = spec.get("label", block_name.replace("_", " ").title()) + if verbalized_domains.get(domain_id): + label += f" ({verbalized_domains[domain_id]})" return f"{label}: " + ", ".join(terms) - def causal_vocab_block(self, domain_id: str) -> str: - terms = self._clean_terms(vocabs[domain_id].get("causal_terms")) - return "Causal connectives / triggers: " + ", ".join(terms) - - def temporal_vocab_block(self, domain_id: str) -> str: - terms = self._clean_terms(vocabs[domain_id].get("temporal_terms")) - return "Temporal expressions: " + ", ".join(terms) - def format_prompt(self, prompt: str, domain: str) -> str: """ Replaces known placeholders in the prompt with vocab blocks based on the domain. """ domain_id = domain.strip().lower() - for placeholder, method in self.placeholders.items(): + for placeholder, block_name in self.placeholders.items(): if placeholder in prompt: - block_fn = getattr(self, method) - prompt = prompt.replace(placeholder, block_fn(domain_id)) - return prompt + vocab_block = self._build_vocab_block(domain_id, block_name) + prompt = prompt.replace(placeholder, vocab_block) + return prompt \ No newline at end of file diff --git a/yescieval/rubric/__init__.py b/yescieval/rubric/__init__.py index 72ce6d4..c8f9566 100644 --- a/yescieval/rubric/__init__.py +++ b/yescieval/rubric/__init__.py @@ -4,11 +4,13 @@ from .depth import MechanisticUnderstanding, CausalReasoning, TemporalPrecision from .gap import GapIdentification from .rigor import StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment -from .innovation import SpeculativeStatements, NoveltyIndicators +from .innovation import StateOfTheArtAndNovelty +from .breadth import ScaleCoverage, ContextCoverage, ScopeCoverage, MethodCoverage, DimensionCoverage __all__ = ["Informativeness", "Correctness", "Completeness", "Coherence", "Relevancy", "Integration", "Cohesion", "Readability", "Conciseness", "MechanisticUnderstanding", "CausalReasoning", "TemporalPrecision", + "ScaleCoverage", "ContextCoverage", "ScopeCoverage", "MethodCoverage", "DimensionCoverage", "GapIdentification", "StatisticalSophistication", "CitationPractices", - "UncertaintyAcknowledgment", "SpeculativeStatements", "NoveltyIndicators"] + "UncertaintyAcknowledgment", "StateOfTheArtAndNovelty"] diff --git a/yescieval/rubric/breadth.py b/yescieval/rubric/breadth.py index dfcc99e..e6719cc 100644 --- a/yescieval/rubric/breadth.py +++ b/yescieval/rubric/breadth.py @@ -1,273 +1,319 @@ from ..base import Rubric -geographic_coverage_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +context_coverage_prompt = """ +Scientific question answering and synthesis from multiple sources require balancing depth of explanation with breadth of coverage. Breadth is a core dimension of synthesis quality: it captures how widely a response distributes attention across the range of contexts that are relevant to a research question, rather than concentrating narrowly on a single setting. 
-- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on correctness characteristic, ensuring it effectively communicates the synthesized information. +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; breadth should be evaluated independently of presentation style. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +One aspect of breadth is contextual coverage, which concerns whether a response addresses multiple distinct contexts relevant to the research question, rather than repeatedly elaborating on the same one. High breadth reflects coverage of multiple distinct and pertinent contexts without requiring exhaustive detail on any individual context. A response can exhibit high breadth even when individual contexts are treated concisely, as long as attention is distributed across them. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. 
+A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response covers multiple distinct contexts relevant to the research question, rather than repeatedly elaborating on a single context. Your judgment should be based solely on the provided question and response. -1. Geographic Coverage: is the information in the answer a correct representation of the spatial scope of the provided abstracts? +ContextCoverage: Does the response distribute attention across multiple distinct and relevant contexts (rather than concentrating narrowly on one), thereby demonstrating breadth of coverage with respect to the research question? + +Below are domain-specific examples of terms that often signal different contexts. They are examples only: their presence is not required, and repetition of the same context does not increase the score. +{CONTEXT_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Geographic Coverage -Rating 1. Very bad: The synthesis consistently misrepresents or inaccurately portrays the geographic scope of the provided abstracts, covering only a single context or ignoring relevant regions. -Rating 2. Bad: The synthesis represents some regions correctly but overlooks several important biogeographic zones or scales, showing limited breadth. -Rating 3. Moderate: The synthesis captures most relevant regions and some scale diversity, but may miss minor zones or nuances in spatial coverage. -Rating 4. Good: The synthesis provides a broad and accurate representation of multiple regions and scales, triangulating evidence across sources with minor omissions. -Rating 5. Very good: The synthesis comprehensively represents all relevant regions, scales, and contexts from the provided abstracts, accurately covering the geographic breadth without omissions. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +ContextCoverage +Rating 1. Very bad: The response focuses almost entirely on a single context, with no meaningful indication of alternative contexts relevant to the research question. +Rating 2. Bad: The response mentions more than one context, but coverage is very limited or heavily skewed toward a single dominant context. +Rating 3. Moderate: The response covers several distinct contexts relevant to the research question, but breadth is uneven or some important contexts are missing. +Rating 4. Good: The response covers a broad range of distinct and relevant contexts, distributing attention reasonably well across them, with only minor omissions. +Rating 5. Very good: The response demonstrates wide contextual breadth, clearly covering many distinct and relevant contexts and distributing attention across them rather than concentrating on any single one. + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that points to distinct contexts mentioned in the response and explains how they contribute to breadth. 
- +Return your response in JSON format: { - "Geographic Coverage": {"rating": "4", "rationale": "The synthesis accurately represents multiple regions and scales from the provided abstracts, with only minor omissions or irrelevant details."} + "ContextCoverage": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward repetition or verbosity; reward the number of distinct contexts covered. A single sentence may contribute to multiple contexts if it clearly references them. This rubric does not assess correctness, evidential grounding, or explanatory depth. """ -class GeographicCoverage(Rubric): - name: str = "Geographic Coverage" - system_prompt_template: str = geographic_coverage_prompt -intervention_diversity_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +class ContextCoverage(Rubric): + name: str = "ContextCoverage" + system_prompt_template: str = context_coverage_prompt -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on completeness characteristic, ensuring it effectively communicates the synthesized information. 
-In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +method_coverage_prompt = """ +Scientific question answering and synthesis from multiple sources require balancing depth of explanation with breadth of coverage. Breadth is a core dimension of synthesis quality: it captures how widely a response distributes attention across the range of relevant aspects of a research question, rather than concentrating narrowly on a single one. + +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; breadth should be evaluated independently of presentation style. + +One aspect of breadth is method coverage, which concerns whether a response addresses multiple distinct methods, interventions, or experimental or operational settings relevant to the research question, rather than repeatedly focusing on the same approach. High breadth reflects coverage of multiple distinct methods without requiring exhaustive detail on any single one. This rubric focuses exclusively on breadth of method or intervention coverage. Other aspects of scientific quality (such as factual accuracy, evidential grounding, or explanatory depth) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response covers a range of distinct methods, interventions, or settings relevant to the research question, rather than elaborating repeatedly on a single method. Your judgment should be based solely on the provided question and response. -1. Intervention Diversity: is the answer a comprehensive encapsulation of the relevant information in the provided abstracts, measured by the number of unique management practices? +MethodCoverage: Does the response distribute attention across multiple distinct methods, interventions, or experimental/operational settings relevant to the research question? + +Below are domain-specific examples of terms that often signal different methods or interventions. They are examples only: their presence is not required, and repetition of the same method does not increase the score. +{METHOD_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Intervention Diversity -Rating 1. 
Very bad: The synthesis omits most of the relevant interventions, capturing very few management practices from the provided abstracts. -Rating 2. Bad: The synthesis misses several important interventions, representing only a limited subset of management practices. -Rating 3. Moderate: The synthesis captures a fair number of interventions, but some relevant management practices are overlooked. -Rating 4. Good: The synthesis includes nearly all relevant interventions, missing only minor management practices. -Rating 5. Very good: The synthesis comprehensively captures all relevant interventions and management practices from the provided abstracts, without omissions. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +MethodCoverage +Rating 1. Very bad: The response focuses almost entirely on a single method or intervention, with no meaningful indication of alternative approaches. +Rating 2. Bad: The response mentions more than one method or intervention, but coverage is very limited or strongly skewed toward a single dominant approach. +Rating 3. Moderate: The response covers several distinct methods or interventions relevant to the research question, but breadth is uneven or some important approaches are missing. +Rating 4. Good: The response covers a broad range of distinct and relevant methods or interventions, distributing attention reasonably well across them, with only minor omissions. +Rating 5. Very good: The response demonstrates wide method coverage, clearly addressing many distinct and relevant methods or interventions and distributing attention across them rather than concentrating on any single one. + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that points to distinct methods or interventions mentioned in the response and explains how they contribute to breadth. - +Return your response in JSON format: { - "Intervention Diversity": {"rating": "4", "rationale": "The answer includes almost all relevant interventions from the provided abstracts, with only minor details missing."} + "MethodCoverage": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward repetition or verbosity; reward the number of distinct methods or interventions covered. A single sentence may contribute to multiple methods if it clearly references them. """ -class InterventionDiversity(Rubric): - name: str = "Intervention Diversity" - system_prompt_template: str = intervention_diversity_prompt -biodiversity_dimensions_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. 
It requires: +class MethodCoverage(Rubric): + name: str = "MethodCoverage" + system_prompt_template: str = method_coverage_prompt + +dimension_coverage_prompt = """ +Scientific question answering and synthesis from multiple sources require balancing depth of explanation with breadth of coverage. Breadth is a core dimension of synthesis quality: it captures how widely a response distributes attention across the range of relevant aspects of a research question, rather than concentrating narrowly on a single one. + +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; breadth should be evaluated independently of presentation style. -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on informativeness characteristic, ensuring it effectively communicates the synthesized information. +One aspect of breadth is dimension coverage, which concerns whether a response addresses multiple distinct descriptive or evaluative dimensions relevant to the research question, rather than repeatedly focusing on the same one. High breadth reflects coverage of multiple distinct dimensions without requiring exhaustive detail on any single one. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +This rubric focuses exclusively on breadth of dimension coverage. Other aspects of scientific quality (such as factual accuracy, evidential grounding, or explanatory depth) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. 
-A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response covers a range of distinct dimensions relevant to the research question, rather than elaborating repeatedly on a single dimension. Your judgment should be based solely on the provided question and response. -1. Biodiversity Dimensions: is the answer a comprehensive representation of the relevant biodiversity information in the provided abstracts, measured by the presence of terms related to taxonomic, functional, phylogenetic, and spatial diversity? +DimensionCoverage: Does the response distribute attention across multiple distinct descriptive or evaluative dimensions relevant to the research question? + +Below are domain-specific examples of terms that often signal different dimensions. They are examples only: their presence is not required, and repetition of the same dimension does not increase the score. +{DIMENSION_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Biodiversity Dimensions -Rating 1. Very bad: The synthesis omits most of the relevant biodiversity information, capturing very few or none of the taxonomic, functional, phylogenetic, or spatial diversity aspects. -Rating 2. Bad: The synthesis covers some biodiversity dimensions but misses several key aspects or contexts. -Rating 3. Moderate: The synthesis captures a fair number of biodiversity dimensions, but some relevant terms or contexts are overlooked. -Rating 4. Good: The synthesis includes nearly all relevant biodiversity dimensions, touching multiple contexts and scales, with only minor omissions. -Rating 5. Very good: The synthesis comprehensively captures all relevant biodiversity dimensions from the provided abstracts, accurately representing taxonomic, functional, phylogenetic, and spatial diversity without omissions. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +DimensionCoverage +Rating 1. Very bad: The response focuses almost entirely on a single dimension, with no meaningful indication of alternative dimensions. +Rating 2. Bad: The response mentions more than one dimension, but coverage is very limited or strongly skewed toward a single dominant one. +Rating 3. Moderate: The response covers several distinct dimensions relevant to the research question, but breadth is uneven or some important dimensions are missing. +Rating 4. Good: The response covers a broad range of distinct and relevant dimensions, distributing attention reasonably well across them, with only minor omissions. +Rating 5. 
Very good: The response demonstrates wide dimension coverage, clearly addressing many distinct and relevant dimensions and distributing attention across them rather than concentrating on any single one. + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that points to distinct dimensions mentioned in the response and explains how they contribute to breadth. - +Return your response in JSON format: { - "Biodiversity Dimensions": {"rating": "4", "rationale": "Most information is informative for the research question, capturing the key biodiversity dimensions with minor omissions."} + "DimensionCoverage": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward repetition or verbosity; reward the number of distinct dimensions covered. A single sentence may contribute to multiple dimensions if it clearly references them. """ -class BiodiversityDimensions(Rubric): - name: str = "Biodiversity Dimensions" - system_prompt_template: str = biodiversity_dimensions_prompt -ecosystem_services_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +class DimensionCoverage(Rubric): + name: str = "DimensionCoverage" + system_prompt_template: str = dimension_coverage_prompt + +scope_coverage_prompt = """ +Scientific question answering and synthesis from multiple sources require balancing depth of explanation with breadth of coverage. Breadth is a core dimension of synthesis quality: it captures how widely a response distributes attention across the range of relevant aspects of a research question, rather than concentrating narrowly on a single one. + +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; breadth should be evaluated independently of presentation style. -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. 
This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on informativeness characteristic, ensuring it effectively communicates the synthesized information. +One aspect of breadth is scope coverage, which concerns whether a response addresses multiple distinct scopes of applicability, impact, or relevance associated with the research question, rather than repeatedly focusing on a single scope. High breadth reflects coverage of multiple distinct scopes without requiring exhaustive detail on any single one. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +This rubric focuses exclusively on breadth of scope coverage. Other aspects of scientific quality (such as factual accuracy, evidential grounding, or explanatory depth) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response covers a range of distinct scopes relevant to the research question (e.g., different beneficiary groups, functional roles, or applicability ranges), rather than elaborating repeatedly on a single scope. Your judgment should be based solely on the provided question and response. -1. Ecosystem Services: is the answer a useful and informative reply to the question, measured by the presence of terms matched against a vocabulary aligned with the Millennium Ecosystem Assessment? 
+ScopeCoverage: Does the response distribute attention across multiple distinct scopes of applicability, impact, or relevance associated with the research question? + +Below are domain-specific examples of terms that often signal different scopes. They are examples only: their presence is not required, and repetition of the same scope does not increase the score. +{SCOPE_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Ecosystem Services -Rating 1. Very bad: The synthesis omits most relevant ecosystem services, capturing very few or none of the terms from the Millennium Ecosystem Assessment vocabulary. -Rating 2. Bad: The synthesis covers some ecosystem services but misses several key services or contexts. -Rating 3. Moderate: The synthesis captures a fair number of ecosystem services, but some relevant terms or contexts are overlooked. -Rating 4. Good: The synthesis includes nearly all relevant ecosystem services, touching multiple contexts and scales, with only minor omissions. -Rating 5. Very good: The synthesis comprehensively captures all relevant ecosystem services from the provided abstracts, accurately representing terms aligned with the Millennium Ecosystem Assessment vocabulary without omissions. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +ScopeCoverage +Rating 1. Very bad: The response focuses almost entirely on a single scope of applicability or impact, with no meaningful indication of alternative scopes. +Rating 2. Bad: The response mentions more than one scope, but coverage is very limited or strongly skewed toward a single dominant scope. +Rating 3. Moderate: The response covers several distinct scopes relevant to the research question, but breadth is uneven or some important scopes are missing. +Rating 4. Good: The response covers a broad range of distinct and relevant scopes, distributing attention reasonably well across them, with only minor omissions. +Rating 5. Very good: The response demonstrates wide scope coverage, clearly addressing many distinct and relevant scopes and distributing attention across them rather than concentrating on any single one. + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that points to distinct scopes mentioned in the response and explains how they contribute to breadth. - +Return your response in JSON format: { - "Ecosystem Services": {"rating": "4", "rationale": "The synthesis includes nearly all relevant ecosystem services from the provided abstracts, with only minor omissions."} + "ScopeCoverage": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward repetition or verbosity; reward the number of distinct scopes covered. A single sentence may contribute to multiple scopes if it clearly references them. 
""" -class EcosystemServices(Rubric): - name: str = "Ecosystem Services" - system_prompt_template: str = ecosystem_services_prompt -spatial_scale_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +class ScopeCoverage(Rubric): + name: str = "ScopeCoverage" + system_prompt_template: str = scope_coverage_prompt + -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on informativeness characteristic, ensuring it effectively communicates the synthesized information. +scale_coverage_prompt = """ +Scientific question answering and synthesis from multiple sources require balancing depth of explanation with breadth of coverage. Breadth is a core dimension of synthesis quality: it captures how widely a response distributes attention across the range of relevant aspects of a research question, rather than concentrating narrowly on a single one. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; breadth should be evaluated independently of presentation style. 
+ +One aspect of breadth is scale coverage, which concerns whether a response addresses multiple distinct scales relevant to the research question, rather than repeatedly focusing on a single level. High breadth reflects coverage of multiple distinct scales without requiring exhaustive detail on any single one. + +This rubric focuses exclusively on breadth of scale coverage. Other aspects of scientific quality (such as factual accuracy, evidential grounding, or explanatory depth) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response covers a range of distinct scales relevant to the research question (e.g., from fine-grained to coarse-grained levels), rather than elaborating repeatedly on a single scale. Your judgment should be based solely on the provided question and response. -1. Spatial Scale: is the answer a useful and informative reply to the question, measured by the presence of explicit scale terms (e.g., “local,” “regional,” “continental”) and area measures? +ScaleCoverage: Does the response distribute attention across multiple distinct scales of analysis, organization, or application relevant to the research question? + +Below are domain-specific examples of terms that often signal different scales. They are examples only: their presence is not required, and repetition of the same scale does not increase the score. +{SCALE_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Spatial Scale -Rating 1. Very bad: The synthesis omits most relevant spatial scale information, capturing very few or none of the scale terms or area measures. -Rating 2. Bad: The synthesis covers some scale information but misses several key scales or contexts. -Rating 3. Moderate: The synthesis captures a fair amount of spatial scale information, but some relevant terms or area measures are overlooked. -Rating 4. Good: The synthesis includes nearly all relevant spatial scale information, touching multiple scales and contexts, with only minor omissions. -Rating 5. Very good: The synthesis comprehensively captures all relevant spatial scale information from the provided abstracts, accurately representing scale terms and area measures without omissions. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +ScaleCoverage +Rating 1. 
Very bad: The response focuses almost entirely on a single scale, with no meaningful indication of alternative scales relevant to the research question. +Rating 2. Bad: The response mentions more than one scale, but coverage is very limited or strongly skewed toward a single dominant scale. +Rating 3. Moderate: The response covers several distinct scales relevant to the research question, but breadth is uneven or some important scales are missing. +Rating 4. Good: The response covers a broad range of distinct and relevant scales, distributing attention reasonably well across them, with only minor omissions. +Rating 5. Very good: The response demonstrates wide scale coverage, clearly addressing many distinct and relevant scales and distributing attention across them rather than concentrating on any single one. + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that points to distinct scales mentioned in the response and explains how they contribute to breadth. - +Return your response in JSON format: { - "Spatial Scale": {"rating": "4", "rationale": "The synthesis includes nearly all relevant spatial scale information from the provided abstracts, with only minor omissions."} + "ScaleCoverage": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward repetition or verbosity; reward the number of distinct scales covered. A single sentence may contribute to multiple scales if it clearly references them. """ -class SpatialScale(Rubric): - name: str = "Spatial Scale" - system_prompt_template: str = spatial_scale_prompt - +class ScaleCoverage(Rubric): + name: str = "ScaleCoverage" + system_prompt_template: str = scale_coverage_prompt diff --git a/yescieval/rubric/depth.py b/yescieval/rubric/depth.py index ef74a5c..f97e1b6 100644 --- a/yescieval/rubric/depth.py +++ b/yescieval/rubric/depth.py @@ -26,7 +26,6 @@ Below are domain-specific terms and phrases that often signal mechanistic discussion. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score. - {MECHANISTIC_VOCAB} @@ -52,9 +51,7 @@ - {EXAMPLE_RESPONSES} - @@ -91,7 +88,6 @@ class MechanisticUnderstanding(Rubric): Below are examples of causal connectives and expressions that often signal causal reasoning (across domains). They are examples only: their presence is not required, and their presence alone is not sufficient for a high score. - {CAUSAL_VOCAB} @@ -117,9 +113,7 @@ class MechanisticUnderstanding(Rubric): - {EXAMPLE_RESPONSES} - @@ -156,7 +150,6 @@ class CausalReasoning(Rubric): Below are examples of temporal expressions. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score. 
- {TEMPORAL_VOCAB} @@ -182,9 +175,7 @@ class CausalReasoning(Rubric): - {EXAMPLE_RESPONSES} - diff --git a/yescieval/rubric/gap.py b/yescieval/rubric/gap.py index 8c6fa2a..9135480 100644 --- a/yescieval/rubric/gap.py +++ b/yescieval/rubric/gap.py @@ -1,55 +1,65 @@ from ..base import Rubric -gap_identification_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +gap_identification_prompt = """ +Scientific question answering and synthesis often require more than listing findings: high-quality scientific writing identifies what remains unknown, insufficiently addressed, or unresolved in existing research. This is commonly expressed through gap identification, where the text specifies limitations, missing knowledge, unresolved inconsistencies, or missing connections in prior work and explains why they matter for the research question. -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on informativeness characteristic, ensuring it effectively communicates the synthesized information. +Gap identification can refer to (a) field-level gaps across the literature (e.g., missing populations/settings, inconsistent measures, lack of comparative studies, limited external validity), and/or (b) recurring study-level limitations that materially constrain conclusions (e.g., small samples, short time horizons) when framed as evidence-base limitations. High-quality gap statements are specific and scoped (what is missing, where, and why), rather than generic (“more research is needed”). 
If gap emphasis is not central to the user's question, it should not be forced; however, when included, it should remain relevant and well-defined. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; gap identification should be evaluated independently of presentation style. + +This rubric focuses exclusively on the presence and quality of gap identification within the provided text, emphasizing explicit and relevant statements of limitations, unanswered questions, or missing connections in prior work rather than summaries of what is already known. Other aspects of scientific quality (such as factual accuracy, evidential grounding, completeness, or innovation) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response identifies what is missing or unresolved in the evidence base relevant to the research question, rather than only describing existing findings. Your judgment should be based solely on the provided question and response. -1. Gap Identification: To what extent does the answer explicitly identify research gaps or unanswered questions indicated by the provided abstracts? +GapIdentification: Does the response identify gaps relevant to the research question by specifying limitations, missing knowledge, unresolved issues, or missing connections in prior work (preferably at the evidence-base/field level), rather than only describing existing findings? + +Below are terms and phrases that often signal gap identification. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score. +{GAP_IDENTIFICATION_VOCAB} + + -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Gap Identification -Rating 1. Very bad: The synthesis does not identify any research gaps or unanswered questions, or it introduces gaps that are contradictory to or unsupported by the provided abstracts. -Rating 2. 
Bad: The synthesis refers to research gaps only in a vague or generic manner (e.g., “more research is needed”) without clearly specifying what is missing or grounding the claims in the provided abstracts. -Rating 3. Moderate: The synthesis identifies potential research gaps, but the description is partially vague, weakly justified, or only loosely connected to the content of the provided abstracts. -Rating 4. Good: The synthesis clearly identifies one or more research gaps that are supported by the provided abstracts, though the gaps may lack full operational detail or discussion of implications. -Rating 5. Very good: The synthesis explicitly and clearly identifies well-defined research gaps or unanswered questions that are directly supported by the provided abstracts, specifying what is missing, where, and why it matters, without relying on vague or generic statements. +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. + +GapIdentification +Rating 1. Very bad: The response is purely descriptive, summarizing existing findings or observations with no identification of missing, unknown, inconsistent, or unresolved aspects relevant to the research question. +Rating 2. Bad: The response refers to gaps only in a vague or generic manner (e.g., “more research is needed”) without clearly specifying what is missing or unresolved in the context of the research question. +Rating 3. Moderate: The response identifies one or more potential gaps with partial clarity, but the nature, scope, location (where in the literature), or relevance of the gaps is incomplete, unclear, or inconsistently articulated. +Rating 4. Good: The response clearly identifies specific gaps or limitations in the evidence base that are relevant to the research question (e.g., missing comparisons, populations, settings, measures, time horizons, or conflicting results) and provides some explanation of why they matter; minor ambiguity or imprecision may remain. +Rating 5. Very good: The response explicitly and clearly identifies well-defined gaps or unanswered questions, specifying what is missing, where it occurs in the evidence base (or across studies), and why it matters for the research question, without relying on vague statements. Gap statements are appropriately scoped (avoiding absolute claims like “no research exists” unless clearly justified within the response). + -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale that cites specific statements indicating whether the response clearly identifies research gaps in the context of the research question (what is missing/unresolved, where, and why it matters). - +Return your response in JSON format: { - "Gap Identification": {"rating": "4", "rationale": "Identifies a relevant gap supported by the abstracts, with limited elaboration."} + "GapIdentification": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. 
Do not reward keyword mentions; reward distinct, clearly described gaps in existing knowledge that are relevant to the research question and scoped appropriately. This rubric does not assess factual correctness, evidential grounding, completeness, or innovation. """ + class GapIdentification(Rubric): - name: str = "Gap Identification" - system_prompt_template: str = gap_identification_prompt + name: str = "GapIdentification" + system_prompt_template: str = gap_identification_prompt \ No newline at end of file diff --git a/yescieval/rubric/innovation.py b/yescieval/rubric/innovation.py index 628fa5d..cb61677 100644 --- a/yescieval/rubric/innovation.py +++ b/yescieval/rubric/innovation.py @@ -1,111 +1,65 @@ from ..base import Rubric -speculative_statements_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: +state_of_the_art_and_novelty_prompt = """ +Scientific question answering and synthesis often require more than listing findings: high-quality scientific writing can surface what is genuinely innovative in the literature and explain how it differs from prior or established approaches. In synthesis settings (e.g., reports summarizing multiple papers), this is expressed by identifying specific novel contributions (e.g., new methods, new datasets, new capabilities, new theoretical framings, proof-of-concept results) and situating them relative to an implicit or explicit baseline (what was done before, what limitation is addressed, what capability is newly enabled). -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. 
-- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on correctness characteristic, ensuring it effectively communicates the synthesized information. +Importantly, this rubric is not about using buzzwords like “breakthrough” or “state-of-the-art” in isolation. High scores require novelty to be concrete and meaningfully contextualized (new relative to what, and why it matters). Also, not every research question requires emphasizing novelty: some questions primarily ask for established consensus or background. In such cases, it can be appropriate to focus on established knowledge; however, a strong response may still note whether the field is mature versus rapidly evolving, or explicitly state that novelty emphasis is not central to the question. -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. +The response may be a single paragraph or a long-form report with multiple sections. There are no strict requirements on length or formatting; innovation should be evaluated independently of presentation style. + +This rubric focuses exclusively on the presence and quality of innovation identification within the provided text—i.e., whether the response highlights specific novel contributions and explains their significance relative to prior work—rather than merely summarizing established knowledge or using generic novelty language. Other aspects of scientific quality (such as factual accuracy, evidential grounding, completeness, or mechanistic depth) are intentionally outside its scope and are assessed by separate evaluation criteria. -You are tasked as a scientific syntheses quality evaluator. +You are tasked as a scientific writing quality evaluator. -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. +A user will provide you with: +1) a research question, and +2) a written response intended to address that question. + +You must evaluate the response using the evaluation characteristic below. Focus on whether the response identifies and contextualizes innovation in a concrete, relevant way (what is new, relative to what, and why it matters), rather than relying on vague novelty indicators or merely repeating established knowledge. Your judgment should be based solely on the provided question and response. -1. Speculative Statements: Does the answer clearly distinguish speculation (e.g., “might,” “could”) from established findings in the provided abstracts? 
+StateOfTheArtAndNovelty: Does the response identify specific state-of-the-art and/or novel contributions relevant to the research question (e.g., new methods, datasets, capabilities, theoretical framings, proof-of-concept results), and meaningfully contextualize them relative to prior or established work (i.e., new relative to what, and why it matters)? If novelty emphasis is not central to the question, does the response avoid forced novelty and (optionally) state that the evidence base is mature or that innovation is not the focus? - -For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic. - -1. Speculative Statement -Rating 1. Very bad: No innovation is present; the synthesis does not differ from prior work and may present speculation as fact. -Rating 2. Bad: The synthesis shows little originality, relies on vague statements (e.g., “more research is needed”), and does not clearly distinguish from prior work. -Rating 3. Moderate: The synthesis shows some originality, but the novel aspects are weak, underspecified, or not clearly differentiated from prior work. -Rating 4. Good: The synthesis presents a clear novel angle or synthesis compared to prior work, with speculation appropriately flagged but limited in depth or specificity. -Rating 5. Very good: The synthesis offers a genuinely novel synthesis or perspective, clearly distinguishes itself from prior work, appropriately bounds speculation, and proposes concrete, testable next steps. - - - -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} - - -{ - "Speculative Statements": {"rating": "4", "rationale": "Uses hedging appropriately and clearly distinguishes speculation from established findings."} -} - - - - -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. -""" -class SpeculativeStatements(Rubric): - name: str = "Speculative Statements" - system_prompt_template: str = speculative_statements_prompt - -novelty_indicators_prompt = """ -Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires: + +Below are terms and phrases that often co-occur with innovation claims. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score. +{NOVELTY_INDICATORS_VOCAB} + -- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources. -- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. 
This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question. -- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information. -- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question. -- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability. -- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on completeness characteristic, ensuring it effectively communicates the synthesized information. - -In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness. - - - -You are tasked as a scientific syntheses quality evaluator. - - - -A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim. - + +For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below. - -1. Novelty Indicators: Does the answer appropriately use self-declared innovation terms (e.g., “novel,” “pioneering,” “emerging”) and clearly indicate whether such claims are supported by the provided abstracts? - +StateOfTheArtAndNovelty +Rating 1. Very bad: The response provides only established/background knowledge or a generic summary, with no identification of specific state-of-the-art or novel contributions where such identification would be relevant; or it uses novelty buzzwords (“breakthrough”, “SOTA”) without any concrete explanation. +Rating 2. Bad: The response occasionally signals state-of-the-art or novelty, but claims are vague, generic, or weakly connected to the research question; novelty is not contextualized (no clear “new relative to what”) and/or seems forced. +Rating 3. Moderate: The response identifies at least one potentially state-of-the-art or novel contribution, but description, relevance, or significance is partially unclear; contextualization relative to prior work is limited or inconsistent; proof-of-concept vs established advances may be conflated. 
+Rating 4. Good: The response clearly highlights multiple specific state-of-the-art and/or innovative contributions relevant to the research question and provides reasonable contextualization (what limitation is addressed or what capability is newly enabled), with minor gaps in baseline comparison or scope. +Rating 5. Very good: The response provides a coherent, well-structured account of state-of-the-art and novelty tightly aligned with the research question: it identifies multiple specific novel contributions, clearly explains how each differs from prior/established approaches (explicit or implicit baseline), and articulates why each matters (capabilities, limitations addressed, or new directions), while appropriately scoping claims (e.g., proof-of-concept vs. broadly validated). If novelty emphasis is not central to the question, it avoids forced novelty and explicitly frames the field's maturity and the relevance of innovation appropriately. - -For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating. -Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}} +Rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale pointing to specific parts of the response that (a) identify concrete state-of-the-art/novel contributions and (b) contextualize them relative to prior work (or, if innovation is not central, show that the response appropriately avoids forced novelty). - +Return your response in JSON format: { - "Novelty Indicators": {"rating": "4", "rationale": "Shows a clear novel angle, but lacks full detail."} + "StateOfTheArtAndNovelty": {"rating": "", "rationale": ""} } - + +{EXAMPLE_RESPONSES} + + -Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material. +Your evaluation must be based solely on the provided research question and response. Do not reward novelty buzzwords by themselves; reward specific, relevant identification of what is new, contextualization relative to prior work (“new compared to what”), and appropriate scoping of innovation claims. This rubric does not assess factual correctness, evidential grounding, completeness, or mechanistic depth. """ -class NoveltyIndicators(Rubric): - name: str = "Novelty Indicators" - system_prompt_template: str = novelty_indicators_prompt - +class StateOfTheArtAndNovelty(Rubric): + name: str = "StateOfTheArtAndNovelty" + system_prompt_template: str = state_of_the_art_and_novelty_prompt \ No newline at end of file
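All of the renamed rubrics in this patch share one contract: each class's `name` now matches the JSON key the judge is instructed to return, and each `system_prompt_template` carries placeholder blocks such as `{NOVELTY_INDICATORS_VOCAB}` and `{EXAMPLE_RESPONSES}`. The sketch below illustrates that contract end to end. It is illustrative only: the bare instantiation and the `str.replace` wiring are assumptions, since the library itself fills these blocks through its injector classes (`VocabularyInjector` / `ExampleInjector`), whose API is not shown in this patch.

```python
import json

from yescieval import StateOfTheArtAndNovelty

# Assumption: Rubric subclasses instantiate with their declared defaults
# (name, system_prompt_template), as in the class definitions above.
rubric = StateOfTheArtAndNovelty()

# Fill the placeholder blocks. str.replace is used rather than str.format
# because the template body contains literal JSON braces, e.g.
# {"StateOfTheArtAndNovelty": {"rating": "", "rationale": ""}},
# which str.format would try (and fail) to interpret as fields.
prompt = rubric.system_prompt_template
prompt = prompt.replace("{NOVELTY_INDICATORS_VOCAB}",
                        "novel, state-of-the-art, first demonstration of")
prompt = prompt.replace("{EXAMPLE_RESPONSES}", "")  # few-shot block left empty here

# Per the template, the judge must reply with JSON keyed by the rubric name.
# A hypothetical reply, for illustration:
reply = '{"StateOfTheArtAndNovelty": {"rating": "4", "rationale": "..."}}'
parsed = json.loads(reply)

# Because class name, rubric name, and JSON key now coincide, the reply
# can be validated directly against rubric.name.
assert rubric.name in parsed
assert 1 <= int(parsed[rubric.name]["rating"]) <= 5
```

The same pattern applies to DimensionCoverage, ScopeCoverage, ScaleCoverage, and GapIdentification, whose templates follow the same placeholder convention (e.g., `{SCOPE_VOCAB}`, `{SCALE_VOCAB}`, `{GAP_IDENTIFICATION_VOCAB}`).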