Merged
5 changes: 3 additions & 2 deletions README.md
@@ -91,14 +91,15 @@ Judges within YESciEval are defined as follows:
| `GPTCustomAutoJudge`| Custom GPT-based LLM that can be used as a judge within YESciEval |


A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
A total of **22** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following simple example shows how to import rubrics in your code:

```python
from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,\
Integration, Cohesion, Readability, Conciseness,\
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,\
ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage,\
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,\
SpeculativeStatements, NoveltyIndicators
StateOfTheArtAndNovelty
```

A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
40 changes: 21 additions & 19 deletions docs/source/rubrics.rst
@@ -2,22 +2,23 @@
Rubrics
===================

A total of **21** evaluation rubrics were defined as part of the YESciEval test framework within two categories presented as following:
A total of **22** evaluation rubrics were defined as part of the YESciEval test framework, organized into two categories as follows:

.. hint::


Here is a simple example of how to import rubrics in your code:
Here is a simple example of how to import rubrics in your code:

.. code-block:: python
.. code-block:: python

from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,
Integration, Cohesion, Readability, Conciseness,
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
SpeculativeStatements, NoveltyIndicators
from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy, \
    Integration, Cohesion, Readability, Conciseness, \
    MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
    ContextCoverage, MethodCoverage, DimensionCoverage, ScopeCoverage, ScaleCoverage, \
    StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
    StateOfTheArtAndNovelty

The rubrics are presented as following:
The rubrics are presented as follows:


Question Answering
@@ -150,7 +151,7 @@ Following ``Research Breadth Assessment`` evaluates the diversity of evidence ac
- Does the answer distribute attention across multiple distinct scales relevant to the research question?

Scientific Rigor Assessment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodological integrity of the synthesis.

@@ -161,13 +162,15 @@ Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodolo

* - Evaluation Rubric
- Description
* - **18. Quantitative Evidence And Uncertainty:**
- Does the answer appropriately handle quantitative evidence and uncertainty relevant to the research question?
* - **19. Epistemic Calibration:**
- Does the answer clearly align claim strength with evidential support by marking uncertainty, assumptions, and limitations where relevant?
* - **18. Statistical Sophistication:**
- Does the answer use statistical methods or analyses, showing quantitative rigor and depth?
* - **19. Citation Practices:**
- Does the answer properly cite sources, using parenthetical or narrative citations (e.g., “(Smith et al., 2021)”)?
* - **20. Uncertainty Acknowledgment:**
- Does the answer explicitly mention limitations or uncertainty, using terms like “unknown,” “limited evidence,” or “unclear”?

Innovation Capacity Assessment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Following ``Innovation Capacity Assessment`` evaluates the novelty of the synthesis.

@@ -178,9 +181,8 @@

* - Evaluation Rubric
- Description
* - **20. State-Of-The-Art And Novelty :**
- Does the response identify and contextualize relevant state-of-the-art or novel contributions relative to prior work?

* - **21. State Of The Art And Novelty:**
- Does the response identify specific state-of-the-art and/or novel contributions relevant to the research question, using terms like “novel” or “state-of-the-art”?

Research Gap Assessment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -194,7 +196,7 @@ Following ``Research Gap Assessment`` detects explicit acknowledgment of unanswe

* - Evaluation Rubric
- Description
* - **21. Gap Identification:**
* - **22. Gap Identification:**
- Does the answer point out unanswered questions or understudied areas, using terms like “research gap” or “understudied”?


3 changes: 2 additions & 1 deletion yescieval/__init__.py
@@ -5,9 +5,10 @@
from .base import Rubric, Parser
from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
Integration, Cohesion, Readability, Conciseness,
ScaleCoverage, ContextCoverage, ScopeCoverage, MethodCoverage, DimensionCoverage,
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
SpeculativeStatements, NoveltyIndicators)
StateOfTheArtAndNovelty)
from .injector import ExampleInjector, VocabularyInjector
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge, GPTCustomAutoJudge
from .parser import GPTParser
5 changes: 3 additions & 2 deletions yescieval/base/domain.py
@@ -1,9 +1,10 @@
from abc import ABC
from pydantic import BaseModel
from typing import Dict
from typing import Dict, Optional

class Domain(BaseModel, ABC):
examples: Dict[str, Dict] = None
vocab: Dict[str, Dict] = None
ID: str = None
verbalized: str = None
verbalized: str = None
vocab_block_specs: Optional[Dict[str, Dict[str, object]]] = None
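The `Domain` model above can be exercised with a small sketch. To keep it dependency-free, this uses a dataclass stand-in rather than pydantic's `BaseModel`; the field names mirror the diff, while `DomainSketch` and its usage are hypothetical, not part of yescieval.

```python
# Dependency-free sketch of the Domain fields above, using a dataclass
# stand-in for the real pydantic BaseModel subclass; DomainSketch is a
# hypothetical name, not part of yescieval.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DomainSketch:
    examples: Optional[Dict[str, Dict]] = None
    vocab: Optional[Dict[str, Dict]] = None
    ID: Optional[str] = None
    verbalized: Optional[str] = None
    vocab_block_specs: Optional[Dict[str, Dict[str, object]]] = None

# Mirrors how the Ecology class later in this diff fills the fields.
eco = DomainSketch(ID="ecology", verbalized="Ecology")
print(eco.ID, eco.verbalized)
```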
4 changes: 3 additions & 1 deletion yescieval/injector/domains/__init__.py
@@ -8,4 +8,6 @@

example_responses: Dict[str, Dict] = {domain.ID: domain.examples for domain in domains}

verbalized_domains: Dict[str, str] = {domain.ID: domain.verbalized for domain in domains}
verbalized_domains: Dict[str, str] = {domain.ID: domain.verbalized for domain in domains}

vocab_block_specs: Dict[str, Dict[str, object]] = {domain.ID: domain.vocab_block_specs for domain in domains}
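The dict comprehensions above collect one attribute per domain into module-level maps keyed by domain ID. A minimal, self-contained sketch of the same pattern; `SimpleNamespace` stands in for the real `Domain` subclasses, and the "marine" domain is purely illustrative.

```python
# Sketch of the aggregation pattern above: one module-level dict per domain
# attribute, keyed by each domain's ID. SimpleNamespace stands in for the
# real Domain subclasses, and the "marine" domain is illustrative only.
from types import SimpleNamespace

domains = [
    SimpleNamespace(ID="ecology", verbalized="Ecology",
                    vocab_block_specs={"temporal_vocab_block": {"keys": ["temporal_terms"]}}),
    SimpleNamespace(ID="marine", verbalized="Marine Biology",
                    vocab_block_specs=None),
]

verbalized_domains = {d.ID: d.verbalized for d in domains}
vocab_block_specs = {d.ID: d.vocab_block_specs for d in domains}

print(verbalized_domains["ecology"])  # Ecology
```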
109 changes: 106 additions & 3 deletions yescieval/injector/domains/ecology.py
@@ -27,7 +27,7 @@
"genetic diversity", "structural diversity", "shannon", "simpson", "hill numbers"
],
"temporal_terms" :[
"within 25 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "19982004",
"within 2-5 years", "lag of ~6 months", "after 3 months", "before 12 weeks", "1998-2004",
"June 2012", "every 2 weeks"
],
"ecosystem_services": [
@@ -62,7 +62,20 @@
"climate change", "global warming", "drought", "heatwave", "extreme weather", "phenology", "range shift",
"sea level rise", "ocean acidification", "greenhouse gas", "carbon dioxide", "thermal stress", "precipitation"
],
"complexity_terms": ["nonlinear", "emergent", "synergistic", "interconnected", "complex", "multifaceted"]
"complexity_terms": ["nonlinear", "emergent", "synergistic", "interconnected", "complex", "multifaceted"],
"gap_identification": [
"remains unclear", "unknown", "not well understood", "limited evidence", "mixed findings", "inconsistent results", "lack of consensus",
"understudied", "data scarce", "few studies", "limited sample size", "short time horizon", "lack of longitudinal data",
"geographic bias", "taxonomic bias", "context dependence", "limited external validity", "missing comparison", "unresolved"
],
"novelty_indicators": [
"first to", "novel", "new approach", "new method", "recent advances", "state of the art", "cutting-edge",
"proof-of-concept", "pilot study", "new dataset", "long-term dataset", "high-resolution data",
"remote sensing", "satellite", "LiDAR", "eDNA", "metabarcoding", "new sampling protocol", "new monitoring approach",
"hierarchical model", "Bayesian", "causal inference", "counterfactual", "difference-in-differences", "instrumental variable",
"meta-analysis", "systematic review", "scenario analysis", "climate projection", "compared to previous studies",
"unlike prior work", "addresses a limitation"
]
}

example_responses = {
@@ -97,11 +110,101 @@
"rationale": "The response uses specific and bounded temporal expressions, for example describing changes occurring within 2-5 years, after 3 months, or every 2 weeks, and referencing defined time periods such as 1998-2004 or June 2012."
}
]
},
"Breadth": {
"ContextCoverage": [
{
"rating": "1",
"rationale": "The response discusses only a single ecological setting and does not reference any alternative regions or biomes relevant to the research question."
},
{
"rating": "4",
"rationale": "The response covers multiple distinct ecological contexts, such as different regions and ecosystem types, and distributes attention across them rather than focusing on a single setting."
}
],
"MethodCoverage": [
{
"rating": "1",
"rationale": "The response focuses exclusively on a single management or intervention approach (e.g., controlled burning) and does not reference any alternative methods relevant to the research question."
},
{
"rating": "4",
"rationale": "The response discusses multiple distinct interventions or management approaches, such as controlled burning, grazing management, habitat restoration, and protected areas, distributing attention across them rather than focusing on a single method."
}
],
"DimensionCoverage": [
{
"rating": "1",
"rationale": "The response focuses on a single ecological dimension and does not meaningfully address other relevant dimensions."
},
{
"rating": "4",
"rationale": "The response covers multiple ecological dimensions, including taxonomic, functional, and phylogenetic diversity, as well as measures such as species richness, evenness, abundance, and genetic and structural diversity, rather than focusing on a single dimension."
}
],
"ScopeCoverage": [
{
"rating": "1",
"rationale": "The response addresses only a very narrow aspect of ecological impact and remains vague, providing little indication that the findings apply beyond a single, limited scope."
},
{
"rating": "4",
"rationale": "The response discusses several types of ecosystem services, from provisioning and regulating services to supporting and cultural services, rather than focusing on a single, limited scope."
}
],
"ScaleCoverage": [
{
"rating": "1",
"rationale": "The response focuses on a single ecological scale and does not meaningfully consider how the findings apply at other relevant scales."
},
{
"rating": "4",
"rationale": "The response addresses multiple ecological scales, ranging from the individual and population level to community and ecosystem scales, and also considers broader spatial scales such as local, regional, and global contexts, rather than focusing on just one scale."
}
]
},
"Gap": {
"GapIdentification": [
{
"rating": "1",
"rationale": "The response is purely descriptive, summarizing existing ecological findings, observations, or reported patterns (e.g., species distributions, biodiversity metrics, or observed correlations) without identifying any missing, unknown, inconsistent, or unresolved aspects relevant to the research question."
},
{
"rating": "4",
"rationale": "The response clearly identifies specific gaps or limitations in the ecological evidence base that are relevant to the research question (e.g., missing data for certain regions, taxa, or time periods; limited experimental studies; or conflicting empirical findings) and provides some explanation of why these gaps matter; minor ambiguity or imprecision may remain."
}
]
},
"Innovation": {
"StateOfTheArtAndNovelty": [
{
"rating": "1",
"rationale": "The response gives a generic overview of known ecological findings or methods without identifying any specific state-of-the-art approaches or novel contributions; or it uses buzzwords like state of the art or cutting-edge without explaining what is new."
},
{
"rating": "4",
"rationale": "The response identifies concrete state-of-the-art or novel ecological contributions (e.g., new datasets, long-term or high-resolution data, remote sensing such as satellite or LiDAR, eDNA/metabarcoding, or new modeling or monitoring approaches) and briefly explains what improvement or new capability they provide, with minor gaps in comparison or detail."
}
]
}
}

vocab_block_specs = {
"mechanistic_vocab_block": {"label": "Mechanistic terms", "keys": ["mechanistic_terms"]},
"causal_vocab_block": {"label": "Causal connectives / triggers", "keys": ["causal_terms"]},
"temporal_vocab_block": {"label": "Temporal expressions", "keys": ["temporal_terms"]},
"context_coverage_vocab_block": {"label": "Context Coverage", "keys": ["regions"]},
"method_coverage_vocab_block": {"label": "Method Coverage", "keys": ["interventions"]},
"dimension_coverage_vocab_block": {"label": "Dimension Coverage", "keys": ["diversity_dimensions"]},
"scope_coverage_vocab_block": {"label": "Scope Coverage", "keys": ["ecosystem_services"]},
"scale_coverage_vocab_block": {"label": "Scale Coverage", "keys": ["scale_terms"]},
"gap_identification_vocab_block": {"label": "Gap Identification", "keys": ["gap_identification"]},
"novelty_indicators_vocab_block": {"label": "State of the Art and Novelty Indicators", "keys": ["novelty_indicators"]}
}

class Ecology(Domain):
examples: Dict[str, Dict] = example_responses
vocab: Dict[str, Dict] = vocabulary
ID: str = "ecology"
verbalized: str = "Ecology"
verbalized: str = "Ecology"
vocab_block_specs: Dict[str, Dict[str, object]] = vocab_block_specs
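Each entry in `vocab_block_specs` points at one or more keys of the `vocabulary` dict via its `"keys"` list. A hedged sketch of how a spec could be resolved into a labeled term list; `render_block` is a hypothetical helper, not part of yescieval, and the injector may consume these specs differently.

```python
# Hypothetical resolver for the vocab block specs above: look up each key
# listed in a spec within the vocabulary dict and join the matching terms
# under the spec's label. render_block is not part of yescieval.
vocabulary = {
    "temporal_terms": ["within 2-5 years", "lag of ~6 months", "after 3 months"],
}
spec = {"label": "Temporal expressions", "keys": ["temporal_terms"]}

def render_block(spec, vocab):
    terms = [term for key in spec["keys"] for term in vocab.get(key, [])]
    return f"{spec['label']}: " + ", ".join(terms)

print(render_block(spec, vocabulary))
```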