1 change: 1 addition & 0 deletions .gitignore
@@ -133,6 +133,7 @@ celerybeat.pid
# Environments
.env
.venv
myenv/
env/
venv/
ENV/
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
## Changelog

### v0.3.0 (December 20, 2025)
- Add more rubrics (PR #3)
- Update documentation for new rubrics
- Minor bug fixes
- Update README

### v0.2.0 (May 30, 2025)
- Add custom judge module.
- Add documentation.
46 changes: 33 additions & 13 deletions README.md
@@ -87,32 +87,52 @@ Judges within YESciEval are defined as follows:
| `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
| `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
| `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
| `CustomAutoJudge`| Custom LLM that can be used as a judge within YESciEval rubrics |
| `CustomAutoJudge` | Custom judge that wraps any open-source LLM for use with YESciEval rubrics. |

A total of nine evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following simple example shows how to import rubrics in your code:

```python
from yescieval import Informativeness, Correctness, Completeness,
Coherence, Relevancy, Integration,
Cohesion, Readability, Conciseness
from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy, \
Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators

```

A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
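
Once imported, a rubric is combined with a judge in the same way as in the Quickstart guide. The snippet below is a minimal, illustrative sketch: the values of `papers`, `question`, and `answer` are placeholders, and the structure assumed for `papers` is an assumption here (see the Quickstart for the exact expected format).

```python
from yescieval import Readability, AskAutoJudge

# Placeholder inputs (assumed structure -- see the Quickstart for the exact expected format)
papers = {"Paper 1": "Abstract of the first source paper ...",
          "Paper 2": "Abstract of the second source paper ..."}
question = "What drives biodiversity loss in tropical forests?"
answer = "A synthesized answer produced by an LLM ..."

# A rubric wraps the inputs and builds the instruction prompt for the judge
rubric = Readability(papers=papers, question=question, answer=answer)

# Load a pretrained judge and score the answer against the rubric
judge = AskAutoJudge()
judge.from_pretrained(device="auto", token="your_huggingface_token")

result = judge.evaluate(rubric=rubric)
print(result)
```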

## 💡 Acknowledgements

If you use YESciEval in your research, please cite:
If you find this repository helpful or use YESciEval in your work or research, feel free to cite our publication:


```bibtex
@article{d2025yescieval,
title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
journal={arXiv preprint arXiv:2505.14279},
year={2025}
}
@inproceedings{dsouza-etal-2025-yescieval,
title = "{YES}ci{E}val: Robust {LLM}-as-a-Judge for Scientific Question Answering",
author = {D{'}Souza, Jennifer and
Babaei Giglou, Hamed and
M{\"u}nch, Quentin},
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.675/",
doi = "10.18653/v1/2025.acl-long.675",
pages = "13749--13783",
ISBN = "979-8-89176-251-0"
}
```
> For other citation formats, please refer to https://aclanthology.org/2025.acl-long.675/.


This work is licensed under a [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
This software is licensed under the MIT License [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).



22 changes: 21 additions & 1 deletion docs/source/quickstart.rst
@@ -1,7 +1,7 @@
Quickstart
=================

YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** using a pretrained judge and parse LLM output into structured JSON.
YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through evaluating answers for **informativeness** (with a pretrained judge) and **gap identification** (with a custom judge), and parsing LLM output into structured JSON.


**Example: Evaluating an Answer Using Informativeness + AskAutoJudge**
@@ -46,6 +46,26 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
- Use ``device="cuda"`` if running on a GPU for better performance.
- Add more rubrics such as ``Informativeness``, ``Relevancy``, etc. for multi-criteria evaluation (a sketch follows below).
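
A minimal sketch of such a multi-criteria run is shown below; the loop and the ``results`` dictionary are illustrative only and reuse the ``papers``, ``question``, and ``answer`` inputs prepared earlier in this guide.

.. code-block:: python

from yescieval import Informativeness, Relevancy, Correctness, AskAutoJudge

# Load a single judge once and reuse it across rubrics
judge = AskAutoJudge()
judge.from_pretrained(device="auto", token="your_huggingface_token")

# Evaluate the same answer against several rubrics and collect the raw outputs
results = {}
for rubric_cls in (Informativeness, Relevancy, Correctness):
    rubric = rubric_cls(papers=papers, question=question, answer=answer)
    results[rubric_cls.__name__] = judge.evaluate(rubric=rubric)

for name, raw_output in results.items():
    print(name)
    print(raw_output)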


**Example: Evaluating an Answer Using GapIdentification + CustomAutoJudge**

.. code-block:: python

from yescieval import GapIdentification, CustomAutoJudge

# Step 1: Create a rubric
rubric = GapIdentification(papers=papers, question=question, answer=answer)
instruction_prompt = rubric.instruct()

# Step 2: Load the evaluation model (judge)
judge = CustomAutoJudge()
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
print("Raw Evaluation Output:")
print(result)

**Parsing Raw Output with GPTParser**

If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.
104 changes: 100 additions & 4 deletions docs/source/rubrics.rst
@@ -2,7 +2,7 @@
Rubrics
===================

A total of nine evaluation rubrics were defined as part of the YESciEval test framework.
A total of twenty-three (23) evaluation rubrics were defined as part of the YESciEval test framework.

Linguistic & Stylistic Quality
---------------------------------
@@ -59,6 +59,99 @@ Following ``Content Accuracy & Informativeness`` ensures that the response is bo
* - **9. Informativeness:**
- Is the answer a useful and informative reply to the problem?

Research Depth Assessment
---------------------------------

The following ``Research Depth Assessment`` rubrics quantify the mechanistic and analytical sophistication of synthesis outputs.


.. list-table::
:header-rows: 1
:widths: 20 80

* - Evaluation Rubric
- Description
* - **10. Mechanistic Understanding:**
- Does the answer show understanding of ecological processes, using indicators like “feedback,” “nutrient cycling,” or “trophic cascade”?
* - **11. Causal Reasoning:**
- Does the answer show clear cause-effect relationships using words like “because,” “results in,” or “drives”?
* - **12. Temporal Precision:**
- Does the answer include specific time references, like intervals (“within 6 months”) or dates (“1990–2020”)?

Research Breadth Assessment
---------------------------------

The following ``Research Breadth Assessment`` rubrics evaluate the diversity of evidence across spatial, ecological, and methodological contexts.


.. list-table::
:header-rows: 1
:widths: 20 80

* - Evaluation Rubric
- Description
* - **13. Geographic Coverage:**
- Does the answer cover multiple biogeographic zones, such as “Tropical” or “Boreal”?
* - **14. Intervention Diversity:**
- Does the answer include a variety of management practices?
* - **15. Biodiversity Dimensions:**
- Does the answer mention different aspects of biodiversity, like taxonomic, functional, phylogenetic, or spatial diversity?
* - **16. Ecosystem Services:**
- Does the answer include relevant ecosystem services, based on the Millennium Ecosystem Assessment vocabulary?
* - **17. Spatial Scale:**
- Does the answer specify the spatial scale, using terms like “local,” “regional,” or “continental” and area measures?

Scientific Rigor Assessment
---------------------------------

The following ``Scientific Rigor Assessment`` rubrics assess the evidentiary and methodological integrity of the synthesis.


.. list-table::
:header-rows: 1
:widths: 20 80

* - Evaluation Rubric
- Description
* - **18. Statistical Sophistication:**
- Does the answer use statistical methods or analyses, showing quantitative rigor and depth?
* - **19. Citation Practices:**
- Does the answer properly cite sources, using parenthetical or narrative citations (e.g., “(Smith et al., 2021)”)?
* - **20. Uncertainty Acknowledgment:**
- Does the answer explicitly mention limitations or uncertainty, using terms like “unknown,” “limited evidence,” or “unclear”?

Innovation Capacity Assessment
---------------------------------

The following ``Innovation Capacity Assessment`` rubrics evaluate the novelty of the synthesis.


.. list-table::
:header-rows: 1
:widths: 20 80

* - Evaluation Rubric
- Description
* - **21. Speculative Statements:**
- Does the answer include cautious or hypothetical statements, using words like “might,” “could,” or “hypothetical”?
* - **22. Novelty Indicators:**
- Does the answer highlight innovation using terms like “novel,” “pioneering,” or “emerging”?


Research Gap Assessment
---------------------------------

The following ``Research Gap Assessment`` rubric detects explicit acknowledgment of unanswered questions or understudied areas in the synthesis.


.. list-table::
:header-rows: 1
:widths: 20 80

* - Evaluation Rubric
- Description
* - **23. Gap Identification:**
- Does the answer point out unanswered questions or understudied areas, using terms like “research gap” or “understudied”?


Usage Example
@@ -68,9 +68,12 @@ Here is a simple example of how to import rubrics in your code:

.. code-block:: python

from yescieval import Informativeness, Correctness, Completeness,
Coherence, Relevancy, Integration,
Cohesion, Readability, Conciseness
from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy, \
Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
SpeculativeStatements, NoveltyIndicators

And to use rubrics:

13 changes: 9 additions & 4 deletions pyproject.toml
@@ -1,8 +1,7 @@
[tool.poetry]
name = "YESciEval"

version = "0.2.0"

version = "0.0.0"
description = "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering."
authors = ["Hamed Babaei Giglou <hamedbabaeigiglou@gmail.com>"]
license = "MIT License"
@@ -30,6 +29,12 @@ wheel = "*"
twine = "*"
pytest = "*"

[tool.poetry-dynamic-versioning]
enable = true
style = "semver"
source = "attr"
attr = "yescieval.__version__"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning>=1.4.0"]
build-backend = "poetry_dynamic_versioning.backend"
3 changes: 2 additions & 1 deletion setup.py
@@ -1,11 +1,12 @@
from setuptools import setup, find_packages
import os

with open("README.md", encoding="utf-8") as f:
long_description = f.read()

setup(
name="YESciEval",
version="0.2.0",
version=open(os.path.join(os.path.dirname(__file__), 'yescieval/VERSION')).read().strip(),
author="Hamed Babaei Giglou",
author_email="hamedbabaeigiglou@gmail.com",
description="YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering.",
1 change: 1 addition & 0 deletions yescieval/VERSION
@@ -0,0 +1 @@
0.3.0
9 changes: 7 additions & 2 deletions yescieval/__init__.py
@@ -1,9 +1,14 @@
from pathlib import Path

__version__ = "0.2.0"
__version__ = (Path(__file__).parent / "VERSION").read_text().strip()

from .base import Rubric, Parser
from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
Integration, Cohesion, Readability, Conciseness)
Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
SpeculativeStatements, NoveltyIndicators)
from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
from .parser import GPTParser

4 changes: 2 additions & 2 deletions yescieval/judge/judges.py
@@ -42,13 +42,13 @@ class AskAutoJudge(AutoJudge):
def from_pretrained(self, model_id:str="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
device:str="auto",
token:str =""):
return super()._from_pretrained(model_id=model_id, device=device, token=token)
self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)

class BioASQAutoJudge(AutoJudge):
def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B",
device: str = "auto",
token: str = ""):
return super()._from_pretrained(model_id=model_id, device=device, token=token)
self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)



11 changes: 10 additions & 1 deletion yescieval/rubric/__init__.py
@@ -1,7 +1,16 @@
from .informativeness import Informativeness, Correctness, Completeness
from .structural import Coherence, Relevancy, Integration
from .stylistic import Cohesion, Readability, Conciseness
from .breadth import GeographicCoverage, InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale
from .depth import MechanisticUnderstanding, CausalReasoning, TemporalPrecision
from .gap import GapIdentification
from .rigor import StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment
from .innovation import SpeculativeStatements, NoveltyIndicators

__all__ = ["Informativeness", "Correctness", "Completeness",
"Coherence", "Relevancy", "Integration",
"Cohesion", "Readability", "Conciseness"]
"Cohesion", "Readability", "Conciseness", "GeographicCoverage",
"InterventionDiversity", "BiodiversityDimensions", "EcosystemServices",
"SpatialScale", "MechanisticUnderstanding", "CausalReasoning", "TemporalPrecision",
"GapIdentification", "StatisticalSophistication", "CitationPractices",
"UncertaintyAcknowledgment", "SpeculativeStatements", "NoveltyIndicators"]