+Large Language Models (LLMs) have become pivotal in powering scientific question-answering across modern search engines, yet their evaluation robustness remains largely underexplored. To address this gap, we introduce **YESciEval**, an open-source framework that leverages fine-grained rubric-based assessments combined with reinforcement learning to reduce optimism bias in LLM evaluators.
-
+YESciEval provides a comprehensive library for evaluating the quality of synthesized scientific answers using predefined rubrics and sophisticated LLM-based judgment models. This framework enables you to assess answers on key criteria by utilizing pretrained judges and parsing LLM outputs into structured JSON formats for detailed analysis.
+
+
+## 🧪 Installation
+You can install ``YESciEval`` from PyPI using pip:
+
+```bash
+pip install yescieval
+```
+Next, verify the installation:
+```python
+import yescieval
+
+print(yescieval.__version__)
+```
+
+## Essential Resources
+
+YESciEval provides the following specialized judges:
+
+| Judge | Domain | Dataset Used | 🤗 Hugging Face |
+|----------------|------------------------------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
+| **Ask Judge** | Multidisciplinary (33 disciplines) | [ORKGSyn (Open Research Knowledge Graph)](https://data.uni-hannover.de/dataset/yescieval-corpus) | [SciKnowOrg/YESciEval-ASK-Llama-3.1-8B](https://huggingface.co/SciKnowOrg/YESciEval-ASK-Llama-3.1-8B) |
+| **BioASQ Judge**| Biomedical | [BioASQ](https://data.uni-hannover.de/dataset/yescieval-corpus) | [SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B](https://huggingface.co/SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B) |
+
+
+For further information, dive into YESciEval's extensive documentation to explore its models and usage at **[YESciEval Documentation](https://yescieval.readthedocs.io/)**.
+## Quick Tour
+
+Get started with YESciEval in just a few lines of code. This guide demonstrates how to initialize the inputs, load a judge, and instantiate a rubric to evaluate an answer.
+
+
+```python
+from yescieval import Readability, AutoJudge
+
+# Sample papers
+papers = {
+ "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
+ "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
+ "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
+ "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
+ "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
+}
+
+# Question and synthesized answer
+question = "How is AI used in modern healthcare systems?"
+answer = (
+ "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
+ "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
+)
+
+# Step 1: Create a rubric
+rubric = Readability(papers=papers, question=question, answer=answer)
+
+# Step 2: Load a judge model (Ask Judge by default)
+judge = AutoJudge()
+judge.from_pretrained(
+ model_id="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
+ token="your_huggingface_token",
+)
+
+# Step 3: Evaluate the answer
+result = judge.evaluate(rubric=rubric)
+print("Raw Evaluation Output:")
+print(result)
+```
+
+Judges within YESciEval are defined as follows:
+| Class Name | Description |
+| ---------------- |----------------------------------------------------------------------------------------------|
+| `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
+| `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
+| `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
+| `CustomAutoJudge`| Custom LLM that can be used as a judge within YESciEval rubrics |
-## ๐ What is the YESciEval?
+A total of nine evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. The following simple example shows how to import rubrics in your code:
+```python
+from yescieval import (
+    Informativeness, Correctness, Completeness,
+    Coherence, Relevancy, Integration,
+    Cohesion, Readability, Conciseness,
+)
+```
-Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce **YESciEval**, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. The framework is presented as f ollows:
+A complete list of rubrics is available on the YESciEval [Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
+## 💡 Acknowledgements
-We release multidisciplinary scienceQ&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
+If you use YESciEval in your research, please cite:
-## ๐ License
+```bibtex
+@article{d2025yescieval,
+ title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
+ author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
+ journal={arXiv preprint arXiv:2505.14279},
+ year={2025}
+ }
+```
 This work is licensed under the [MIT License](https://opensource.org/licenses/MIT).
diff --git a/docs/Makefile b/docs/Makefile
new file mode 100644
index 0000000..d0c3cbf
--- /dev/null
+++ b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = source
+BUILDDIR = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/make.bat b/docs/make.bat
new file mode 100644
index 0000000..9534b01
--- /dev/null
+++ b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+ set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=source
+set BUILDDIR=build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+ echo.
+ echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+ echo.installed, then set the SPHINXBUILD environment variable to point
+ echo.to the full path of the 'sphinx-build' executable. Alternatively you
+ echo.may add the Sphinx directory to PATH.
+ echo.
+ echo.If you don't have Sphinx installed, grab it from
+ echo.http://sphinx-doc.org/
+ exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 0000000..d02db8b
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1,13 @@
+sphinx
+sphinx-rtd-theme
+sphinx_autodoc_typehints
+myst-parser
+sphinx_markdown_tables
+sphinx-copybutton
+sphinxcontrib-mermaid
+sphinx-panels
+sphinx-design
+sphinx-tabs
+sphinx-inline-tabs
+snowballstemmer
+sphinx_toolbox
diff --git a/docs/source/_static/custom.css b/docs/source/_static/custom.css
new file mode 100644
index 0000000..e69de29
diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js
new file mode 100644
index 0000000..e69de29
diff --git a/docs/source/_templates/layout.html b/docs/source/_templates/layout.html
new file mode 100644
index 0000000..72abf2e
--- /dev/null
+++ b/docs/source/_templates/layout.html
@@ -0,0 +1,9 @@
+{% extends "!layout.html" %}
+{% block extrahead %}
+
+{% endblock %}
+
+{# Override breadcrumbs with our custom template #}
+{% block breadcrumbs %}
+ {% include "breadcrumbs.html" %}
+{% endblock %}
diff --git a/docs/source/conf.py b/docs/source/conf.py
new file mode 100644
index 0000000..259b318
--- /dev/null
+++ b/docs/source/conf.py
@@ -0,0 +1,153 @@
+# Configuration file for the Sphinx documentation builder.
+# import pathlib
+# import sys
+import datetime
+import importlib
+import inspect
+import os
+
+
+from sphinx.application import Sphinx
+from sphinx.writers.html5 import HTML5Translator
+import posixpath
+
+year = str(datetime.datetime.now().year)
+project = 'YESciEval'
+copyright = year + ' SciKnowOrg'
+release = '0.1.0'
+
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ "sphinx.ext.napoleon",
+ "sphinx.ext.autodoc",
+ "myst_parser",
+ "sphinx_markdown_tables",
+ "sphinx_copybutton",
+ "sphinx.ext.intersphinx",
+ "sphinx.ext.linkcode",
+ "sphinx_inline_tabs",
+ "sphinxcontrib.mermaid",
+ "sphinx_toolbox.collapse",
+]
+
+# autosummary_generate = True # Turn on sphinx.ext.autosummary
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ["_templates"]
+
+# List of patterns, relative to source directory, that match files and
+# directories to include when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+include_patterns = [
+ "**",
+ "../../yescieval/",
+ "index.rst",
+]
+# Ensure exclude_patterns doesn't exclude your master document accidentally
+exclude_patterns = []
+
+# -- Options for HTML output -------------------------------------------------
+
+source_suffix = '.rst'
+
+# Specify the master doc; otherwise the build on Read the Docs fails.
+master_doc = "index"
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+html_theme = "sphinx_rtd_theme"
+
+html_theme_options = {
+ "external_links": [
+ ("Github", "https://github.com/sciknoworg/YESciEval"),
+ ],
+ "navigation_depth": 4,
+ "collapse_navigation": True
+}
+
+html_static_path = ["_static"]
+
+html_js_files = [
+ 'https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js',
+ 'custom.js'
+]
+
+html_css_files = [
+ # 'https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css',
+ 'custom.css',
+]
+
+html_show_sourcelink = True
+html_context = {
+ "display_github": True,
+ "github_user": "sciknoworg",
+ "github_repo": "YESciEval",
+ "github_version": "main/",
+}
+
+html_logo = 'images/logo.png'
+html_favicon = "images/logo.ico"
+autoclass_content = "both"
+
+# Required to get rid of some myst.xref_missing warnings
+myst_heading_anchors = 3
+
+html_copy_source = True
+def linkcode_resolve(domain, info):
+    """
+    Resolve a GitHub link for the given domain and info dictionary.
+    """
+    if domain != "py" or not info["module"]:
+        return None
+
+    # Define the GitHub repository URL (the branch is appended when the link is built)
+    repo_url = "https://github.com/sciknoworg/YESciEval"
+    branch = "main"  # Update if using a different branch
+
+    # Retrieve the module and object
+    try:
+        module = importlib.import_module(info["module"])
+    except ImportError:
+        return None
+
+    # Try to get the source file and line numbers
+    try:
+        file_path = inspect.getsourcefile(module)
+        source_lines, start_line = inspect.getsourcelines(getattr(module, info["fullname"]))
+    except (TypeError, AttributeError, OSError):
+        return None
+
+    # Generate the file path relative to the repository root and the GitHub link
+    repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
+    relative_path = os.path.relpath(file_path, start=repo_root).replace(os.sep, "/")
+    end_line = start_line + len(source_lines) - 1
+    return f"{repo_url}/blob/{branch}/{relative_path}#L{start_line}-L{end_line}"
+
+def visit_download_reference(self, node):
+    root = "https://github.com/sciknoworg/YESciEval/tree/main"
+    atts = {"class": "reference download", "download": ""}
+
+    if not self.builder.download_support:
+        self.context.append("")
+    elif "refuri" in node:
+        atts["class"] += " external"
+        atts["href"] = node["refuri"]
+        self.body.append(self.starttag(node, "a", "", **atts))
+        self.context.append("")
+    elif "reftarget" in node and "refdoc" in node:
+        atts["class"] += " external"
+        atts["href"] = posixpath.join(root, os.path.dirname(node["refdoc"]), node["reftarget"])
+        self.body.append(self.starttag(node, "a", "", **atts))
+        self.context.append("")
+    else:
+        self.context.append("")
+
+
+HTML5Translator.visit_download_reference = visit_download_reference
+
+def setup(app: Sphinx):
+    pass
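The URL assembly performed by `linkcode_resolve` in `docs/source/conf.py` can be exercised in isolation. The following is a minimal sketch, assuming the repository root is known and the source file lives inside it; the helper name `build_source_url` and the example paths are hypothetical and are not part of the patch.

```python
import posixpath


def build_source_url(repo_url, branch, repo_root, file_path, start_line, n_lines):
    """Build a GitHub link to a line range (hypothetical helper, not in conf.py)."""
    # Path of the file relative to the repository root, with POSIX separators.
    relative_path = posixpath.relpath(file_path, start=repo_root)
    # The range is inclusive, so the last line is start + length - 1.
    end_line = start_line + n_lines - 1
    return f"{repo_url}/blob/{branch}/{relative_path}#L{start_line}-L{end_line}"


url = build_source_url(
    repo_url="https://github.com/sciknoworg/YESciEval",
    branch="main",
    repo_root="/work/YESciEval",  # assumed checkout location
    file_path="/work/YESciEval/yescieval/example.py",  # assumed file name
    start_line=10,
    n_lines=5,
)
print(url)
```

Keeping the branch out of `repo_url` means `/blob/<branch>/` appears exactly once in the generated link.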
diff --git a/docs/source/images/logo.ico b/docs/source/images/logo.ico
new file mode 100644
index 0000000..314efa1
Binary files /dev/null and b/docs/source/images/logo.ico differ
diff --git a/docs/source/images/logo.png b/docs/source/images/logo.png
new file mode 100644
index 0000000..d2aba61
Binary files /dev/null and b/docs/source/images/logo.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
new file mode 100644
index 0000000..b2204df
--- /dev/null
+++ b/docs/source/index.rst
@@ -0,0 +1,70 @@
+
+
+.. raw:: html
+
+
+
+
+
+.. raw:: html
+
+
+
+
+
+
+
+
+
+
+
+
+YESciEval provides a comprehensive library for evaluating the quality of synthesized scientific answers using predefined rubrics and sophisticated LLM-based judgment models. This framework enables you to assess answers on key criteria by utilizing pretrained judges and parsing LLM outputs into structured JSON formats for detailed analysis.
+
+YESciEval was created by `Scientific Knowledge Organization (SciKnowOrg group) `_ at `Technische Informationsbibliothek (TIB) `_. Don't hesitate to open an issue on the `YESciEval repository `_ if something is broken or if you have further questions.
+
+.. seealso::
+
+    See the `Quickstart `_ for a brief introduction to using YESciEval.
+
+
+
+If you find this repository helpful, feel free to cite our publication `YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering `_:
+
+ .. code-block:: bibtex
+
+ @article{d2025yescieval,
+ title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
+ author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
+ journal={arXiv preprint arXiv:2505.14279},
+ year={2025}
+ }
+
+
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Getting Started
+ :hidden:
+
+ installation
+ quickstart
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Evaluator
+ :hidden:
+
+ rubrics
+ judges
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Package Reference
+ :glob:
+ :hidden:
+
+ package_reference/base
+ package_reference/judge
+ package_reference/rubric
+ package_reference/
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
new file mode 100644
index 0000000..14c91d4
--- /dev/null
+++ b/docs/source/installation.rst
@@ -0,0 +1,63 @@
+Installation
+=============
+
+We recommend **Python 3.10+**, `PyTorch 1.4.0+ `_, and `transformers v4.41.0+ `_.
+
+
+Install with pip
+-----------------------
+
+.. sidebar:: Verify the installation
+
+    Once the installation is done, verify it by running:
+
+    .. code-block:: python
+
+        import yescieval
+
+        print(yescieval.__version__)
+
+
+.. tab:: From PyPI
+
+    YESciEval is available on the Python Package Index at `pypi.org `_ for installation.
+    ::
+
+        pip install -U yescieval
+
+.. tab:: From GitHub
+
+    The following command installs the latest version of YESciEval from the ``main`` branch of the YESciEval repository on GitHub using ``pip``.
+
+    ::
+
+        pip install git+https://github.com/sciknoworg/YESciEval.git
+
+
+Install from Source
+----------------------
+You can install YESciEval directly from source to take advantage of the bleeding edge main branch for development.
+
+
+1. Clone the repository:
+
+.. code-block:: bash
+
+ git clone https://github.com/sciknoworg/YESciEval.git
+ cd YESciEval
+
+2. (Optional but recommended) Create and activate a virtual environment:
+
+.. code-block:: bash
+
+ python -m venv venv
+ source venv/bin/activate # On Windows: venv\Scripts\activate
+
+3. Install dependencies and the library
+
+.. code-block:: bash
+
+ pip install -e .
+
+.. hint:: The ``-e`` flag installs the package in editable mode, which is ideal for development: changes in the code take effect immediately.
+
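The verification step in `installation.rst` imports the package directly. As a complementary check, here is a minimal sketch that queries the installed distribution metadata using only the standard library, so it reports a missing package instead of raising. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution is published under the name `yescieval`, as in the `pip install` command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("yescieval")
if version is None:
    print("yescieval is not installed in this environment")
else:
    print(f"yescieval {version} is installed")
```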
diff --git a/docs/source/judges.rst b/docs/source/judges.rst
new file mode 100644
index 0000000..0eb37af
--- /dev/null
+++ b/docs/source/judges.rst
@@ -0,0 +1,91 @@
+Judges
+================
+
+YESciEval provides two pre-trained judge models designed to evaluate scientific text syntheses based on different domains and datasets:
+
+- **Ask Judge**: A multidisciplinary YESciEval judge fine-tuned on the ORKGSyn dataset from the Open Research Knowledge Graph.
+
+- **BioASQ Judge**: A biomedical YESciEval judge fine-tuned on the BioASQ dataset from the BioASQ challenge.
+
+.. hint:: Available YESciEval judges on 🤗 Hugging Face:
+
+ - `Ask Judge on Hugging Face `_
+ - `BioASQ Judge on Hugging Face `_
+
+
+Using YESciEval Judges
+------------------------
+
+The following example demonstrates how to create an evaluation rubric, load a judge model, and evaluate an answer.
+
+.. code-block:: python
+
+ from yescieval import Readability, AutoJudge
+
+ papers = {
+ "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
+ "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
+ "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
+ "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
+ "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
+ }
+
+ # Input question and synthesized answer
+ question = "How is AI used in modern healthcare systems?"
+ answer = (
+ "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
+ "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
+ )
+
+ # Step 1: Create a rubric
+ rubric = Readability(papers=papers, question=question, answer=answer)
+ instruction_prompt = rubric.instruct()
+
+ # Step 2: Load the evaluation model (judge)
+ judge = AutoJudge()
+ judge.from_pretrained(model_id="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
+ token="your_huggingface_token",
+ device="cpu")
+
+ # Step 3: Evaluate the answer
+ result = judge.evaluate(rubric=rubric)
+
+ print("Raw Evaluation Output:")
+ print(result)
+
+Specialized Judges vs. Custom Models
+--------------------------------------
+
+.. list-table:: Judge Class Overview
+    :header-rows: 1
+
+    * - Class Name
+      - Description
+    * - AutoJudge
+      - Base class for loading and running evaluation models (judges) with PEFT adapters.
+    * - AskAutoJudge
+      - Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph.
+    * - BioASQAutoJudge
+      - Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge.
+
+The difference between **AskAutoJudge** and **BioASQAutoJudge** compared to **AutoJudge** is that these specialized judges have their own predefined model paths on Hugging Face, making it easier to load the respective domain-specific models.
+
+Custom Judge
+--------------------
+
+The ``CustomAutoJudge`` class provides the flexibility to load any compatible LLM from Hugging Face by specifying its model ID. This lets you use any pre-trained or fine-tuned model within YESciEval, beyond the default specialized judges.
+
+For example, you can load a model and evaluate a rubric like this:
+
+.. code-block:: python
+
+    from yescieval import CustomAutoJudge
+
+    # Initialize and load a custom model by specifying its Hugging Face model ID
+    judge = CustomAutoJudge()
+    judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
+
+    # Evaluate the rubric (created as in the previous example) using the loaded model
+    result = judge.evaluate(rubric=rubric)
+
+    print(result)
+
+This approach gives you full control over which model is used for evaluation, provided it is a compatible Hugging Face LLM.
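The judge examples in `judges.rst` hard-code `device="cpu"`. A minimal sketch of choosing the device at runtime is shown below, assuming PyTorch may or may not be installed; the helper name `pick_device` is hypothetical and is not part of YESciEval.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional here; fall back to CPU if it is missing
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string could then be passed as the `device` argument of `from_pretrained`, as in the examples above.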
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
new file mode 100644
index 0000000..f266e6d
--- /dev/null
+++ b/docs/source/quickstart.rst
@@ -0,0 +1,87 @@
+Quickstart
+=================
+
+YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** using a pretrained judge and parse LLM output into structured JSON.
+
+
+**Example: Evaluating an Answer Using Informativeness + AskAutoJudge**
+
+.. code-block:: python
+
+ from yescieval import Informativeness, AskAutoJudge, GPTParser
+
+ # Sample papers used in form of {"title": "abstract", ... }
+ papers = {
+ "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
+ "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
+ "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
+ "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
+ "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
+ }
+
+ # Input question and synthesized answer
+ question = "How is AI used in modern healthcare systems?"
+ answer = (
+ "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
+ "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
+ )
+
+ # Step 1: Create a rubric
+ rubric = Informativeness(papers=papers, question=question, answer=answer)
+ instruction_prompt = rubric.instruct()
+
+ # Step 2: Load the evaluation model (judge)
+ judge = AskAutoJudge()
+ judge.from_pretrained(token="your_huggingface_token", device="cpu")
+
+ # Step 3: Evaluate the answer
+ result = judge.evaluate(rubric=rubric)
+
+ print("Raw Evaluation Output:")
+ print(result)
+
+.. tip::
+
+    - Ensure your Hugging Face token has access to the model (e.g., ``YESciEval-ASK-Llama-3.1-8B``).
+    - Use ``device="cuda"`` when running on a GPU for better performance.
+    - Add more rubrics such as ``Correctness``, ``Relevancy``, etc. for multi-criteria evaluation.
+
+**Parsing Raw Output with GPTParser**
+
+If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.
+
+.. code-block:: python
+
+ from yescieval import GPTParser
+
+ raw_output = "` {rating: `4`, rational: The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine.} `"
+
+ parser = GPTParser(openai_key="your_openai_key")
+
+ parsed = parser.parse(raw_output=raw_output)
+
+ print("Parsed Output:")
+ print(parsed)
+
+**Expected Output Format**
+
+.. code-block:: json
+
+ {
+ "rating": 4,
+ "rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
+ }
+
+.. hint:: Key Components
+
+ +------------------+-------------------------------------------------------+
+ | Component | Purpose |
+ +==================+=======================================================+
+ | Informativeness | Defines rubric to evaluate relevance to source papers |
+ +------------------+-------------------------------------------------------+
+ | AskAutoJudge | Loads and uses a judgment model to evaluate answers |
+ +------------------+-------------------------------------------------------+
+ | GPTParser | Parses loosely formatted text from LLMs into JSON |
+ +------------------+-------------------------------------------------------+
+
+
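The quickstart relies on `GPTParser` to turn loosely formatted judge output into JSON. For cases where the output is only mildly malformed, a local fallback can be sketched with regular expressions. This is an illustration under stated assumptions, not how `GPTParser` works: the function name `parse_loose_output` is hypothetical, and it assumes the text contains a `rating` followed by an integer and a `rational`/`rationale` field terminated by a closing brace.

```python
import json
import re


def parse_loose_output(raw_output):
    """Extract a rating and rationale from loosely formatted judge output."""
    # First integer after the word "rating", skipping colons, quotes, backticks.
    rating_match = re.search(r"rating\W*(\d+)", raw_output, flags=re.IGNORECASE)
    # Text after "rational:" or "rationale:", up to the first closing brace.
    rationale_match = re.search(
        r"rationale?\W*?:\s*(.*?)\s*\}", raw_output, flags=re.IGNORECASE | re.DOTALL
    )
    if rating_match is None or rationale_match is None:
        raise ValueError("could not find both a rating and a rationale")
    return {
        "rating": int(rating_match.group(1)),
        "rationale": rationale_match.group(1).strip("`'\" "),
    }


raw_output = (
    "` {rating: `4`, rational: The answer covers key aspects of how AI is applied "
    "in healthcare, such as diagnostics and personalized medicine.} `"
)
parsed = parse_loose_output(raw_output)
print(json.dumps(parsed, indent=2))
```

A rationale that itself contains a closing brace would be cut short by this pattern, which is one reason an LLM-based parser can be the more robust choice.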
diff --git a/docs/source/rubrics.rst b/docs/source/rubrics.rst
new file mode 100644
index 0000000..3e78f14
--- /dev/null
+++ b/docs/source/rubrics.rst
@@ -0,0 +1,94 @@
+
+Rubrics
+===================
+
+A total of nine evaluation rubrics were defined as part of the YESciEval test framework.
+
+Linguistic & Stylistic Quality
+---------------------------------
+
+The ``Linguistic & Stylistic Quality`` group concerns grammar, clarity, and adherence to academic writing conventions.
+
+
+.. list-table::
+    :header-rows: 1
+    :widths: 20 80
+
+    * - Evaluation Rubric
+      - Description
+    * - **1. Cohesion:**
+      - Are the sentences connected appropriately to make the resulting synthesis cohesive?
+    * - **2. Conciseness:**
+      - Is the answer short and clear, without redundant statements?
+    * - **3. Readability:**
+      - Does the answer follow appropriate style and structure conventions for academic writing, particularly for readability?
+
+Logical & Structural Integrity
+---------------------------------
+The ``Logical & Structural Integrity`` group focuses on the reasoning and organization of information.
+
+.. list-table::
+    :header-rows: 1
+    :widths: 20 80
+
+    * - Evaluation Rubric
+      - Description
+    * - **4. Coherence:**
+      - Are the ideas connected soundly and logically?
+    * - **5. Integration:**
+      - Are the sources structurally and linguistically well-integrated, using appropriate markers of provenance/quotation and logical connectors for each reference?
+    * - **6. Relevancy:**
+      - Is the information in the answer relevant to the problem?
+
+Content Accuracy & Informativeness
+------------------------------------
+
+The ``Content Accuracy & Informativeness`` group ensures that the response is both correct and useful.
+
+
+.. list-table::
+    :header-rows: 1
+    :widths: 20 80
+
+    * - Evaluation Rubric
+      - Description
+    * - **7. Correctness:**
+      - Is the information in the answer a correct representation of the content of the provided abstracts?
+    * - **8. Completeness:**
+      - Is the answer a comprehensive encapsulation of the relevant information in the provided abstracts?
+    * - **9. Informativeness:**
+      - Is the answer a useful and informative reply to the problem?
+
+
+
+Usage Example
+--------------------------
+
+Here is a simple example of how to import rubrics in your code:
+
+.. code-block:: python
+
+    from yescieval import (
+        Informativeness, Correctness, Completeness,
+        Coherence, Relevancy, Integration,
+        Cohesion, Readability, Conciseness,
+    )
+
+And to use rubrics:
+
+.. code-block:: python
+
+ # Example inputs
+ papers = {
+ "Paper 1 title": "abstract of paper 1 ...",
+ "Paper 2 title": "abstract of paper 2 ...",
+ "Paper 3 title": "abstract of paper 3 ...",
+ "Paper 4 title": "abstract of paper 4 ...",
+ "Paper 5 title": "abstract of paper 5 ..."
+ }
+ question = "What are the key findings on AI in these papers?"
+ answer = "The synthesis answer summarizing the papers."
+
+ # Instantiate a rubric, e.g. Coherence
+ rubric = Coherence(papers=papers, question=question, answer=answer)
+ instruction = rubric.instruct()
+
+ print(instruction)
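The nine rubrics in `rubrics.rst` fall into three groups, so per-rubric ratings can be summarized per group. The sketch below shows one way to do that, assuming each rubric yields a single integer rating; the `ratings` values are made up for illustration, and `group_means` is a hypothetical helper that is not part of YESciEval.

```python
from statistics import mean

# Rubric groups as listed in rubrics.rst.
RUBRIC_GROUPS = {
    "Linguistic & Stylistic Quality": ["Cohesion", "Conciseness", "Readability"],
    "Logical & Structural Integrity": ["Coherence", "Integration", "Relevancy"],
    "Content Accuracy & Informativeness": ["Correctness", "Completeness", "Informativeness"],
}

# Hypothetical ratings, one per rubric (illustrative values only).
ratings = {
    "Cohesion": 4, "Conciseness": 5, "Readability": 3,
    "Coherence": 4, "Integration": 2, "Relevancy": 3,
    "Correctness": 5, "Completeness": 4, "Informativeness": 3,
}


def group_means(ratings, groups=RUBRIC_GROUPS):
    """Average the ratings of the rubrics in each group."""
    return {
        group: mean(ratings[name] for name in names)
        for group, names in groups.items()
    }


summary = group_means(ratings)
for group, value in summary.items():
    print(f"{group}: {value}")
```

Reporting per group keeps a weak area, such as source integration, visible instead of hiding it in a single overall average.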
diff --git a/experiments/images/confusion_matrix_vanilla_v2.pdf b/experiments/images/confusion_matrix_vanilla_v2.pdf
index 494fcbb..71220ca 100644
Binary files a/experiments/images/confusion_matrix_vanilla_v2.pdf and b/experiments/images/confusion_matrix_vanilla_v2.pdf differ
diff --git a/experiments/images/confusion_matrix_vanilla_v2.png b/experiments/images/confusion_matrix_vanilla_v2.png
index ad24d79..0ea7431 100644
Binary files a/experiments/images/confusion_matrix_vanilla_v2.png and b/experiments/images/confusion_matrix_vanilla_v2.png differ
diff --git a/experiments/notebooks/confusion-plots.ipynb b/experiments/notebooks/confusion-plots.ipynb
deleted file mode 100644
index c1d0670..0000000
--- a/experiments/notebooks/confusion-plots.ipynb
+++ /dev/null
@@ -1,107 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 89,
- "id": "ea140555-bcc6-46ab-842c-9a77af2e7cfb",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "import numpy as np\n",
- "import seaborn as sns\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "# Data from the tables\n",
- "orkgsyn_data = np.array([\n",
- " [4.85, 4.71, 4.69, 4.80],\n",
- " [4.92, 4.89, 4.87, 4.88],\n",
- " [4.81, 4.77, 4.75, 4.78],\n",
- " [4.84, 4.75, 4.72, 4.81]\n",
- "])\n",
- "\n",
- "bioasq_data = np.array([\n",
- " [4.82, 3.39, 4.82, 4.82],\n",
- " [4.85, 4.48, 4.80, 4.84],\n",
- " [4.83, 4.72, 4.80, 4.78],\n",
- " [4.82, 3.54, 4.65, 4.78]\n",
- "])\n",
- "\n",
- "# Labels for the heatmaps\n",
- "models = [\"Qwen2.5-72B\", \"LLaMA-3.1-72B\", \"LLaMA-3.1-8B\", \"Mistral-Large\"]\n",
- "\n",
- "# Create subplots\n",
- "fig, axes = plt.subplots(2, 1, figsize=(4, 6))\n",
- "\n",
- "# ORKGSyn Confusion Matrix\n",
- "sns.heatmap(orkgsyn_data, annot=True, fmt=\".2f\", cmap=\"Blues\", xticklabels=models, yticklabels=models, ax=axes[0], annot_kws={\"size\": 7}, cbar=False)\n",
- "axes[0].set_title(\"ORKG-Synthesis\", fontsize=9)\n",
- "# axes[0].set_xlabel(\"Synthesizer\", fontsize=8)\n",
- "# axes[0].set_ylabel(\"Evaluator\", fontsize=8)\n",
- "axes[0].tick_params(axis='both', which='major', labelsize=6)\n",
- "axes[0].tick_params(axis='x', rotation=0)\n",
- "\n",
- "# BioASQ Confusion Matrix\n",
- "sns.heatmap(bioasq_data, annot=True, fmt=\".2f\", cmap=\"Blues\", xticklabels=models, yticklabels=models, ax=axes[1], annot_kws={\"size\": 7}, cbar=False)\n",
- "axes[1].set_title(\"BioASQ\", fontsize=9)\n",
- "# axes[1].set_xlabel(\"Synthesizer\", fontsize=8)\n",
- "# axes[1].set_ylabel(\"Evaluator\", fontsize=8)\n",
- "axes[1].tick_params(axis='both', which='major', labelsize=6)\n",
- "axes[1].tick_params(axis='x', rotation=0)\n",
- "\n",
- "# Adjust layout and show the plots\n",
- "plt.tight_layout()\n",
- "plt.savefig(\"images/confusion_matrix_vanilla_v2.pdf\", format=\"pdf\", bbox_inches=\"tight\")\n",
- "plt.savefig(\"images/confusion_matrix_vanilla_v2.png\", format=\"png\", dpi=300, bbox_inches=\"tight\")\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c33d8b53-0bc3-4d45-9ee0-729c9071f61b",
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "456bcc74-21db-467b-8901-57bdae1a784f",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.16"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/experiments/notebooks/plot-line-chart.ipynb b/experiments/notebooks/plot-line-chart.ipynb
deleted file mode 100644
index db62db3..0000000
--- a/experiments/notebooks/plot-line-chart.ipynb
+++ /dev/null
@@ -1,209 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "1cc0b872-8de2-45d6-86a9-28e2bc784d8d",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import json\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "from sciqaeval import config\n",
- "import math\n",
- "\n",
- "def load_data(file_path):\n",
- " with open(file_path, 'r') as f:\n",
- " return json.load(f)\n",
- "\n",
- "def compute_averages(data, criteria):\n",
- " averages = {eval_type: {criterion: [] for criterion in criteria} for eval_type in [\"original\", \"extreme\", \"subtle\"]}\n",
- " for entry in data:\n",
- " eval_type = entry[\"eval_type\"]\n",
- " quality = entry[\"quality\"]\n",
- " rating = entry[\"synthesis_evaluation_rating\"]\n",
- " if quality in averages[eval_type]:\n",
- " if not math.isnan(rating):\n",
- " averages[eval_type][quality].append(float(rating))\n",
- " for eval_type in averages:\n",
- " for quality in averages[eval_type]:\n",
- " if averages[eval_type][quality]:\n",
- " averages[eval_type][quality] = sum(averages[eval_type][quality])/len(averages[eval_type][quality])\n",
- " else:\n",
- " averages[eval_type][quality] = 0\n",
- " return averages"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "83aefc6d-2e78-4b34-9f05-5e4a5a9537a1",
- "metadata": {},
- "outputs": [],
- "source": [
- "criteria = config.criteria\n",
- "\n",
- "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_meta-llama-3.1-70b-instruct_refactored_dataset.json\")\n",
- "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_meta-llama-3.1-70b-instruct_refactored_dataset.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages = [[averages1, averages2, 'Vanilla LLaMA-3.1-70B', 'gainsboro']]\n",
- "\n",
- "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_mistral-large-instruct_refactored_dataset.json\")\n",
- "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_mistral-large-instruct_refactored_dataset.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'Vanilla Mistral-Large', 'gainsboro']]\n",
- "\n",
- "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_qwen2.5-72b-instruct_refactored_dataset.json\")\n",
- "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_qwen2.5-72b-instruct_refactored_dataset.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'Vanilla Qwen2.5-72B', 'gainsboro']]\n",
- "\n",
- "dataset1 = load_data(\"dataset/BioASQ/BioASQ-test-refactored-dataset.json\")\n",
- "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn-test-refactored-dataset.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'Vanilla LLaMA-3.1-8B', 'teal']]\n",
- "\n",
- "dataset1 = load_data(\"assets/sft-bioasq-org-test.json\")\n",
- "dataset2 = load_data(\"assets/sft-orkg-synthesis-org-test.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'SFT (benign)', 'orange']]\n",
- "\n",
- "dataset1 = load_data(\"assets/rlhf-bioasq-adv-test.json\")\n",
- "dataset2 = load_data(\"assets/rlhf-orkg-synthesis-adv-test.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'SFT (benign) + RL (adversarial)', 'tomato']]\n",
- "\n",
- "dataset1 = load_data(\"assets/rlhf-bioasq-adv-org-test.json\")\n",
- "dataset2 = load_data(\"assets/rlhf-orkg-synthesis-adv-org-test.json\")\n",
- "averages1 = compute_averages(dataset1, criteria)\n",
- "averages2 = compute_averages(dataset2, criteria)\n",
- "averages += [[averages1, averages2, 'SFT (benign) + RL (benign + adversarial)', 'yellowgreen']]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "4240a124-7f9c-47a3-a832-10252ba0e02b",
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_results(averages, criteria, dataset_name1, dataset_name2, font_size=12):\n",
- " criteria_abbreviations = {\n",
- " \"Coherence\": \"Cohr\",\n",
- " \"Cohesion\": \"Cohs\",\n",
- " \"Completeness\": \"Comp\",\n",
- " \"Conciseness\": \"Conc\",\n",
- " \"Correctness\": \"Corr\",\n",
- " \"Informativeness\": \"Info\",\n",
- " \"Integration\": \"Integ\",\n",
- " \"Readability\": \"Read\",\n",
- " \"Relevancy\": \"Relv\"\n",
- " }\n",
- " fig, axes = plt.subplots(2, 3, figsize=(10, 5), sharey=True) \n",
- " eval_types = [\"original\", \"extreme\", \"subtle\"]\n",
- " dataset_names = [dataset_name1, dataset_name2]\n",
- " legend_handles = [] \n",
- " legend_labels = [] \n",
- " for average in averages:\n",
- " avg = [average[0], average[1]] \n",
- " title = average[2]\n",
- " color = average[3]\n",
- " \n",
- " markers = {'Vanilla LLaMA-3.1-70B':'*', 'Vanilla Mistral-Large':'+', 'Vanilla Qwen2.5-72B':'^'}\n",
- "\n",
- " marker= markers.get(title, 'o')\n",
- " for i, dataset in enumerate(avg):\n",
- " for j, eval_type in enumerate(eval_types):\n",
- " ax = axes[i, j] \n",
- " line, = ax.plot(criteria, \n",
- " [dataset[eval_type][c] for c in criteria], \n",
- " marker=marker, \n",
- " linestyle='-', \n",
- " linewidth=0.9,\n",
- " color=color,\n",
- " markersize=font_size-5)\n",
- " ax.set_title(f\"{dataset_names[i]} - {eval_type.capitalize() if eval_type!='original' else 'Benign'}\", \n",
- " fontsize=font_size)\n",
- " ax.tick_params(axis='both', labelsize=font_size - 3)\n",
- " ax.grid(True, linestyle=\"--\", alpha=0.15)\n",
- " # ax.set_yticks([1, 2, 3, 4, 5])\n",
- " ax.set_xticklabels([criteria_abbreviations[crit] for crit in criteria], rotation=0)\n",
- " legend_handles.append(line)\n",
- " legend_labels.append(title)\n",
- " \n",
- " fig.legend(legend_handles, legend_labels, loc=\"upper center\", ncol=7, fontsize=font_size-3, bbox_to_anchor=(0.5, 0.97))\n",
- " plt.tight_layout(rect=[0, 0, 1, 0.95])\n",
- " plt.savefig(\"images/results_plot.pdf\", format=\"pdf\", bbox_inches=\"tight\")\n",
- " plt.savefig(\"images/results_plot.png\", format=\"png\", dpi=300, bbox_inches=\"tight\")\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "id": "8ac306d9-1c34-4cd6-94ab-b26499b1e45e",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/tmp/ipykernel_803013/3035001349.py:41: UserWarning: FixedFormatter should only be used together with FixedLocator\n",
- " ax.set_xticklabels([criteria_abbreviations[crit] for crit in criteria], rotation=0)\n"
- ]
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "plot_results(averages, criteria, \"BioASQ\", \"ORKGSynthesis\", font_size=8)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a052a3da-e40b-49be-bcc7-980ef375eb92",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.16"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/experiments/plots.ipynb b/experiments/plots.ipynb
new file mode 100644
index 0000000..472aa8d
--- /dev/null
+++ b/experiments/plots.ipynb
@@ -0,0 +1,258 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "id": "1be657cc-56bd-4d67-a55e-dc79f0cdeced",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Data from the tables\n",
+ "orkgsyn_data = np.array([\n",
+ " [4.85, 4.71, 4.69, 4.80],\n",
+ " [4.92, 4.89, 4.87, 4.88],\n",
+ " [4.81, 4.77, 4.75, 4.78],\n",
+ " [4.84, 4.75, 4.72, 4.81]\n",
+ "])\n",
+ "\n",
+ "bioasq_data = np.array([\n",
+ " [4.82, 3.39, 4.82, 4.82],\n",
+ " [4.85, 4.48, 4.80, 4.84],\n",
+ " [4.83, 4.72, 4.80, 4.78],\n",
+ " [4.82, 3.54, 4.65, 4.78]\n",
+ "])\n",
+ "\n",
+ "# Labels for the heatmaps\n",
+    "models = [\"Qwen2.5-72B\", \"LLaMA-3.1-70B\", \"LLaMA-3.1-8B\", \"Mistral-Large\"]\n",
+ "# models = [\"Q\", \"L72B\", \"L8B\", \"M\"]\n",
+ "# Create subplots\n",
+ "fig, axes = plt.subplots(2, 1, figsize=(6, 8))\n",
+ "\n",
+ "# ORKGSyn Confusion Matrix\n",
+ "sns.heatmap(orkgsyn_data, annot=True, fmt=\".2f\", cmap=\"Blues\", xticklabels=models, yticklabels=models, ax=axes[0], annot_kws={\"size\": 12}, cbar=False)\n",
+ "axes[0].set_title(\"ORKG-Synthesis\", fontsize=16)\n",
+ "# axes[0].set_xlabel(\"Synthesizer\", fontsize=8)\n",
+ "# axes[0].set_ylabel(\"Evaluator\", fontsize=8)\n",
+ "axes[0].tick_params(axis='both', which='major', labelsize=10)\n",
+ "axes[0].tick_params(axis='x', rotation=0)\n",
+ "\n",
+ "# BioASQ Confusion Matrix\n",
+ "sns.heatmap(bioasq_data, annot=True, fmt=\".2f\", cmap=\"Blues\", xticklabels=models, yticklabels=models, ax=axes[1], annot_kws={\"size\": 12}, cbar=False)\n",
+ "axes[1].set_title(\"BioASQ\", fontsize=16)\n",
+ "# axes[1].set_xlabel(\"Synthesizer\", fontsize=8)\n",
+ "# axes[1].set_ylabel(\"Evaluator\", fontsize=8)\n",
+ "axes[1].tick_params(axis='both', which='major', labelsize=10)\n",
+ "axes[1].tick_params(axis='x', rotation=0)\n",
+ "\n",
+ "# Adjust layout and show the plots\n",
+ "plt.tight_layout()\n",
+ "plt.savefig(\"images/confusion_matrix_vanilla_v2.pdf\", format=\"pdf\", bbox_inches=\"tight\")\n",
+ "plt.savefig(\"images/confusion_matrix_vanilla_v2.png\", format=\"png\", dpi=300, bbox_inches=\"tight\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "1cc0b872-8de2-45d6-86a9-28e2bc784d8d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import json\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from sciqaeval import config\n",
+ "import math\n",
+ "\n",
+ "def load_data(file_path):\n",
+ " with open(file_path, 'r') as f:\n",
+ " return json.load(f)\n",
+ "\n",
+ "def compute_averages(data, criteria):\n",
+ " averages = {eval_type: {criterion: [] for criterion in criteria} for eval_type in [\"original\", \"extreme\", \"subtle\"]}\n",
+ " for entry in data:\n",
+ " eval_type = entry[\"eval_type\"]\n",
+ " quality = entry[\"quality\"]\n",
+ " rating = entry[\"synthesis_evaluation_rating\"]\n",
+ " if quality in averages[eval_type]:\n",
+ " if not math.isnan(rating):\n",
+ " averages[eval_type][quality].append(float(rating))\n",
+ " for eval_type in averages:\n",
+ " for quality in averages[eval_type]:\n",
+ " if averages[eval_type][quality]:\n",
+ " averages[eval_type][quality] = sum(averages[eval_type][quality])/len(averages[eval_type][quality])\n",
+ " else:\n",
+ " averages[eval_type][quality] = 0\n",
+ " return averages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "83aefc6d-2e78-4b34-9f05-5e4a5a9537a1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "criteria = config.criteria\n",
+ "\n",
+ "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_meta-llama-3.1-70b-instruct_refactored_dataset.json\")\n",
+ "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_meta-llama-3.1-70b-instruct_refactored_dataset.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages = [[averages1, averages2, 'Vanilla LLaMA-3.1-70B', 'gainsboro']]\n",
+ "\n",
+ "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_mistral-large-instruct_refactored_dataset.json\")\n",
+ "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_mistral-large-instruct_refactored_dataset.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'Vanilla Mistral-Large', 'gainsboro']]\n",
+ "\n",
+ "dataset1 = load_data(\"dataset/BioASQ/BioASQ_test_qwen2.5-72b-instruct_refactored_dataset.json\")\n",
+ "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn_test_qwen2.5-72b-instruct_refactored_dataset.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'Vanilla Qwen2.5-72B', 'gainsboro']]\n",
+ "\n",
+ "dataset1 = load_data(\"dataset/BioASQ/BioASQ-test-refactored-dataset.json\")\n",
+ "dataset2 = load_data(\"dataset/ORKG-Synthesis/llm4syn-test-refactored-dataset.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'Vanilla LLaMA-3.1-8B', 'teal']]\n",
+ "\n",
+ "dataset1 = load_data(\"assets/sft-bioasq-org-test.json\")\n",
+ "dataset2 = load_data(\"assets/sft-orkg-synthesis-org-test.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'SFT (benign)', 'orange']]\n",
+ "\n",
+ "dataset1 = load_data(\"assets/rlhf-bioasq-adv-test.json\")\n",
+ "dataset2 = load_data(\"assets/rlhf-orkg-synthesis-adv-test.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'SFT (benign) + RL (adversarial)', 'tomato']]\n",
+ "\n",
+ "dataset1 = load_data(\"assets/rlhf-bioasq-adv-org-test.json\")\n",
+ "dataset2 = load_data(\"assets/rlhf-orkg-synthesis-adv-org-test.json\")\n",
+ "averages1 = compute_averages(dataset1, criteria)\n",
+ "averages2 = compute_averages(dataset2, criteria)\n",
+ "averages += [[averages1, averages2, 'SFT (benign) + RL (benign + adversarial)', 'yellowgreen']]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "4240a124-7f9c-47a3-a832-10252ba0e02b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def plot_results(averages, criteria, dataset_name1, dataset_name2, font_size=12):\n",
+ " criteria_abbreviations = {\n",
+ " \"Coherence\": \"Cohr\",\n",
+ " \"Cohesion\": \"Cohs\",\n",
+ " \"Completeness\": \"Comp\",\n",
+ " \"Conciseness\": \"Conc\",\n",
+ " \"Correctness\": \"Corr\",\n",
+ " \"Informativeness\": \"Info\",\n",
+ " \"Integration\": \"Integ\",\n",
+ " \"Readability\": \"Read\",\n",
+ " \"Relevancy\": \"Relv\"\n",
+ " }\n",
+ " fig, axes = plt.subplots(2, 3, figsize=(10, 5), sharey=True) \n",
+ " eval_types = [\"original\", \"extreme\", \"subtle\"]\n",
+ " dataset_names = [dataset_name1, dataset_name2]\n",
+ " legend_handles = [] \n",
+ " legend_labels = [] \n",
+ " for average in averages:\n",
+ " avg = [average[0], average[1]] \n",
+ " title = average[2]\n",
+ " color = average[3]\n",
+ " \n",
+ " markers = {'Vanilla LLaMA-3.1-70B':'*', 'Vanilla Mistral-Large':'+', 'Vanilla Qwen2.5-72B':'^'}\n",
+ "\n",
+ " marker= markers.get(title, 'o')\n",
+ " for i, dataset in enumerate(avg):\n",
+ " for j, eval_type in enumerate(eval_types):\n",
+ " ax = axes[i, j] \n",
+ " line, = ax.plot(criteria, \n",
+ " [dataset[eval_type][c] for c in criteria], \n",
+ " marker=marker, \n",
+ " linestyle='-', \n",
+ " linewidth=0.9,\n",
+ " color=color,\n",
+ " markersize=font_size-5)\n",
+ " ax.set_title(f\"{dataset_names[i]} - {eval_type.capitalize() if eval_type!='original' else 'Benign'}\", \n",
+ " fontsize=font_size)\n",
+ " ax.tick_params(axis='both', labelsize=font_size - 3)\n",
+ " ax.grid(True, linestyle=\"--\", alpha=0.15)\n",
+ " # ax.set_yticks([1, 2, 3, 4, 5])\n",
+ " ax.set_xticklabels([criteria_abbreviations[crit] for crit in criteria], rotation=0)\n",
+ " legend_handles.append(line)\n",
+ " legend_labels.append(title)\n",
+ " \n",
+ " fig.legend(legend_handles, legend_labels, loc=\"upper center\", ncol=7, fontsize=font_size-3, bbox_to_anchor=(0.5, 0.97))\n",
+ " plt.tight_layout(rect=[0, 0, 1, 0.95])\n",
+ " plt.savefig(\"images/results_plot.pdf\", format=\"pdf\", bbox_inches=\"tight\")\n",
+ " plt.savefig(\"images/results_plot.png\", format=\"png\", dpi=300, bbox_inches=\"tight\")\n",
+ " plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8ac306d9-1c34-4cd6-94ab-b26499b1e45e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plot_results(averages, criteria, \"BioASQ\", \"ORKGSynthesis\", font_size=8)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a052a3da-e40b-49be-bcc7-980ef375eb92",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..f92d965
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,35 @@
+[tool.poetry]
+name = "YESciEval"
+
+version = "0.2.0"
+
+description = "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering."
+authors = ["Hamed Babaei Giglou <hamedbabaeigiglou@gmail.com>"]
+license = "MIT License"
+readme = "README.md"
+homepage = "https://yescieval.readthedocs.io/"
+repository = "https://github.com/sciknoworg/YESciEval/"
+include = ["images/logo.png"]
+
+[tool.poetry.dependencies]
+python = ">=3.10,<4.0.0"
+pre-commit="*"
+transformers="*"
+torch="*"
+peft="*"
+openai="*"
+pandas="*"
+numpy="*"
+pydantic="*"
+
+[tool.poetry.dev-dependencies]
+ruff = "*"
+pre-commit = "*"
+setuptools = "*"
+wheel = "*"
+twine = "*"
+pytest = "*"
+
+[build-system]
+requires = ["poetry-core>=1.0.0"]
+build-backend = "poetry.core.masonry.api"
diff --git a/readthedocs.yml b/readthedocs.yml
new file mode 100644
index 0000000..ea88004
--- /dev/null
+++ b/readthedocs.yml
@@ -0,0 +1,23 @@
+version: "2"
+
+
+build:
+
+ os: "ubuntu-22.04"
+ tools:
+ python: "3.10"
+
+python:
+ install:
+ - method: pip
+ path: .
+ - requirements: docs/requirements.txt
+ - requirements: requirements.txt
+
+sphinx:
+ builder: html
+ configuration: docs/source/conf.py
+
+submodules:
+ include: all
+ recursive: true
diff --git a/requirements.txt b/requirements.txt
index 778f31c..0478557 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,12 +1,9 @@
pre-commit
-scikit-learn
-python-dotenv
transformers
torch
peft
-tqdm
openai
pandas
+numpy
pydantic
-trl
-datasets
\ No newline at end of file
+pytest
\ No newline at end of file
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..e74ca6c
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,41 @@
+from setuptools import setup, find_packages
+
+with open("README.md", encoding="utf-8") as f:
+ long_description = f.read()
+
+setup(
+ name="YESciEval",
+ version="0.2.0",
+ author="Hamed Babaei Giglou",
+ author_email="hamedbabaeigiglou@gmail.com",
+ description="YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering.",
+ long_description=long_description,
+ long_description_content_type="text/markdown",
+ url="https://github.com/sciknoworg/YESciEval",
+ packages=find_packages(),
+ install_requires=[
+ "pre-commit",
+        "transformers",
+ "torch",
+ "peft",
+ "openai",
+ "pandas",
+ "numpy",
+ "pydantic",
+ "pytest"
+ ],
+ classifiers=[
+ "Development Status :: 5 - Production/Stable",
+ "Intended Audience :: Developers",
+ "Topic :: Software Development :: Libraries :: Python Modules",
+ "Programming Language :: Python :: 3",
+ "License :: OSI Approved :: MIT License",
+ "Operating System :: OS Independent",
+ ],
+ python_requires=">=3.10,<4.0.0",
+ project_urls={
+ "Documentation": "https://yescieval.readthedocs.io/",
+ "Source": "https://github.com/sciknoworg/YESciEval",
+ "Tracker": "https://github.com/sciknoworg/YESciEval/issues",
+ },
+)
diff --git a/test.py b/test.py
deleted file mode 100644
index 3f0f509..0000000
--- a/test.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from yescieval.rubric import Informativeness
-
-
-papers = {
- "A Study on AI": "This paper discusses recent advances in AI.",
- "Machine Learning Basics": "An overview of supervised learning methods."
-}
-question = "this is a dume question"
-synthesis="synthesis answer"
-rubric = Informativeness(papers=papers, question=question, synthesis=synthesis)
-print(rubric.instruct())
diff --git a/test/test_rubrics.py b/test/test_rubrics.py
new file mode 100644
index 0000000..c903fba
--- /dev/null
+++ b/test/test_rubrics.py
@@ -0,0 +1,21 @@
+import unittest
+from yescieval import Informativeness
+
+class TestRubric(unittest.TestCase):
+
+ def setUp(self):
+ self.papers = {
+ "A Study on AI": "This paper discusses recent advances in AI.",
+ "Machine Learning Basics": "An overview of supervised learning methods."
+ }
+        self.question = "this is a dummy question"
+ self.answer = "synthesis answer"
+
+ def test_informativeness(self):
+ rubric = Informativeness(papers=self.papers, question=self.question, answer=self.answer)
+ output = rubric.instruct()
+ self.assertIsInstance(output, list)
+ self.assertTrue(len(output) > 0)
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/yescieval/__init__.py b/yescieval/__init__.py
index 2e60550..9e161b4 100644
--- a/yescieval/__init__.py
+++ b/yescieval/__init__.py
@@ -1,23 +1,9 @@
-__version__ = "0.1.0"
+__version__ = "0.2.0"
from .base import Rubric, Parser
from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
Integration, Cohesion, Readability, Conciseness)
-
-
-__all__ = [
- "Rubric",
- "Informativeness",
- "Correctness",
- "Completeness",
- "Coherence",
- "Relevancy",
- "Integration",
- "Cohesion",
- "Readability",
- "Conciseness",
- "Parser"
-]
-
+from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
+from .parser import GPTParser
diff --git a/yescieval/base/__init__.py b/yescieval/base/__init__.py
index 7313318..7c07516 100644
--- a/yescieval/base/__init__.py
+++ b/yescieval/base/__init__.py
@@ -1,9 +1,10 @@
from .rubric import Rubric
-from .parser import Parser
+from .parser import Parser, RubricLikertScale
from .judge import Judge
__all__ = [
"Rubric",
"Parser",
+ "RubricLikertScale",
"Judge"
]
\ No newline at end of file
diff --git a/yescieval/base/judge.py b/yescieval/base/judge.py
index ed6a6aa..5ef75ed 100644
--- a/yescieval/base/judge.py
+++ b/yescieval/base/judge.py
@@ -1,11 +1,16 @@
from abc import ABC
-from typing import Dict
+from typing import Dict, Any
from . import Parser, Rubric
class Judge(ABC):
- def from_pretrained(self, model_id:str, device: str="auto"):
+
+ def from_pretrained(self, model_id:str, device: str="auto", token:str =""):
+ self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)
+
+    def evaluate(self, rubric: Rubric, max_new_tokens: int=150) -> str:
pass
- def judge(self, rubric: Rubric, parser: Parser = Parser) -> Dict[str, Dict[str, str]]:
+    def _from_pretrained(self, model_id: str, device: str = "auto", token: str = "") -> tuple[Any, Any]:
pass
+
diff --git a/yescieval/base/rubric.py b/yescieval/base/rubric.py
index ec20d16..64c37e7 100644
--- a/yescieval/base/rubric.py
+++ b/yescieval/base/rubric.py
@@ -12,10 +12,10 @@ class Rubric(BaseModel, ABC):
system_prompt_template: str
papers: Dict[str, str]
question: str
- synthesis: str
+ answer: str
user_prompt_template: str = ("Evaluate and rate the quality of the following scientific synthesis "
"according to the characteristics given in the system prompt.\n"
- "\n{synthesis}\n"
+ "\n{answer}\n"
"\n{question}\n"
"\n\n{content}\n\n###")
@@ -26,7 +26,7 @@ def render_papers(self) -> str:
return paper_content
def verbalize(self):
- return self.user_prompt_template.format(synthesis=self.synthesis,
+ return self.user_prompt_template.format(answer=self.answer,
question=self.question,
content=self.render_papers())
diff --git a/yescieval/judge/__init__.py b/yescieval/judge/__init__.py
index e69de29..a3fe787 100644
--- a/yescieval/judge/__init__.py
+++ b/yescieval/judge/__init__.py
@@ -0,0 +1,8 @@
+from .judges import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
+
+__all__ = [
+ "AutoJudge",
+ "AskAutoJudge",
+ "BioASQAutoJudge",
+ "CustomAutoJudge"
+]
\ No newline at end of file
diff --git a/yescieval/judge/judges.py b/yescieval/judge/judges.py
index 4a061cf..2c436f0 100644
--- a/yescieval/judge/judges.py
+++ b/yescieval/judge/judges.py
@@ -1,10 +1,68 @@
-from ..base import Judge, Parser, Rubric
+from ..base import Judge, Rubric
from typing import Dict
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel, PeftConfig
+import torch
+
+
+
class AutoJudge(Judge):
- def from_pretrained(self, model_id:str, device:str="auto"):
- pass
+ def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
+ config = PeftConfig.from_pretrained(model_id)
+ base_model_name = config.base_model_name_or_path
+ tokenizer = AutoTokenizer.from_pretrained(base_model_name,
+ padding_side="left",
+ token=token)
+ tokenizer.pad_token = tokenizer.eos_token
+ base_model = AutoModelForCausalLM.from_pretrained(
+ base_model_name,
+ torch_dtype=torch.float32,
+ device_map=device,
+ token=token
+ )
+ model = PeftModel.from_pretrained(base_model, model_id)
+ return model, tokenizer
+
+    def evaluate(self, rubric: Rubric, max_new_tokens: int=150) -> str:
+ inputs = self.tokenizer.apply_chat_template(rubric.instruct(),
+ add_generation_prompt=True,
+ return_dict=True,
+ return_tensors="pt")
+ inputs.to(self.model.device)
+ outputs = self.model.generate(**inputs,
+ max_new_tokens=max_new_tokens,
+ pad_token_id=self.tokenizer.eos_token_id)
+ evaluation = self.tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
+ return evaluation
+
+
+class AskAutoJudge(AutoJudge):
+ def from_pretrained(self, model_id:str="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
+ device:str="auto",
+ token:str =""):
+        return super().from_pretrained(model_id=model_id, device=device, token=token)
+
+class BioASQAutoJudge(AutoJudge):
+ def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B",
+ device: str = "auto",
+ token: str = ""):
+        return super().from_pretrained(model_id=model_id, device=device, token=token)
+
+
+
+class CustomAutoJudge(AutoJudge):
- def judge(self, rubric: Rubric, parser: Parser=Parser) -> Dict[str, Dict[str, str]]:
- pass
+ def _from_pretrained(self, model_id:str, device:str="auto", token:str =""):
+ tokenizer = AutoTokenizer.from_pretrained(model_id,
+ padding_side="left",
+ token=token)
+ tokenizer.pad_token = tokenizer.eos_token
+ model = AutoModelForCausalLM.from_pretrained(
+ model_id,
+ torch_dtype=torch.float32,
+ device_map=device,
+ token=token
+ )
+ return model, tokenizer
diff --git a/yescieval/parser/__init__.py b/yescieval/parser/__init__.py
index e69de29..40ec79e 100644
--- a/yescieval/parser/__init__.py
+++ b/yescieval/parser/__init__.py
@@ -0,0 +1,3 @@
+from .parsers import GPTParser
+
+__all__ = ["GPTParser"]
\ No newline at end of file
diff --git a/yescieval/parser/parsers.py b/yescieval/parser/parsers.py
new file mode 100644
index 0000000..7873375
--- /dev/null
+++ b/yescieval/parser/parsers.py
@@ -0,0 +1,61 @@
+from ..base import Parser, RubricLikertScale
+import json
+import time
+from openai import OpenAI
+
+class GPTParser(Parser):
+    """
+    Parser that uses an OpenAI chat model to extract a structured evaluation from a judge's raw output.
+
+    The extracted evaluation is returned as a RubricLikertScale holding a rating and a rationale.
+    """
+    def __init__(self, openai_key:str, parser_model:str="gpt-4o-mini"):
+        self.client = OpenAI(api_key=openai_key)
+        self.parser_model = parser_model
+
+    def parse(self, raw_output: str) -> RubricLikertScale:
+        """
+        Parse the raw model output into a structured rating and rationale.
+
+        Args:
+            raw_output (str): The text generated by the model.
+
+        Returns:
+            RubricLikertScale: The rating (1-5) and the rationale extracted from the text.
+        """
+        functions = [
+            {
+                "name": "evaluate_characteristic",
+                "description": "Extract the exact `rating` and `rationale` from the given text.",
+                "parameters": {
+                    "type": "object",
+                    "properties": {
+                        "rating": {
+                            "type": "number",
+                            "description": "A numerical rating assigned to the characteristic in the text.",
+                            "minimum": 1,
+                            "maximum": 5
+                        },
+                        "rationale": {
+                            "type": "string",
+                            "description": "The explanation for the assigned rating."
+                        }
+                    },
+                    "required": ["rating", "rationale"]
+                }
+            }
+        ]
+        while True:
+            try:
+                completion = self.client.chat.completions.create(
+                    model=self.parser_model,
+                    messages=[{"role": "user", "content": raw_output}],
+                    functions=functions
+                )
+                parsed_output = json.loads(completion.choices[0].message.function_call.arguments)
+                break
+            except Exception as e:
+                print(f"Error {e}")
+                time.sleep(3)
+
+        return RubricLikertScale(rating=parsed_output['rating'], rationale=parsed_output['rationale'])