sciknoworg/schema-miner

LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

This is an open-source implementation of Schema-miner.

πŸ“‹ Schema-miner Overview

Schema-miner (LLMs4SchemaDiscovery) is a novel framework that leverages Large Language Models (LLMs) and continuous human feedback to automate and enhance the schema-mining task. Through an iterative process, the framework uses LLMs to extract and organize properties from unstructured text, refine schemas with expert input, and incorporate domain-specific ontologies to add semantic knowledge. Comprehensive documentation for schema-miner, including detailed guides and examples, is available at schema-miner.readthedocs.io.

Figure 1: Overview of the LLMs4SchemaDiscovery workflow.

βš™οΈ System Requirements

The computational requirements for running this project vary depending on the model being used. If you use OpenAI models such as GPT-4o or GPT-4-turbo, no specialized hardware is needed, since inference is performed via API calls. A basic system with a stable internet connection is sufficient for executing the API-based workflow.

For users opting to run open-source models such as Llama 3.1 8B or other large-scale transformer-based models locally, execution demands significantly higher computational resources. While these models can run on a CPU, inference times will be considerably longer. For efficient execution, a dedicated GPU with sufficient VRAM (as specified in the model's documentation) is strongly recommended.

The hardware configuration can be scaled to match model size and performance needs, but a GPU drastically reduces inference time compared to a CPU-only setup.
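As a rough pre-flight check before opting for local execution, you can probe whether NVIDIA driver tooling is present on the machine. This is only a heuristic sketch (it assumes an NVIDIA GPU and the `nvidia-smi` CLI) and is no substitute for checking VRAM against the model's documentation:

```python
import shutil

def has_nvidia_gpu() -> bool:
    """Heuristic check: is the NVIDIA driver CLI (nvidia-smi) on PATH?

    Returns False on CPU-only machines and on GPUs without the NVIDIA
    toolchain (e.g. the Intel Arc setup used in our experiments).
    """
    return shutil.which("nvidia-smi") is not None
```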

Experimental Configuration

For our experiments, we used the following hardware setup:

  • Processor: 16-core CPU
  • Memory: 32 GB RAM
  • GPU: Intel Arc Graphics
  • Models Used:
    • Cloud-based: GPT-4o and GPT-4-turbo (via OpenAI API)
    • Locally run: Llama 3.1 8B

πŸ§ͺ Installation

Install the package directly from PyPI:

pip install -i https://test.pypi.org/simple/ schema-miner

If you are working with the source code directly, install dependencies from requirements.txt:

pip install -r requirements.txt
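To confirm the installation succeeded, a small check with the standard library can verify that a module is importable; substitute the module name your install actually provides if it differs from `schema_miner`:

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None

# e.g. is_installed("schema_miner") after running the pip install above
```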

πŸš€ Quick Start

For a quick start, see the provided example notebooks, which highlight the overall schema-miner workflow.

πŸ§‘β€πŸ’» Schema-miner Tool Usage

schema_miner enables schema discovery and refinement through a three-stage pipeline powered by LLMs, domain expertise, and scientific literature.

πŸ› οΈ Configuration

Before running schema-miner, configure your environment. For example:

from schema_miner.config.envConfig import EnvConfig
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
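Hard-coding keys in a notebook makes them easy to leak. A safer pattern is to read them from the environment first; this is a sketch assuming you export `OPENAI_API_KEY` and `OPENAI_ORGANIZATION_ID` in your shell (these variable names are a convention of the sketch, not required by schema-miner):

```python
import os

# Read credentials from the environment instead of pasting them into the notebook.
openai_api_key = os.environ.get("OPENAI_API_KEY", "")
openai_org_id = os.environ.get("OPENAI_ORGANIZATION_ID", "")

# Then assign them as in the snippet above:
# EnvConfig.OPENAI_api_key = openai_api_key
# EnvConfig.OPENAI_organization_id = openai_org_id
```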

πŸ“‚ Data Setup

Before running schema-miner, tell it where to find:

  • Domain specification document(s) β†’ used for initial schema mining
  • Curated corpus (high-quality literature) β†’ used for refinement
  • Broader corpus (larger set of papers) β†’ used for final validation
# Process Specification for Stage 1
process_specification_filepath = '../data/stage-1/Atomic-Layer-Deposition/Experimental-Usecase'
process_specification_filename = 'ALD-Process-Development.pdf'

# Small curated corpus of scientific papers
scientific_paper_stage2_dir = '../data/stage-2/Atomic-Layer-Deposition/research-papers/experimental-usecase'

# Larger corpus of scientific papers
scientific_paper_stage3_dir = '../data/stage-3/Atomic-Layer-Deposition/research-papers/experimental_usecase'
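Since each stage reads from these locations, a small pre-flight check can catch path typos before any LLM call is made. This is an illustrative helper, not part of schema_miner:

```python
from pathlib import Path

def missing_inputs(*paths):
    """Return the subset of configured input paths that do not exist on disk.

    Hypothetical pre-flight helper: run it on the specification file and the
    corpus directories before starting Stage 1.
    """
    return [str(p) for p in paths if not Path(p).exists()]

# e.g. missing_inputs(process_specification_filepath,
#                     scientific_paper_stage2_dir,
#                     scientific_paper_stage3_dir)
```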

πŸ§ͺ Process Setup

Initialize the name and description of the process whose schema is to be extracted:

from schema_miner.config.processConfig import ProcessConfig
ProcessConfig.Process_name = "Atomic Layer Deposition"
ProcessConfig.Process_description = "An ALD process involves a series of controlled chemical reactions used to deposit thin films on a surface at an atomic level"

🧩 Stage 1 – Initial Schema Mining

Generate an initial JSON schema from a process specification document using a preferred LLM.

import json
import logging
from pathlib import Path
from schema_miner.pdf_text_extractor import pdf_text_extractor
from schema_miner.schema_extractor.extract_schema import extract_schema_stage1

# Choose LLM
llm_model_name = 'gpt-4o'

# Input process specification
process_specification = pdf_text_extractor(process_specification_filepath, process_specification_filename, return_text = True)

# Extract schema
results_file_path = Path("./results/stage-1/Atomic-Layer-Deposition/experimental-schema")
schema = extract_schema_stage1(llm_model_name, process_specification, results_file_path, save_schema = True)
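The saved schema is a JSON document. Assuming the Stage 1 output follows standard JSON Schema layout with a top-level `properties` object (adjust the key if your output differs), a quick helper lists what the LLM extracted:

```python
def schema_property_names(schema):
    """List the top-level property names of a JSON Schema dict.

    Illustrative helper, not part of schema_miner: handy for a first
    sanity check of the Stage 1 output before expert review.
    """
    return sorted((schema or {}).get("properties", {}).keys())
```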

πŸ”„ Stage 2 – Preliminary Schema Refinement

Refine the Stage 1 schema using scientific literature and expert feedback.

from schema_miner.schema_extractor.extract_schema import extract_schema_stage2

# Input Initial Schema, Expert Feedback and Scientific Literature
schema = Path("./results/stage-1/Atomic-Layer-Deposition/experimental-schema/gpt-4o.json")
expert_review = Path("./data/stage-2/Atomic-Layer-Deposition/domain-expert-reviews/experimental-usecase/method-1/gpt-4o.txt")
scientific_paper = pdf_text_extractor(scientific_paper_stage2_dir, '1 Groner et al.pdf', return_text = True)

# Refine schema
results_file_path = Path("./results/stage-2/Atomic-Layer-Deposition/experimental-schema")
schema = extract_schema_stage2(llm_model_name, schema, expert_review, scientific_paper, results_file_path, save_schema = True)
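Stage 2 is typically repeated over each paper in the curated corpus. A minimal sketch for enumerating the corpus in a stable order (so refinement runs are reproducible); the per-paper call to `extract_schema_stage2` stays as shown above:

```python
from pathlib import Path

def list_corpus_papers(corpus_dir):
    """Return the PDF filenames in a corpus directory, sorted for reproducibility."""
    return sorted(p.name for p in Path(corpus_dir).glob("*.pdf"))

# for paper in list_corpus_papers(scientific_paper_stage2_dir):
#     text = pdf_text_extractor(scientific_paper_stage2_dir, paper, return_text = True)
#     # ... feed `text` into extract_schema_stage2 as in the snippet above
```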

🏁 Stage 3 – Final Schema Refinement

Validate and finalize the schema using a larger corpus of research papers and expert review, ensuring generalizability and semantic robustness.

from schema_miner.schema_extractor.extract_schema import extract_schema_stage3

# Input Schema, Expert Feedback and Scientific Literature
schema = Path("./results/stage-2/Atomic-Layer-Deposition/experimental-schema/gpt-4o.json")
expert_review = Path("./data/stage-3/Atomic-Layer-Deposition/domain-expert-reviews/experimental-usecase/Experiment-1/1a/gpt-4o.txt")
scientific_paper = pdf_text_extractor(scientific_paper_stage3_dir, '1-Mattinen et al.pdf', return_text = True)

# Finalize schema
results_file_path = Path("./results/stage-3/Atomic-Layer-Deposition/experimental-schema")
schema = extract_schema_stage3(llm_model_name, schema, expert_review, scientific_paper, results_file_path, save_schema = True)

# View Final Schema
logging.info(f"{ProcessConfig.Process_name} Schema:\n{json.dumps(schema, indent=2)}")
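To see what a refinement stage actually changed, you can diff the top-level properties of two schema versions. `schema_property_diff` is an illustrative helper, not part of schema_miner, and assumes standard JSON Schema layout:

```python
def schema_property_diff(old_schema, new_schema):
    """Report top-level properties added and removed between two schema versions."""
    old = set((old_schema or {}).get("properties", {}))
    new = set((new_schema or {}).get("properties", {}))
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```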

🌐 Ontology Grounding with QUDT

Once a process schema is extracted, it can be semantically grounded using the QUDT Ontologies (Quantities, Units, Dimensions, and Data Types).

The grounding workflow uses either direct LLM prompting or an agentic LLM approach to align schema fields with QUDT concepts. The following is an example of agent-based QUDT grounding.

from schema_miner.ontology_grounding.agentic_qudt_grounding import agentic_qudt_grounding

# Select LLM for grounding
llm_model_name = 'gpt-4o'

# Ground the schema with QUDT Ontology
process_schema = Path('./results/Ideal Schema/Atomic-Layer-Deposition/experimental-ideal-schema.json')
results_file_path = Path("./results/qudt-grounded/Atomic-Layer-Deposition/experimental-schema")
schema = agentic_qudt_grounding(llm_model_name, process_schema, results_file_path, save_schema = True)

# Display grounded schema
logging.info(f'{ProcessConfig.Process_name} Schema:\n{json.dumps(schema, indent = 2)}')
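Conceptually, grounding maps unit strings found in the schema to QUDT IRIs. The agentic workflow lets the LLM choose these alignments; as a minimal static illustration, the hand-picked table below (not produced by schema_miner) shows what such an alignment looks like:

```python
# Illustrative lookup from unit strings common in ALD schemas to QUDT unit IRIs.
UNIT_TO_QUDT = {
    "nm": "http://qudt.org/vocab/unit/NanoM",
    "s": "http://qudt.org/vocab/unit/SEC",
    "Pa": "http://qudt.org/vocab/unit/PA",
    "degC": "http://qudt.org/vocab/unit/DEG_C",
}

def ground_unit(unit_string):
    """Return the QUDT IRI for a unit string, or None if unmapped."""
    return UNIT_TO_QUDT.get(unit_string)
```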

πŸ“š Citing this Work

If you use schema-miner in your research or applications, please cite the following paper:

Sameer Sadruddin, Jennifer D’Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, SΓΆren Auer, Adrie Mackus, and Erwin Kessels.
LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models.
In The Semantic Web – ESWC 2025, Springer, Cham, pp. 244–261.
https://doi.org/10.1007/978-3-031-94578-6_14

πŸ“Œ BibTeX

@InProceedings{10.1007/978-3-031-94578-6_14,
  author    = {Sadruddin, Sameer and D'Souza, Jennifer and Poupaki, Eleni and Watkins, Alex and Babaei Giglou, Hamed and Rula, Anisa and Karasulu, Bora and Auer, S{\"o}ren and Mackus, Adrie and Kessels, Erwin},
  editor    = {Curry, Edward and Acosta, Maribel and Poveda-Villal{\'o}n, Maria and van Erp, Marieke and Ojo, Adegboyega and Hose, Katja and Shimizu, Cogan and Lisena, Pasquale},
  title     = {LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models},
  booktitle = {The Semantic Web},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {244--261},
  isbn      = {978-3-031-94578-6},
}

πŸ‘₯ Contact & Contributions

We’d love to hear from you! Whether you're interested in collaborating on schema-miner or have ideas to extend its capabilities, feel free to reach out:

  • Collaboration inquiries: Contact Jennifer D'Souza at jennifer.dsouza [at] tib.eu

  • Development questions or bug reports: Please open an issue right here in the repository or get in touch with the lead developer Sameer Sadruddin at sameer.sadruddin [at] tib.eu

Let’s build better schema-mining toolsβ€”together!

πŸ“ƒ License

This work is licensed under the MIT License.