Kastor is a modular framework for extracting RDF triples from unstructured text using shape-aware SLMs (Small Language Models). By combining SHACL shape definitions, a distilled knowledge graph, and active fine-tuning, Kastor builds lightweight, task-specific extractors. It targets applications in the Semantic Web, knowledge graph construction, and structured data mining.
git clone https://github.com/datalogism/Kastor.git
cd Kastor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Kastor/
├── corese/ # Corese RDF engine and knowledge base loader
├── kstor/ # Knowledge distillation and SHACL-based filtering
├── slm/ # Fine-tuning material
├── shapes/ # SHACL templates used for extraction
├── XP_results/ # Experimental outputs
├── doc/ # Documentation
├── img/ # Illustrations
└── README.md # This file
- Knowledge Base Initialization: Initialize your KB with DBpedia data.
- Shape Definition: Describe the desired RDF structure in a SHACL shape file.
- Knowledge Distillation: Filter and align text and RDF from the knowledge base using the SHACL shape.
- Data Augmentation: Augment your knowledge base so that rare properties are sufficiently represented.
- SLM Training: Train small language models on the distilled and enriched data to learn a text-to-RDF extractor.
- Light Active Learning: Use your models to create a gold dataset.
- Testing & Inference: Use the trained model to extract RDF triples from new text.
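The distillation and augmentation steps can be pictured with a small, self-contained sketch. It is not Kastor's implementation: the shape is reduced to a plain dictionary standing in for a SHACL file, and the names `SHAPE`, `TRIPLES`, `distill`, and `rare_properties`, along with the `ex:` sample data, are hypothetical. The idea shown is only that triples outside the shape are dropped, entities missing a required property are discarded, and under-represented properties are flagged as augmentation candidates.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a SHACL node shape: property -> minimum count.
# A real shape would be a SHACL file such as those under shapes/.
SHAPE = {
    "rdfs:label": 1,     # required
    "ex:birthDate": 1,   # required
    "ex:deathDate": 0,   # optional
}

# Hypothetical (subject, predicate, object) triples from a knowledge base.
TRIPLES = [
    ("ex:alice", "rdfs:label", "Alice"),
    ("ex:alice", "ex:birthDate", "1990-01-01"),
    ("ex:alice", "ex:deathDate", "2070-01-01"),
    ("ex:alice", "ex:pageId", "42"),          # not in the shape
    ("ex:bob", "rdfs:label", "Bob"),
    ("ex:bob", "ex:birthDate", "1985-05-05"),
    ("ex:carol", "rdfs:label", "Carol"),      # missing required birthDate
]


def distill(triples, shape):
    """Keep only shape properties, and only entities meeting every minimum count."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        if p in shape:  # drop properties the shape does not mention
            by_subject[s].append((s, p, o))
    kept = []
    for group in by_subject.values():
        counts = Counter(p for _, p, _ in group)
        if all(counts[p] >= min_count for p, min_count in shape.items()):
            kept.extend(group)
    return kept


def rare_properties(triples, threshold):
    """Return the properties that appear fewer than `threshold` times."""
    counts = Counter(p for _, p, _ in triples)
    return sorted(p for p, n in counts.items() if n < threshold)


distilled = distill(TRIPLES, SHAPE)
print(len(distilled))
print(rare_properties(distilled, threshold=2))
```

In this toy data, `ex:carol` is discarded for lacking a required property, and `ex:deathDate` is reported as rare, which is the kind of property the augmentation step is meant to reinforce.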
- Python >= 3.8
- PyTorch
- HuggingFace Transformers
- RDFlib
- Java 11+ (for Corese)
Install the Python dependencies with: pip install -r requirements.txt
- Use concise, complete SHACL definitions to improve distillation quality.
- Visualize RDF outputs to validate structure.
- Use active training for iterative improvement.
- Pre-filter the knowledge base to reduce noise.
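As one possible way to follow the visualization tip, the sketch below turns extracted triples into a Graphviz DOT description using only the standard library. The function name `to_dot` and the sample triples are hypothetical and not part of Kastor; extracted output would first need to be parsed into (subject, predicate, object) tuples, for instance with RDFlib.

```python
def to_dot(triples):
    """Render (subject, predicate, object) triples as a Graphviz DOT digraph."""

    def quote(value):
        # DOT string literals are double-quoted; escape embedded double quotes.
        return '"' + value.replace('"', '\\"') + '"'

    lines = ["digraph extracted {", "  rankdir=LR;"]
    for s, p, o in triples:
        # One labelled edge per triple: subject -> object, labelled by predicate.
        lines.append(f"  {quote(s)} -> {quote(o)} [label={quote(p)}];")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical extracted triples.
triples = [
    ("ex:alice", "ex:birthDate", "1990-01-01"),
    ("ex:alice", "rdfs:label", "Alice"),
]
dot = to_dot(triples)
print(dot)
```

Saving the printed text to a file such as `out.dot` and running `dot -Tpng out.dot -o out.png` (with Graphviz installed) produces an image in which missing or unexpected edges are easy to spot.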
Kastor is released under the MIT License.
Open a GitHub issue or contact the maintainers via https://github.com/datalogism/Kastor
🎉 Accepted at the Research Track of ESWC 2025
If you use this code or build on our work, please cite it as follows:
@inproceedings{DBLP:conf/esws/RingwaldGFMA25,
author = {C{\'{e}}lian Ringwald and
Fabien Gandon and
Catherine Faron and
Franck Michel and
Hanna Abi Akl},
editor = {Edward Curry and
Maribel Acosta and
Mar{\'{\i}}a Poveda{-}Villal{\'{o}}n and
Marieke van Erp and
Adegboyega K. Ojo and
Katja Hose and
Cogan Shimizu and
Pasquale Lisena},
title = {Kastor: Fine-Tuned Small Language Models for Shape-Based Active Relation
Extraction},
booktitle = {The Semantic Web - 22nd European Semantic Web Conference, {ESWC} 2025,
Portoroz, Slovenia, June 1-5, 2025, Proceedings, Part {I}},
series = {Lecture Notes in Computer Science},
volume = {15718},
pages = {94--115},
publisher = {Springer},
year = {2025},
url = {https://doi.org/10.1007/978-3-031-94575-5\_6},
doi = {10.1007/978-3-031-94575-5\_6},
timestamp = {Tue, 10 Jun 2025 17:38:39 +0200},
biburl = {https://dblp.org/rec/conf/esws/RingwaldGFMA25.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
The resulting extractor can be tested using this notebook.
