forked from datalogism/Kastor

Knowledge shape extractor pipeline for text-to-graph knowledge base infusion and completion

Kastor - Shape-based relation extraction framework


Kastor is a modular framework for extracting RDF triples from unstructured text using shape-aware SLMs (Small Language Models). By combining SHACL shape definitions, a distilled knowledge graph, and active fine-tuning, Kastor builds lightweight, task-specific extractors. It's ideal for applications in semantic web, knowledge graph construction, and structured data mining.

🚀 Quick Start

1. Clone and Setup

git clone https://github.com/datalogism/Kastor.git
cd Kastor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

📁 Project Overview

Kastor/
├── corese/           # Corese RDF engine and knowledge base loader
├── kstor/            # Knowledge distillation and SHACL-based filtering
├── slm/              # Finetuning material
├── shapes/           # SHACL templates used for extraction
├── XP_results/       # Experimental outputs
├── doc/              # Documentation
├── img/              # Illustrations
└── README.md         # This file

🧠 How It Works

  1. Knowledge Base Initialization: Initialize your KB with DBpedia data.
  2. Shape Definition: Describe your desired RDF structure in a SHACL shape file.
  3. Knowledge Distillation: Filter and align text and RDF from the knowledge base using the SHACL shape.
  4. Data Augmentation: Augment your knowledge base so that rare properties are sufficiently represented.
  5. SLM Training: Train small language models on the distilled and enriched data to learn a text-to-RDF extractor.
  6. Light Active Learning: Use your models to create a gold dataset.
  7. Testing & Inference: Use the trained model to extract RDF triples from new text.
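
The distillation step (3) can be pictured as keeping, for each entity, only the triples whose property appears in the shape, then checking that every required property is present. The sketch below is a minimal, pure-Python illustration under strong assumptions: the `Shape` class, the `distill` function, and the sample triples are hypothetical, and a dictionary of minimum counts stands in for a real SHACL shape. It is not Kastor's implementation and does not perform actual SHACL validation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Shape:
    """Hypothetical stand-in for a SHACL node shape: property -> minimum count."""
    target_class: str
    min_counts: Dict[str, int]


def distill(triples: List[Triple], shape: Shape) -> Tuple[List[Triple], bool]:
    """Keep only triples whose predicate is listed in the shape.

    Returns the kept triples and a flag telling whether every property
    reaches its minimum count (a rough analogue of sh:minCount).
    """
    kept = [t for t in triples if t[1] in shape.min_counts]
    counts = {prop: 0 for prop in shape.min_counts}
    for _, predicate, _ in kept:
        counts[predicate] += 1
    conforms = all(counts[p] >= n for p, n in shape.min_counts.items())
    return kept, conforms


# Illustrative data only; the shape and triples are made up for this example.
person_shape = Shape(
    target_class="dbo:Person",
    min_counts={"rdfs:label": 1, "dbo:birthDate": 1, "dbo:birthPlace": 0},
)
entity_triples = [
    ("dbr:Ada_Lovelace", "rdfs:label", "Ada Lovelace"),
    ("dbr:Ada_Lovelace", "dbo:birthDate", "1815-12-10"),
    ("dbr:Ada_Lovelace", "dbo:wikiPageID", "974"),
]
kept, conforms = distill(entity_triples, person_shape)
print(kept)      # triples whose property is covered by the shape
print(conforms)  # whether all minimum counts are satisfied
```

In this toy version, a property with a minimum count of 0 is optional: it is kept when present but does not affect conformance. A real pipeline would read these constraints from the SHACL file rather than from a hand-written dictionary.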

🛠 Requirements

  • Python >= 3.8
  • PyTorch
  • HuggingFace Transformers
  • RDFLib
  • Java 11+ (for Corese)

Install via pip install -r requirements.txt
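
To confirm the prerequisites before running anything heavier, a small check like the following may help. It is a sketch rather than part of Kastor: the `check_environment` function is hypothetical, it assumes the usual import names (`torch`, `transformers`, `rdflib`), and it only tests that a `java` executable is on the PATH without verifying that the version is 11 or later.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which of the listed requirements appear to be available."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    # find_spec returns None when a top-level module cannot be located.
    for module in ("torch", "transformers", "rdflib"):
        report[module] = importlib.util.find_spec(module) is not None
    # Only checks that a java executable exists, not its version.
    report["java"] = shutil.which("java") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```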


✅ Best Practices

  • Use concise, complete SHACL definitions to improve distillation quality.
  • Visualize RDF outputs to validate structure.
  • Use active training for iterative improvement.
  • Pre-filter the knowledge base to reduce noise.
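
For a quick structural check of extracted triples, grouping them by subject often makes gaps or unexpected properties easy to spot. The sketch below is one simple way to do that in plain Python; the `format_by_subject` function and the sample triples are hypothetical, and the indented output is only a reading aid, not valid Turtle or any other RDF syntax.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def format_by_subject(triples: Iterable[Triple]) -> str:
    """Render triples as an indented, subject-grouped text view."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        grouped[subject].append((predicate, obj))
    lines: List[str] = []
    for subject, pairs in grouped.items():
        lines.append(subject)
        for predicate, obj in pairs:
            lines.append(f"    {predicate} -> {obj}")
    return "\n".join(lines)


# Illustrative triples only; in practice these would come from the extractor.
extracted = [
    ("dbr:Ada_Lovelace", "rdfs:label", "Ada Lovelace"),
    ("dbr:Ada_Lovelace", "dbo:birthDate", "1815-12-10"),
]
print(format_by_subject(extracted))
```

Subjects appear in the order they are first seen, so the view stays stable across runs on the same input.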

📜 License

Kastor is released under the MIT License.


📬 Questions or Issues?

Open a GitHub issue or contact the maintainers via https://github.com/datalogism/Kastor


📝 Related publications

1- Kastor: Fine-Tuned Small Language Models for Shape-Based Active Relation Extraction [PUBLISHED]

🎉 Accepted at the Research Track of ESWC 2025

If you use the code or build on our work, please cite it as follows:

@inproceedings{DBLP:conf/esws/RingwaldGFMA25,
  author       = {C{\'{e}}lian Ringwald and
                  Fabien Gandon and
                  Catherine Faron and
                  Franck Michel and
                  Hanna Abi Akl},
  editor       = {Edward Curry and
                  Maribel Acosta and
                  Mar{\'{\i}}a Poveda{-}Villal{\'{o}}n and
                  Marieke van Erp and
                  Adegboyega K. Ojo and
                  Katja Hose and
                  Cogan Shimizu and
                  Pasquale Lisena},
  title        = {Kastor: Fine-Tuned Small Language Models for Shape-Based Active Relation
                  Extraction},
  booktitle    = {The Semantic Web - 22nd European Semantic Web Conference, {ESWC} 2025,
                  Portoroz, Slovenia, June 1-5, 2025, Proceedings, Part {I}},
  series       = {Lecture Notes in Computer Science},
  volume       = {15718},
  pages        = {94--115},
  publisher    = {Springer},
  year         = {2025},
  url          = {https://doi.org/10.1007/978-3-031-94575-5\_6},
  doi          = {10.1007/978-3-031-94575-5\_6},
  timestamp    = {Tue, 10 Jun 2025 17:38:39 +0200},
  biburl       = {https://dblp.org/rec/conf/esws/RingwaldGFMA25.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Associated material:

The resulting extractor can be tested using this notebook.

2- Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties [UNDER-REVIEW]
