SeMi - SEmantic Modeling machIne

SeMi (SEmantic Modeling machIne) is a tool to semi-automatically build large-scale Knowledge Graphs from structured sources such as CSV, JSON, and XML files. To this end, SeMi builds the semantic models of the data sources in terms of concepts and relations within a domain ontology. Most research contributions on automatic semantic modeling focus on detecting the semantic types of source attributes. However, inferring the correct semantic relations between these attributes is critical to reconstruct the precise meaning of the data. SeMi covers the entire process of semantic modeling:

  1. it provides a semi-automatic step to detect semantic types;
  2. it exploits a novel approach to infer semantic relations, based on a graph neural network trained on background linked data.

Semantic models can be formalized as graphs, where leaf nodes represent the attributes of the data source and the other nodes and relationships are defined by the ontology.

Consider the following JSON file in the public procurement domain:

{
  "contract_id": "Z4ADEA9DE4",
  "contract_object": "Excavations",
  "proponent_struct": {
    "business_id": "80004990927",
    "business_name": "municipality01"
  },
  "participants": [
    {
      "business_id": "08106710158",
      "business_name": "company01"
    }
  ]
}

Given this file and the following domain ontology related to public procurement:

Domain Ontology

the resulting semantic model is:

Semantic Model
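Since the leaf nodes of a semantic model correspond to the attributes of the data source, it can help to see which attributes the JSON file above exposes. The following is a minimal sketch, not part of SeMi: the function name `attribute_paths` is hypothetical, and the `__` separator is borrowed from the variable names that appear in the JARQL query later in this README.

```python
import json


def attribute_paths(node, prefix=""):
    """Return the set of leaf attribute paths found in a parsed JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}__{key}" if prefix else key
            paths |= attribute_paths(value, child)
    elif isinstance(node, list):
        # Items of an array share the path of the array itself.
        for item in node:
            paths |= attribute_paths(item, prefix)
    else:
        # A scalar value is a leaf, i.e. a source attribute.
        paths.add(prefix)
    return paths


source = json.loads("""
{
  "contract_id": "Z4ADEA9DE4",
  "contract_object": "Excavations",
  "proponent_struct": {"business_id": "80004990927", "business_name": "municipality01"},
  "participants": [{"business_id": "08106710158", "business_name": "company01"}]
}
""")

leaves = sorted(attribute_paths(source))
for path in leaves:
    print(path)
```

Each printed path is a candidate leaf node; the class nodes and the relations above them come from the domain ontology.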

Requirements

Before installing SeMi, you need to check the following requirements.

Download

To download SeMi, you can run the commands available here.

Set-up

To install SeMi, you can use the following instructions.

Step-by-step Semantic Models Generation

Using the following scripts, you can generate a semantic model starting from a target source and a domain ontology.

Semantic Types

Semantic types (or semantic labels) consist of a combination of an ontology class and an ontology data property. To perform the semantic type detection process, you need to execute two different scripts. The first script is the following:

$ node run/semantic_label_indexer.js pc data/pc/input/
  • pc is the Elasticsearch index name.
  • data/pc/input/ is the input folder containing files that have to be indexed.

This step is necessary to create the Elasticsearch index used as reference to detect the semantic types. The second script is the following:

$ node run/semantic_label.js pc data/pc/input/Z4ADEA9DE4.json data/pc/semantic_types/Z4ADEA9DE4_st_auto.json

In SeMi, semantic type detection is treated as a semi-automatic task.

For this reason, the manually refined version of the semantic types is available in the file:

  • data/pc/semantic_types/Z4ADEA9DE4_st.json

The image below represents the semantic types.

Semantic Types
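To illustrate the semi-automatic workflow described above, here is a small sketch of a semantic type as a class/data-property pair, with manual corrections overlaid on automatic detections. It does not reproduce the layout of the `_st_auto.json` and `_st.json` files, which is not documented here; the `SemanticType` and `refine` names are hypothetical, and the class/property pairs are taken from the CONSTRUCT query shown later in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticType:
    ontology_class: str  # e.g. pc:Contract
    data_property: str   # e.g. dcterms:identifier


def refine(automatic, manual):
    """Overlay manual corrections on automatically detected semantic types."""
    refined = dict(automatic)  # copy, so the automatic result is preserved
    refined.update(manual)
    return refined


# Attribute name -> detected semantic type.
automatic = {
    "cig": SemanticType("pc:Contract", "dcterms:identifier"),
    # Hypothetical wrong guess, included only to show a correction.
    "oggetto": SemanticType("gr:BusinessEntity", "rdfs:label"),
}
manual = {"oggetto": SemanticType("pc:Contract", "rdfs:description")}

refined = refine(automatic, manual)
for attribute, semantic_type in refined.items():
    print(attribute, semantic_type.ontology_class, semantic_type.data_property)
```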

Multi-edge and Weighted Graph (MEWG)

The Multi-edge and Weighted Graph (MEWG) includes all plausible semantic models of a data source based on a domain ontology. To create such a graph, you can run the following command:

$ node run/graph.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/ontology/ontology.ttl rdfs:domain rdfs:range owl:Class data/pc/semantic_models/Z4ADEA9DE4
  • data/pc/semantic_types/Z4ADEA9DE4_st.json is the input semantic type file.
  • data/pc/ontology/ontology.ttl is the domain ontology file.
  • rdfs:domain is the domain property in the ontology.
  • rdfs:range is the range property in the ontology.
  • owl:Class is the ontology term used to identify classes.
  • data/pc/semantic_models/Z4ADEA9DE4 is used as output path for the generation of the graph in different formats.

This script generates two types of graph:

The image below represents the MEWG:

Multi-edge and Weighted Graph
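The idea of a multi-edge and weighted graph can be sketched in a few lines. This is an illustration under assumptions, not the logic of run/graph.js: the uniform default weight, the class and function names, and the `ex:otherRelation` property are hypothetical. Only `pc:contractingAuthority`, `pc:Contract`, and `gr:BusinessEntity` appear elsewhere in this README.

```python
from collections import defaultdict


class MultiEdgeWeightedGraph:
    """Directed graph that allows several labeled, weighted edges per node pair."""

    def __init__(self):
        # (source, target) -> list of (property label, weight)
        self._edges = defaultdict(list)

    def add_edge(self, source, target, label, weight=1.0):
        self._edges[(source, target)].append((label, weight))

    def edges_between(self, source, target):
        return list(self._edges.get((source, target), []))


def build_graph(object_properties, classes, default_weight=1.0):
    """Add one edge per property whose domain and range are both known classes."""
    graph = MultiEdgeWeightedGraph()
    for prop, domain, range_ in object_properties:
        if domain in classes and range_ in classes:
            graph.add_edge(domain, range_, prop, default_weight)
    return graph


classes = {"pc:Contract", "gr:BusinessEntity"}
object_properties = [
    ("pc:contractingAuthority", "pc:Contract", "gr:BusinessEntity"),
    ("ex:otherRelation", "pc:Contract", "gr:BusinessEntity"),  # hypothetical
    ("ex:ignored", "pc:Contract", "ex:UnknownClass"),          # hypothetical
]

graph = build_graph(object_properties, classes)
parallel = graph.edges_between("pc:Contract", "gr:BusinessEntity")
print(parallel)
```

Two parallel edges between the same pair of classes are what make the graph "multi-edge": each one is a different plausible relation for the semantic model.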

Steiner Tree

To create the Steiner Tree on the MEWG, you can run the following command:

$ node run/steiner_tree.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4

This script generates two types of Steiner trees:

The image below represents a Steiner tree.

Steiner Tree
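For readers unfamiliar with Steiner trees, the sketch below shows one well-known approximation, the shortest-path heuristic: starting from one terminal, it repeatedly attaches the closest remaining terminal through its shortest path. This is not necessarily the algorithm implemented in run/steiner_tree.js, it treats the graph as undirected for simplicity, and the node names and weights are hypothetical.

```python
import heapq


def steiner_tree(edges, terminals):
    """Approximate a Steiner tree over an undirected weighted graph.

    edges: iterable of (u, v, weight); terminals: nodes that must be connected.
    Returns a set of (u, v, weight) tuples with u < v.
    """
    adjacency = {}
    for u, v, w in edges:
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    ordered = sorted(set(terminals))
    tree_nodes = {ordered[0]}
    remaining = set(ordered[1:])
    tree_edges = set()

    while remaining:
        # Multi-source Dijkstra starting from every node already in the tree.
        dist = {node: 0.0 for node in tree_nodes}
        parent = {}
        heap = [(0.0, node) for node in tree_nodes]
        heapq.heapify(heap)
        reached = None
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry
            if node in remaining:
                reached = node
                break
            for neighbor, w in adjacency.get(node, []):
                candidate = d + w
                if candidate < dist.get(neighbor, float("inf")):
                    dist[neighbor] = candidate
                    parent[neighbor] = (node, w)
                    heapq.heappush(heap, (candidate, neighbor))
        if reached is None:
            raise ValueError("terminals are not connected")
        # Walk back from the reached terminal to the tree, adding the path.
        node = reached
        while node not in tree_nodes:
            previous, w = parent[node]
            tree_edges.add((min(previous, node), max(previous, node), w))
            tree_nodes.add(node)
            node = previous
        remaining.discard(reached)
    return tree_edges


# Hypothetical graph: three terminal classes and one intermediate "Hub" class.
edges = [
    ("Address", "Hub", 1.0),
    ("BusinessEntity", "Hub", 1.0),
    ("Contract", "Hub", 1.0),
    ("Address", "BusinessEntity", 3.0),
    ("BusinessEntity", "Contract", 3.0),
]
tree = steiner_tree(edges, ["Contract", "BusinessEntity", "Address"])
total_weight = sum(w for _, _, w in tree)
print(sorted(tree), total_weight)
```

In this toy graph the tree passes through the non-terminal "Hub" node because that is cheaper than the direct edges, which is exactly the kind of intermediate class a Steiner tree can add to a semantic model.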

Initial Semantic Model

For the automatic generation of the semantic model, you can run the following command:

$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4

Below is an example of the semantic model serialized using SPARQL and JARQL syntax:

CONSTRUCT {
    ?Contract0 dcterms:identifier ?cig.
    ?Contract0 rdf:type pc:Contract.
    ?Contract0 rdfs:description ?oggetto.
    ?Contract0 rdf:type pc:Contract.
    ?BusinessEntity0 dcterms:identifier ?strutturaProponente__codiceFiscaleProp.
    ?BusinessEntity0 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 dcterms:identifier ?partecipanti__identificativo.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 rdfs:label ?partecipanti__ragioneSociale.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 dcterms:identifier ?aggiudicatari__identificativo.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 rdfs:label ?aggiudicatari__ragioneSociale.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?Contract0 pc:contractingAuthority ?BusinessEntity0.
    ?Contract0 pc:contractingAuthority ?BusinessEntity1.
}
WHERE {
    ?root a jarql:Root.
    OPTIONAL { ?root jarql:cig ?cig. }
    OPTIONAL { ?root jarql:oggetto ?oggetto. }
    OPTIONAL { ?root jarql:strutturaProponente ?strutturaProponente. }
    OPTIONAL { ?strutturaProponente jarql:codiceFiscaleProp ?strutturaProponente__codiceFiscaleProp. }
    OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
    OPTIONAL { ?partecipanti jarql:identificativo ?partecipanti__identificativo. }
    OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
    OPTIONAL { ?partecipanti jarql:ragioneSociale ?partecipanti__ragioneSociale. }
    OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
    OPTIONAL { ?aggiudicatari jarql:identificativo ?aggiudicatari__identificativo. }
    OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
    OPTIONAL { ?aggiudicatari jarql:ragioneSociale ?aggiudicatari__ragioneSociale. }
    BIND (URI(CONCAT('http://purl.org/procurement/public-contracts/contract/',?cig)) as ?Contract0)
    BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?strutturaProponente__codiceFiscaleProp)) as ?BusinessEntity0)
    BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?partecipanti__identificativo)) as ?BusinessEntity1)
}
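The BIND clauses of the query above mint entity URIs by concatenating a base URI with an attribute value, and the OPTIONAL patterns mean that some variables may remain unbound. The sketch below mimics that behavior in plain Python for a subset of the templates. It is an illustration only, not how the JARQL tool is implemented; the `bind_uri` and `construct` helpers are hypothetical.

```python
CONTRACT_BASE = "http://purl.org/procurement/public-contracts/contract/"
ENTITY_BASE = "http://purl.org/goodrelations/v1/businessentity/"


def bind_uri(base, value):
    """Mimic BIND(URI(CONCAT(base, ?value))): stay unbound if the value is unbound."""
    if value is None:
        return None
    return base + value


def construct(record):
    """Instantiate a few CONSTRUCT templates for one JSON record."""
    cig = record.get("cig")
    fiscal_code = record.get("strutturaProponente", {}).get("codiceFiscaleProp")
    contract = bind_uri(CONTRACT_BASE, cig)
    authority = bind_uri(ENTITY_BASE, fiscal_code)
    templates = [
        (contract, "rdf:type", "pc:Contract"),
        (contract, "dcterms:identifier", cig),
        (contract, "rdfs:description", record.get("oggetto")),
        (authority, "rdf:type", "gr:BusinessEntity"),
        (contract, "pc:contractingAuthority", authority),
    ]
    # A template that contains an unbound variable produces no triple.
    return [triple for triple in templates if None not in triple]


record = {
    "cig": "Z4ADEA9DE4",
    "strutturaProponente": {"codiceFiscaleProp": "80004990927"},
}
triples = construct(record)
for triple in triples:
    print(triple)
```

Because the record has no `oggetto` attribute, the description triple is skipped while the remaining four are produced.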

KG Generation Through the Initial Semantic Model

In order to create the KG resulting from the initial semantic model, you have to run the JARQL tool with the following command:

$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4.query > data/pc/output/Z4ADEA9DE4.ttl

Below is an example of the generated RDF file:

<http://purl.org/procurement/public-contracts/contract/Z4ADEA9DE4>
        <http://purl.org/dc/terms/identifier>
                "Z4ADEA9DE4"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://purl.org/procurement/public-contracts#contractingAuthority>
                <http://purl.org/goodrelations/v1/businessentity/03382820920> , <http://purl.org/goodrelations/v1/businessentity/80004990927> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
                <http://purl.org/procurement/public-contracts#Contract> ;
        <http://www.w3.org/2000/01/rdf-schema#description>
                "C.E. 23 Targa E9688 ( RIP.OFF.PRIVATE ) MANUTENZIONE ORDINARIA MEZZI DI TRASPORTO"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/03382820920>
        <http://purl.org/dc/terms/identifier>
                "03382820920"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
                <http://purl.org/goodrelations/v1#BusinessEntity> ;
        <http://www.w3.org/2000/01/rdf-schema#label>
                "CAR WASH CARALIS DI PUSCEDDU GRAZIANO   C  S N C"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/80004990927>
        <http://purl.org/dc/terms/identifier>
                "80004990927"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
                <http://purl.org/goodrelations/v1#BusinessEntity> .

Issues Related to the Initial Semantic Model

The approach for generating the initial semantic model has one main limitation: the Steiner tree within the graph includes the shortest path connecting the semantic type classes, but this path does not necessarily express the correct semantic description of the target source. For this reason, a refinement process is required to identify a more accurate semantic model.

Semantic Model Refinement

Semantic model refinement requires preparing the training, test, and validation datasets used as input to the deep learning model. This model is a graph neural network whose main goal is to reconstruct the linked data edges using the latent representations of entities and properties. The architecture of the graph neural network is an auto-encoder composed of:

The training, test, and validation datasets are built by splitting a linked data repository (the background knowledge). This repository is generated through the semantic models that domain experts defined on various sources similar to the target source.

In our example, the input sources are available in the data/pc/input folder and the ground-truth semantic model is available in the semi/data/learning_datasets/pc.query file.

The background linked data is available in the data/pc/learning_datasets/complete.ttl file. This background knowledge is then split into the following datasets:

  • the training dataset available in the data/pc/learning_datasets/training.ttl file;
  • the validation dataset available in the data/pc/learning_datasets/valid.ttl file;
  • the test dataset available in the data/pc/learning_datasets/test.ttl file.
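The split of the background knowledge into the three datasets listed above can be sketched as follows. The proportions (80/10/10), the fixed seed, and the `split_triples` helper are assumptions made for illustration; the README does not state how the provided files were produced, and the sketch works on in-memory triples rather than parsing Turtle.

```python
import random


def split_triples(triples, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the triples and split them into train, validation, and test lists."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = int(len(shuffled) * valid_fraction)
    n_test = int(len(shuffled) * test_fraction)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Hypothetical background facts: 20 contracts, each with a contracting authority.
facts = [
    (f"contract/{i}", "pc:contractingAuthority", f"businessentity/{i % 5}")
    for i in range(20)
]
train, valid, test = split_triples(facts)
print(len(train), len(valid), len(test))
```

The three lists are disjoint and together cover every fact, so no test triple is seen during training.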

Graph Neural Network Training

For the graph neural network training, you can launch the following script:

$ python src/link_prediction/link_predict.py --directory data/pc/learning_datasets/ --train data/pc/learning_datasets/training.ttl --valid data/pc/learning_datasets/valid.ttl --test data/pc/learning_datasets/test.ttl --score pc --parser PC --gpu 0 --graph-batch-size 1000 --n-hidden 100 --graph-split-size 1
  • --directory data/pc/learning_datasets/ is the directory in which the entity and property dictionaries are stored. This directory also stores the trained model with its related outputs.
  • --train data/pc/learning_datasets/training.ttl is the file containing the training facts.
  • --valid data/pc/learning_datasets/valid.ttl is the file containing the validation facts.
  • --test data/pc/learning_datasets/test.ttl is the file containing the test facts.
  • --score pc is the subdirectory in which the scores resulting from the training and the evaluation process will be stored.
  • --parser PC is the parameter to drive the construction of the dictionaries of entities and relationships.
  • --gpu 0 is the parameter to establish how many GPUs (if available) can be used to train the model.
  • --graph-batch-size 1000 is a parameter to indicate the number of edges extracted at each step with the graph sampling process.
  • --n-hidden 100 is a hyperparameter of the model that defines the number of neurons (and consequently the dimension of the embeddings) at each network layer.
  • --graph-split-size 1 is a parameter to establish the portion of edges used as positive examples.
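To make the role of the embeddings more concrete, the sketch below scores a candidate edge from the vectors of its subject, property, and object. It assumes a DistMult-style scoring function, which is one common decoder for link prediction; this README does not confirm that it is the decoder used by link_predict.py. The function names and the two-dimensional toy vectors are hypothetical (with --n-hidden 100, each vector would have 100 components).

```python
import math


def distmult_score(subject_vec, relation_vec, object_vec):
    """DistMult-style score: sum of element-wise products of the three vectors."""
    return sum(s * r * o for s, r, o in zip(subject_vec, relation_vec, object_vec))


def edge_probability(subject_vec, relation_vec, object_vec):
    """Squash the raw score into (0, 1) with a logistic function."""
    return 1.0 / (1.0 + math.exp(-distmult_score(subject_vec, relation_vec, object_vec)))


def best_relation(subject_vec, object_vec, relations):
    """Return the relation name with the highest score for the given entity pair."""
    return max(
        relations,
        key=lambda name: distmult_score(subject_vec, relations[name], object_vec),
    )


# Toy embeddings, chosen by hand for readability.
contract = [1.0, 2.0]
business_entity = [3.0, 4.0]
relations = {
    "pc:contractingAuthority": [0.0, 1.0],
    "ex:otherRelation": [1.0, 0.0],  # hypothetical property
}

scores = {
    name: distmult_score(contract, vec, business_entity)
    for name, vec in relations.items()
}
print(scores, best_relation(contract, business_entity, relations))
```

A higher score means the model considers the edge more plausible given the background knowledge.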

The outputs of the training stage are the following:

Weights Refinement of the MEWG

The goal of this stage is to refine the edge weights of the MEWG by exploiting the embeddings obtained from the graph neural network training. In this way, we incorporate information from the background knowledge to improve the accuracy of the semantic model.

The first step is to produce the JARQL representation of the MEWG:

$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_plausible

Then, you can proceed with the refinement process with the following command:

$ node run/refinement.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/model_datasets/scores/pc/6000/score.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4
  • data/pc/semantic_types/Z4ADEA9DE4_st.json is the semantic type file.
  • data/pc/model_datasets/scores/pc/6000/score.json is the score file generated during training at epoch 6000.
  • data/pc/semantic_models/Z4ADEA9DE4_steiner.json is the beautified version of the initial semantic model file generated through the steiner tree algorithm.
  • data/pc/semantic_models/Z4ADEA9DE4_graph.json is the beautified version of weighted graph file including all plausible semantic models.

This script generates two different outputs:

The image below represents the refined semantic model.

Refined Semantic Model

JARQL Serialization of the Refined Semantic Model

For the generation of the refined semantic model serialized in JARQL, you need to run the following command:

$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_refined

This script generates as output the following file:

  • data/pc/semantic_models/Z4ADEA9DE4_refined.query is the JARQL serialization of the refined semantic model.

KG Generation from the Refined Semantic Model

In order to create the KG resulting from the refined semantic model, you have to run the JARQL tool with the following command:

$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4_refined.query > data/pc/output/Z4ADEA9DE4_refined.ttl