mkb is a library dedicated to knowledge graph embeddings. It provides modular tools built on PyTorch.
- Table of contents
  - Installation
  - Quickstart
  - Datasets
  - Models
  - Negative sampling
  - Train your model
  - Evaluation
  - Get embeddings
  - Transformers
  - Development
  - See also
  - License
You should be able to install and use this library with any Python version above 3.6.
pip install git+https://github.com/raphaelsty/mkb
Load or initialize your dataset as a list of triplets:
train = [
    ('🦆', 'is a', 'bird'),
    ('🦅', 'is a', 'bird'),
    ('🦆', 'lives in', '🌳'),
    ('🦉', 'lives in', '🌳'),
    ('🦅', 'lives in', '🏔'),
    ('🦉', 'hability', 'fly'),
    ('🦅', 'hability', 'fly'),
    ('🐌', 'is a', 'mollusc'),
    ('🐜', 'is a', 'insect'),
    ('🐝', 'is a', 'insect'),
    ('🐌', 'lives in', '🌳'),
    ('🐝', 'lives in', '🌳'),
    ('🐝', 'hability', 'fly'),
    ('🐻', 'is a', 'mammal'),
    ('🐶', 'is a', 'mammal'),
    ('🐨', 'is a', 'mammal'),
    ('🐻', 'lives in', '🏔'),
    ('🐶', 'lives in', '🏠'),
    ('🐱', 'lives in', '🏠'),
    ('🐨', 'lives in', '🌳'),
    ('🐬', 'lives in', '🌊'),
    ('🐳', 'lives in', '🌊'),
    ('🐋', 'is a', 'marine mammal'),
    ('🐳', 'is a', 'marine mammal'),
]
valid = [
    ('🦆', 'hability', 'fly'),
    ('🐱', 'is a', 'mammal'),
    ('🐜', 'lives in', '🌳'),
    ('🐬', 'is a', 'marine mammal'),
    ('🐋', 'lives in', '🌊'),
    ('🦉', 'is a', 'bird'),
]
Train a model with a pipeline to learn coherent embeddings for every entity and relation in your dataset:
from mkb import datasets
from mkb import models
from mkb import losses
from mkb import sampling
from mkb import evaluation
from mkb import compose
import torch
_ = torch.manual_seed(42)
# Set device = 'cuda' if you own a gpu.
device = 'cpu'
dataset = datasets.Dataset(
train = train,
valid = valid,
batch_size = 24,
)
model = models.RotatE(
entities = dataset.entities,
relations = dataset.relations,
gamma = 3,
hidden_dim = 200,
)
model = model.to(device)
optimizer = torch.optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr = 0.003,
)
negative_sampling = sampling.NegativeSampling(
size = 24,
train_triples = dataset.train,
entities = dataset.entities,
relations = dataset.relations,
seed = 42,
)
validation = evaluation.Evaluation(
true_triples = dataset.true_triples,
entities = dataset.entities,
relations = dataset.relations,
batch_size = 8,
device = device,
)
pipeline = compose.Pipeline(
epochs = 100,
eval_every = 50,
early_stopping_rounds = 3,
device = device,
)
pipeline = pipeline.learn(
model = model,
dataset = dataset,
evaluation = validation,
sampling = negative_sampling,
optimizer = optimizer,
loss = losses.Adversarial(alpha=1)
)
Plot embeddings:
from sklearn import manifold
from sklearn import cluster
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
emojis_tokens = {
    '🦆': 'duck',
    '🦅': 'eagle',
    '🦉': 'owl',
    '🐌': 'snail',
    '🐜': 'ant',
    '🐝': 'bee',
    '🐻': 'bear',
    '🐶': 'dog',
    '🐨': 'koala',
    '🐱': 'cat',
    '🐬': 'dolphin',
    '🐳': 'whale',
    '🐋': 'humpback whale',
}
embeddings = pd.DataFrame(model.embeddings['entities']).T.reset_index()
embeddings = embeddings[embeddings['index'].isin(emojis_tokens)].set_index('index')
tsne = manifold.TSNE(n_components = 2, random_state = 42, n_iter=1500, perplexity=3, early_exaggeration=100)
kmeans = cluster.KMeans(n_clusters = 5, random_state=42)
X = tsne.fit_transform(embeddings)
X = pd.DataFrame(X, columns = ['dim_1', 'dim_2'])
X['cluster'] = kmeans.fit_predict(X)
%config InlineBackend.figure_format = 'retina'
fgrid = sns.lmplot(
data = X,
x = 'dim_1',
y = 'dim_2',
hue = 'cluster',
fit_reg = False,
legend = False,
legend_out = False,
height = 7,
aspect = 1.6,
scatter_kws={"s": 500}
)
ax = fgrid.axes[0,0]
ax.set_ylabel('')
ax.set_xlabel('')
ax.set(xticklabels=[])
ax.set(yticklabels=[])
for i, label in enumerate(embeddings.index):
    ax.text(
        X['dim_1'][i] + 1,
        X['dim_2'][i],
        emojis_tokens[label],
        horizontalalignment = 'left',
        size = 'medium',
        color = 'black',
        weight = 'semibold',
    )
plt.show()
Datasets available:
datasets.CountriesS1
datasets.CountriesS2
datasets.CountriesS3
datasets.Fb13
datasets.Fb15k
datasets.Fb15k237
datasets.InferWiki16k
datasets.InferWiki64k
datasets.Kinship
datasets.Nations
datasets.Nell995
datasets.Umls
datasets.Wn11
datasets.Wn18
datasets.Wn18rr
datasets.Yago310
Load an existing dataset:
from mkb import datasets
dataset = datasets.Wn18rr(batch_size=256)
dataset
Wn18rr dataset
Batch size 256
Entities 40923
Relations 11
Shuffle True
Train triples 86834
Validation triples 3033
Test triples 3134
Or create your own dataset:
from mkb import datasets
train = [
    ('🦆', 'is a', 'bird'),
    ('🦅', 'is a', 'bird'),
    ('🦉', 'hability', 'fly'),
    ('🦅', 'hability', 'fly')
]
valid = [
    ('🦉', 'is a', 'bird')
]
test = [
    ('🦆', 'hability', 'fly')
]
dataset = datasets.Dataset(
train = train,
valid = valid,
test = test,
batch_size = 3,
seed = 42,
)
dataset
Dataset dataset
Batch size 3
Entities 5
Relations 2
Shuffle True
Train triples 4
Validation triples 1
Test triples 1
Knowledge graph models build latent representations of nodes (entities) and relationships in the graph. These models implement an optimization process to represent the entities and relations in a consistent space.
Models available:
models.TransE
models.DistMult
models.RotatE
models.pRotatE
models.ComplEx
models.SentenceTransformer
models.Transformer
Initialize a model:
from mkb import models
model = models.RotatE(
entities = dataset.entities,
relations = dataset.relations,
gamma = 6,
hidden_dim = 500
)
model
RotatE model
Entities embeddings dim 1000
Relations embeddings dim 500
Gamma 6.0
Number of entities 40923
Number of relations 11
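To give an intuition of what a model such as RotatE optimizes, here is a minimal sketch of a RotatE-style score in pure Python. It follows the formula of the RotatE paper (each relation rotates the head in the complex plane, and the score is gamma minus the distance to the tail); it is not mkb's implementation. The function name `rotate_score` and the hand-picked embeddings are assumptions made for illustration. Storing a real and an imaginary part per dimension is likely why the entity embeddings above are twice as wide as `hidden_dim`.

```python
import cmath
import math


def rotate_score(head, relation_phases, tail, gamma):
    """RotatE-style score: gamma minus the distance between the rotated head and the tail."""
    distance = 0.0
    for h, phase, t in zip(head, relation_phases, tail):
        rotation = cmath.exp(1j * phase)  # complex number of modulus 1
        distance += abs(h * rotation - t)
    return gamma - distance


# Hypothetical 2-dimensional complex embeddings, chosen by hand.
head = [1 + 0j, 0 + 1j]
phases = [math.pi / 2, math.pi]  # rotate by 90 and 180 degrees
true_tail = [0 + 1j, 0 - 1j]     # exactly the rotated head
wrong_tail = [1 + 0j, 1 + 0j]

score_true = rotate_score(head, phases, true_tail, gamma=3.0)
score_wrong = rotate_score(head, phases, wrong_tail, gamma=3.0)
print(score_true, score_wrong)
```

A plausible triplet gets a score close to gamma, while an implausible one gets a lower score.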
Set the learning rate of the model:
import torch
learning_rate = 0.00005
optimizer = torch.optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr = learning_rate,
)
Knowledge graph embedding models learn to distinguish existing triplets from generated triplets. The sampling module generates triplets that do not exist in the dataset.
from mkb import sampling
negative_sampling = sampling.NegativeSampling(
size = 256,
train_triples = dataset.train,
entities = dataset.entities,
relations = dataset.relations,
seed = 42,
)
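The idea behind negative sampling can be sketched in a few lines of pure Python: replace the head or the tail of a true triplet with a random entity and discard candidates that are known to be true. This is only an illustration of the principle, not the code of `sampling.NegativeSampling`; the function `corrupt` and the mode names `head-batch` and `tail-batch` are assumptions.

```python
import random


def corrupt(triple, entities, known_triples, mode, size, rng):
    """Draw `size` corrupted triplets by replacing the head or the tail.

    Loops forever if no valid candidate exists, so keep `size` small
    relative to the number of entities.
    """
    head, relation, tail = triple
    negatives = []
    while len(negatives) < size:
        candidate = rng.choice(entities)
        if mode == "head-batch":
            fake = (candidate, relation, tail)
        else:
            fake = (head, relation, candidate)
        if fake not in known_triples:  # skip triplets that are actually true
            negatives.append(fake)
    return negatives


known = {
    ("duck", "is a", "bird"),
    ("eagle", "is a", "bird"),
    ("bear", "is a", "mammal"),
}
entities = ["duck", "eagle", "bear", "bird", "mammal"]

negatives = corrupt(
    triple=("duck", "is a", "bird"),
    entities=entities,
    known_triples=known,
    mode="tail-batch",
    size=3,
    rng=random.Random(42),
)
print(negatives)
```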
You can train your model using a pipeline:
from mkb import compose
from mkb import losses
from mkb import evaluation
validation = evaluation.Evaluation(
true_triples = dataset.true_triples,
entities = dataset.entities,
relations = dataset.relations,
batch_size = 8,
device = device,
)
pipeline = compose.Pipeline(
epochs = 100,
eval_every = 50,
early_stopping_rounds = 3,
device = device,
)
pipeline = pipeline.learn(
model = model,
dataset = dataset,
evaluation = validation,
sampling = negative_sampling,
optimizer = optimizer,
loss = losses.Adversarial(alpha=1)
)
You can also train your model with a lower level of abstraction:
from mkb import losses
from mkb import evaluation
validation = evaluation.Evaluation(
true_triples = dataset.true_triples,
entities = dataset.entities,
relations = dataset.relations,
batch_size = 8,
device = device,
)
loss = losses.Adversarial(alpha=0.5)
for epoch in range(2000):
    for data in dataset:
        sample = data['sample'].to(device)
        weight = data['weight'].to(device)
        mode = data['mode']
        negative_sample = negative_sampling.generate(sample=sample, mode=mode)
        negative_sample = negative_sample.to(device)
        positive_score = model(sample)
        negative_score = model(
            sample=sample,
            negative_sample=negative_sample,
            mode=mode
        )
        error = loss(positive_score, negative_score, weight)
        error.backward()
        _ = optimizer.step()
        optimizer.zero_grad()
# Evaluate on the validation set once training is done.
validation_scores = validation.eval(dataset=dataset.valid, model=model)
print(validation_scores)
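The loop above combines the positive and negative scores with `losses.Adversarial`. The README does not describe this loss; assuming it follows the self-adversarial negative sampling loss of the RotatE paper, with `alpha` acting as a temperature, a pure-Python sketch for a single positive triplet could look like this. The function names are hypothetical and the averaging by 2 is a convention borrowed from the reference RotatE code.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def self_adversarial_loss(positive_score, negative_scores, alpha):
    """Self-adversarial loss for one positive triplet and its negatives."""
    # Softmax over alpha * score: negatives with a high score get more weight.
    shift = max(alpha * s for s in negative_scores)
    exps = [math.exp(alpha * s - shift) for s in negative_scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # treated as constants during training
    positive_term = -log_sigmoid(positive_score)
    negative_term = -sum(w * log_sigmoid(-s) for w, s in zip(weights, negative_scores))
    return (positive_term + negative_term) / 2


well_separated = self_adversarial_loss(5.0, [-5.0, -4.0], alpha=1.0)
badly_separated = self_adversarial_loss(-5.0, [5.0, 4.0], alpha=1.0)
print(well_separated, badly_separated)
```

With `alpha = 0` the weights are uniform and the loss reduces to plain negative sampling.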
You can evaluate the performance of your models with the evaluation module.
from mkb import evaluation
validation = evaluation.Evaluation(
true_triples = dataset.true_triples,
entities = dataset.entities,
relations = dataset.relations,
batch_size = 8,
device = device,
)
The link prediction task aims at finding the most likely head or tail for an incomplete triplet. For example, the model should retrieve the entity United States for the triplet ('Barack Obama', 'president_of', ?).
Validate the model on the validation set:
validation.eval(model = model, dataset = dataset.valid)
{'MRR': 0.5833, 'MR': 400.0, 'HITS@1': 20.25, 'HITS@3': 30.0, 'HITS@10': 40.0}
Validate the model on the test set:
validation.eval(model = model, dataset = dataset.test)
{'MRR': 0.5833, 'MR': 600.0, 'HITS@1': 21.35, 'HITS@3': 38.0, 'HITS@10': 41.0}
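The metrics reported above are all derived from the rank of the correct entity among the candidates. As a minimal sketch, assuming ranks start at 1 and HITS@k is reported as a fraction between 0 and 1 (mkb's own scaling may differ), they can be computed as follows. The function `ranking_metrics` and the list of ranks are made up for illustration.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MRR, MR and HITS@k from the ranks of the correct entities."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / rank for rank in ranks) / n,  # mean reciprocal rank
        "MR": sum(ranks) / n,                           # mean rank
    }
    for k in ks:
        # Share of queries whose correct entity is ranked in the top k.
        metrics[f"HITS@{k}"] = sum(rank <= k for rank in ranks) / n
    return metrics


# Hypothetical ranks of the correct entity for four test triplets.
ranks = [1, 2, 4, 20]
metrics = ranking_metrics(ranks)
print(metrics)
```

A higher MRR and HITS@k and a lower MR indicate a better model.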
You can get a more detailed evaluation of the link prediction task and measure the performance of the model according to the type of relationship.
validation.detail_eval(model=model, dataset=dataset.test, threshold=1.5)
                  head                                   tail
relation  MRR  MR   HITS@1  HITS@3  HITS@10   MRR     MR   HITS@1  HITS@3  HITS@10
1_1       0.5  2.0  0.0     1.0     1.0       0.3333  3.0  0.0     1.0     1.0
1_M       1.0  1.0  1.0     1.0     1.0       0.5000  2.0  0.0     1.0     1.0
M_1       0.0  0.0  0.0     0.0     0.0       0.0000  0.0  0.0     0.0     0.0
M_M       0.0  0.0  0.0     0.0     0.0       0.0000  0.0  0.0     0.0     0.0
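The relation types 1_1, 1_M, M_1 and M_M are commonly derived from the average number of tails per head and heads per tail, with 1.5 as the usual cut-off. The sketch below shows this convention in pure Python; whether `detail_eval` uses its `threshold` parameter exactly this way is an assumption, and `relation_types` is a hypothetical helper.

```python
from collections import defaultdict


def relation_types(triples, threshold=1.5):
    """Label each relation as 1_1, 1_M, M_1 or M_M."""
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    heads_of = defaultdict(set)  # (relation, tail) -> set of heads
    for head, relation, tail in triples:
        tails_of[(relation, head)].add(tail)
        heads_of[(relation, tail)].add(head)

    types = {}
    for relation in {r for _, r, _ in triples}:
        tails_counts = [len(t) for (r, _), t in tails_of.items() if r == relation]
        heads_counts = [len(h) for (r, _), h in heads_of.items() if r == relation]
        tails_per_head = sum(tails_counts) / len(tails_counts)
        heads_per_tail = sum(heads_counts) / len(heads_counts)
        left = "1" if heads_per_tail < threshold else "M"
        right = "1" if tails_per_head < threshold else "M"
        types[relation] = f"{left}_{right}"
    return types


triples = [
    ("duck", "is a", "bird"),
    ("eagle", "is a", "bird"),
    ("owl", "is a", "bird"),
    ("france", "capital", "paris"),
    ("spain", "capital", "madrid"),
    ("ann", "parent of", "bob"),
    ("ann", "parent of", "cid"),
]
types = relation_types(triples)
print(types)
```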
The task of relation prediction is to find the most likely relation for a given tuple (head, tail).
validation.eval_relations(model=model, dataset=dataset.test)
{'MRR_relations': 1.0, 'MR_relations': 1.0, 'HITS@1_relations': 1.0, 'HITS@3_relations': 1.0, 'HITS@10_relations': 1.0}
The triplet classification task is designed to predict whether a triplet exists. It is available for every dataset in mkb except the Countries datasets.
from mkb import evaluation
evaluation.find_threshold(
model = model,
X = dataset.classification_valid['X'],
y = dataset.classification_valid['y'],
batch_size = 10,
)
Best threshold found from triplet classification valid set and associated accuracy:
{'threshold': 1.924787, 'accuracy': 0.803803}
evaluation.accuracy(
model = model,
X = dataset.classification_test['X'],
y = dataset.classification_test['y'],
threshold = 1.924787,
batch_size = 10,
)
Accuracy of the model on the triplet classification test set:
0.793803
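The threshold search can be pictured as follows: score every validation triplet, then keep the cut-off that best separates true triplets from false ones. This is a minimal sketch, assuming that a higher score means a more plausible triplet; it is not the code of `evaluation.find_threshold`, and `best_threshold` and the sample scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Return the score threshold that maximizes classification accuracy."""
    best = {"threshold": None, "accuracy": -1.0}
    for threshold in sorted(set(scores)):
        # A triplet is predicted as true when its score reaches the threshold.
        predictions = [int(score >= threshold) for score in scores]
        correct = sum(p == y for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        if accuracy > best["accuracy"]:
            best = {"threshold": threshold, "accuracy": accuracy}
    return best


# Hypothetical model scores and labels (1 = true triplet, 0 = false triplet).
scores = [2.5, 1.9, 0.4, -0.3, 2.1, 0.9]
labels = [1, 1, 0, 0, 1, 1]
result = best_threshold(scores, labels)
print(result)
```

The threshold found on the validation set is then reused unchanged on the test set, as in the calls above.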
You can extract the embeddings of entities and relations computed by the model with the model.embeddings property.
model.embeddings['entities']
{'hello': tensor([ 0.7645, 0.8300, -0.2343]), 'world': tensor([ 0.9186, -0.2191, 0.2018])}
model.embeddings['relations']
{'lorem': tensor([-0.4869, 0.5873, 0.8815]), 'ipsum': tensor([-0.7336, 0.8692, 0.1872])}
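Once extracted, the embeddings can be compared with any similarity measure. As a small illustration, here is a cosine similarity computed on the entity vectors shown above, written as plain lists (a torch tensor can be converted with `.tolist()`). The helper `cosine_similarity` is not part of mkb.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Values copied from the example output above.
entities = {
    "hello": [0.7645, 0.8300, -0.2343],
    "world": [0.9186, -0.2191, 0.2018],
}
similarity = cosine_similarity(entities["hello"], entities["world"])
print(similarity)
```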
MKB provides an implementation of the paper Inductive Entity Representations from Text via Link Prediction. It lets you train transformers to build embeddings of the entities of a knowledge graph under the link prediction objective. After fine-tuning the transformer on the link prediction task, we can use it to build an entity search engine. It can also perform tasks related to knowledge graph completion. Finally, we can use it for any downstream task such as classification.
Using a transformer instead of embeddings has many advantages, such as constructing contextualized latent representations of entities. In addition, this model can encode entities it has never seen by using their textual descriptions. Training takes much longer than for a classical TransE model, but the model converges in fewer epochs.
MKB provides two classes for fine-tuning Sentence Transformers and vanilla Transformers.
- models.SentenceTransformer: Dedicated to Sentence Transformer models.
- models.Transformer: Dedicated to traditional Transformer models.
Under the hood, the Transformer model is trained using entity labels. Therefore, it is important to provide relevant entity labels. We initialize an embedding matrix dedicated to relationships. The negative samples are generated following the in-batch strategy.
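The in-batch strategy mentioned above can be sketched as follows: instead of drawing random entities, each triplet reuses the tails of the other triplets in the same batch as negatives. This is an illustration of the general idea only; the exact strategy used by mkb (for example whether heads are corrupted too, or whether known true triplets are filtered out) is not described here, and `in_batch_negatives` is a hypothetical helper.

```python
def in_batch_negatives(batch):
    """For each triplet, build negatives from the tails of the other triplets."""
    negatives = []
    for i, (head, relation, tail) in enumerate(batch):
        other_tails = [
            t for j, (_, _, t) in enumerate(batch) if j != i and t != tail
        ]
        negatives.append([(head, relation, t) for t in other_tails])
    return negatives


batch = [
    ("jaguar", "cousin", "cat"),
    ("dog", "cousin", "wolf"),
    ("wolf", "angry_against", "jaguar"),
]
negatives = in_batch_negatives(batch)
for triple, fakes in zip(batch, negatives):
    print(triple, "->", fakes)
```

The appeal of this approach is that the entity representations of the batch are already computed by the transformer, so negatives come at almost no extra encoding cost.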
Here is how to fine-tune a sentence transformer under the link prediction objective:
from mkb import losses, evaluation, datasets, text, models
from transformers import AutoTokenizer, AutoModel
import torch
_ = torch.manual_seed(42)
train = [
("jaguar", "cousin", "cat"),
("tiger", "cousin", "cat"),
("dog", "cousin", "wolf"),
("dog", "angry_against", "cat"),
("wolf", "angry_against", "jaguar"),
]
valid = [
("cat", "cousin", "jaguar"),
("cat", "cousin", "tiger"),
("dog", "angry_against", "tiger"),
]
test = [
("wolf", "angry_against", "tiger"),
("wolf", "angry_against", "cat"),
]
dataset = datasets.Dataset(
batch_size = 5,
train = train,
valid = valid,
test = test,
seed = 42,
shuffle=True,
)
device = "cpu"
model = models.SentenceTransformer(
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2"),
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2"),
entities = dataset.entities,
relations = dataset.relations,
gamma = 9,
device = device,
)
model = model.to(device)
optimizer = torch.optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr = 0.000005,
)
# Link prediction evaluation for Transformers
evaluation = evaluation.TransformerEvaluation(
entities = dataset.entities,
relations = dataset.relations,
true_triples = dataset.train + dataset.valid + dataset.test,
batch_size = 2,
device = device,
)
model = text.learn(
model = model,
dataset = dataset,
evaluation = evaluation,
optimizer = optimizer,
loss = losses.Adversarial(alpha=0.5),
negative_sampling_size = 5,
epochs = 1,
eval_every = 5,
early_stopping_rounds = 3,
device = device,
)
# Saving the Sentence Transformer model:
import json
model.model.save_pretrained("model")
model.tokenizer.save_pretrained("model")
# Export the relation embeddings to JSON.
relations = {}
for id_relation, label in model.relations.items():
    relations[label] = model.relation_embedding[id_relation].cpu().detach().tolist()
with open("relations.json", "w") as f:
    json.dump(relations, f, indent=4)
After training a Sentence Transformer on the link prediction task using MKB and saving the model, we can load the trained model using the sentence_transformers library.
from sentence_transformers import SentenceTransformer
import json
import numpy as np
# Entity encoder
model = SentenceTransformer("model", device="cpu")
# Relations embeddings
with open("relations.json", "r") as f:
    relations = json.load(f)
Here is how to fine-tune a Transformer under the link prediction objective:
from mkb import losses, evaluation, datasets, text, models
from transformers import AutoTokenizer, AutoModel
model = models.Transformer(
model = AutoModel.from_pretrained("bert-base-uncased"),
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased"),
entities = dataset.entities,
relations = dataset.relations,
gamma = 9,
device = device,
)
# Download and navigate to the source code
$ git clone https://github.com/raphaelsty/mkb
$ cd mkb
# Create a virtual environment
$ python3 -m venv env
$ source env/bin/activate
# Install
$ python setup.py install
# Run tests
$ python -m pytest
Knowledge Base Embedding By Cooperative Knowledge Distillation
@inproceedings{sourty-etal-2020-knowledge,
title = "Knowledge Base Embedding By Cooperative Knowledge Distillation",
author = {Sourty, Rapha{\"e}l and
Moreno, Jose G. and
Servant, Fran{\c{c}}ois-Paul and
Tamine-Lechani, Lynda},
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.489",
pages = "5579--5590",
}
Many tools and libraries for building knowledge graph embeddings are available on GitHub. These libraries are very complete and provide powerful implementations of knowledge graph embedding algorithms.
From a user's point of view, I find that most of these libraries lack modularity. That is why I created this tool. Mkb addresses this modularity problem and integrates easily into a machine learning pipeline.
- DGL-KE: High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
- OpenKE: An Open-source Framework for Knowledge Embedding implemented with PyTorch.
- GraphVite: GraphVite is a general graph embedding engine, dedicated to high-speed and large-scale embedding learning in various applications.
- LibKGE: LibKGE is a PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE).
- TorchKGE: Knowledge Graph embedding in Python and PyTorch.
- KnowledgeGraphEmbedding: RotatE, Knowledge Graph Embedding by Relational Rotation in Complex Space
This project is free and open-source software licensed under the MIT license.