A Skosmos vocabulary loader and container setup used by the HuC.
This repository contains two docker-compose files: one for local testing (`docker-compose.yml`) and one as an example Portainer-based setup (`docker-compose-portainer-template.yml`).
The `docker-compose.yml` file contains configuration options for both GraphDB and Fuseki. Pick one, and comment the unneeded bits out. See the configuration details in this README to find out what to use when.
For the Portainer setup, it's recommended to create your own Docker Compose file based on `docker-compose-portainer-template.yml`. The example setup uses Traefik, but other setups are of course possible as well. Use the example file together with the section below on containers and their required environment variables to create your own production Compose file, so that it matches your needs.
This Skosmos setup requires two containers to run (plus a triple store, if that is not hosted externally).
The `sd-skosmos` container is responsible for serving the actual Skosmos application. To make GraphDB support work, it is based on our fork of Skosmos instead of the plain version; other than GraphDB support, this doesn't change anything.
Env var | Description |
---|---|
`SPARQL_ENDPOINT` | The SPARQL endpoint Skosmos uses to connect to the triple store. Only requires read access. |
`DATA` | The location of the `data` folder containing the vocabulary configuration files. |
The `sd-skosmos-loader` container is responsible for configuring and importing the used vocabularies based on the configuration files in the `data` directory. It runs side by side with the Skosmos container and uses cron to run the import script that updates vocabularies from external sources.
Env var | Description |
---|---|
`SPARQL_ENDPOINT` | The SPARQL endpoint Skosmos uses to connect. Should be the same as for the `sd-skosmos` container. |
`DATABASE_TYPE` | `graphdb` or `fuseki` |
`STORE_BASE` | The base endpoint of the triple store. Depends on the type: for GraphDB this is the same as the SPARQL endpoint; for Fuseki, include the repository name (e.g. `http://fuseki-domain:3030/skosmos`). |
`ADMIN_USERNAME` | The admin username for the triple store (write access is required). |
`ADMIN_PASSWORD` | The admin password for the triple store. |
`DATA` | The location of the `data` folder containing the vocabulary configuration files. |
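For reference, here is a minimal sketch of how these variables might be wired up in a Compose file. The service names match the containers described above, but the build references, volumes, and values are illustrative assumptions; adapt them to your own setup or start from `docker-compose-portainer-template.yml`.

```yaml
# Illustrative sketch only; build references, volumes, and values are assumptions.
services:
  sd-skosmos:
    build: .                 # assumed build context; use your registry image in production
    environment:
      SPARQL_ENDPOINT: http://fuseki:3030/skosmos/sparql
      DATA: /data
    volumes:
      - ./data:/data

  sd-skosmos-loader:
    build: ./loader          # assumed build context for the loader
    environment:
      SPARQL_ENDPOINT: http://fuseki:3030/skosmos/sparql
      DATABASE_TYPE: fuseki
      STORE_BASE: http://fuseki:3030/skosmos
      ADMIN_USERNAME: admin-username
      ADMIN_PASSWORD: super-secret-password
      DATA: /data
    volumes:
      - ./data:/data
```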
To build and run the local setup:

```sh
docker-compose build
docker-compose up -d
docker-compose down
```
To overwrite `config-docker-compose.ttl`, put a `config.ttl` in `data/`. To add classifications, put them in a `config-ext.ttl` in `data/`.
In order to add a vocabulary you will need three files. The first one is `{vocabulary-name}.yaml`, which needs to be put in the `data/` directory. The `{vocabulary-name}.config` and `{vocabulary-name}.ttl` files can be in there as well, but they can also be loaded from an external source. See the example YAML files below to see how to configure them.
The YAML file is used for configuring how to load the vocabulary, and whether it needs to be refreshed at a certain interval. Refreshing is mainly useful when the vocabulary is loaded from an external source.
When importing vocabularies, keep in mind that you can import any file type supported by the database you use. Some file types are treated differently by different databases. An example is TriG: GraphDB puts all triples and quads from the file into the graph we tell it to use (the `skosmos:sparqlGraph` from the `{vocab-name}.config` file), while Fuseki only imports the triples that are in the default graph of the TriG file; any quads in the file will be ignored. Please check the behavior of your chosen database to determine which format you wish to use.
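As a hypothetical illustration of that difference (the concept IRIs and graph name are made up), consider a TriG file like this: GraphDB would import both statements into the configured graph, while Fuseki would only pick up the `skos:prefLabel` triple from the default graph.

```trig
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Triple in the default graph: imported by both GraphDB and Fuseki.
<http://example.org/concept/1> skos:prefLabel "Example concept"@en .

# Quad in a named graph: GraphDB puts it into the configured graph,
# Fuseki ignores it when loading this file.
<http://example.org/graph/extra> {
    <http://example.org/concept/1> skos:altLabel "Alternative label"@en .
}
```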
For a vocabulary whose config and data are local files in `data/`:

```yaml
config:
  type: file
  location: my-vocabulary.config
source:
  type: file
  location: my-vocabulary.ttl
```
For a vocabulary fetched from an external URL, with an hourly refresh:

```yaml
config:
  refresh: Yes
  refreshInterval: 1 # Hours
  type: fetch
  location: https://example.com/path/to/vocabulary.config
source:
  type: fetch
  location: https://example.com/path/to/vocabulary.ttl
  headers:
    X-Custom-Header: Set header values here
```
Note that both `config` and `source` support the same configuration options, except that `refresh` and `refreshInterval` can only be set for `config` (they affect the entire vocabulary).
For these fetch (GET) requests, there are two special cases where additional authentication settings can be used: GitHub and GitLab, for when you need to access a file in a private repository.
```yaml
# Auth Settings Example
config:
  type: fetch
  location: https://your-gitlab-domain.com/organisation/repository/-/blob/main/vocabulary.config
  auth:
    type: gitlab
    token: your-access-token
source:
  type: fetch
  location: https://raw.githubusercontent.com/organisation/repository/main/vocabulary.trig
  auth:
    type: github
    token: your-access-token
```
The GitLab example will use the file location in the repository to construct a call to the GitLab API instead, as that is the only way to access private repositories. For GitHub, it is just a shortcut for setting the correct headers (make sure to use the path to the 'raw' file in this case).
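Roughly speaking, the two auth types correspond to requests like the ones below (a sketch of the general idea; the exact requests the loader constructs may differ):

```sh
# GitLab: fetch the raw file through the GitLab repository files API,
# authenticating with an access token.
curl --header "PRIVATE-TOKEN: your-access-token" \
  "https://your-gitlab-domain.com/api/v4/projects/organisation%2Frepository/repository/files/vocabulary.config/raw?ref=main"

# GitHub: fetch the raw file directly, passing the token as a header.
curl --header "Authorization: token your-access-token" \
  "https://raw.githubusercontent.com/organisation/repository/main/vocabulary.trig"
```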
You can also load vocabularies from a SPARQL endpoint. For this, you will need an additional file containing the SPARQL query itself, which needs to be in the `data/` directory. For the example, assume we have the query saved in `data/vocabulary.sparql`.
```yaml
config:
  refresh: Yes
  refreshInterval: 1 # Hours
  type: file
  location: vocabulary.config
source:
  type: sparql
  location: https://example.com/sparql-endpoint
  query_location: vocabulary.sparql
  format: ttl # Used to specify the file extension when it's not clear from the URL
```
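The contents of `vocabulary.sparql` are up to you. As a hypothetical example, a CONSTRUCT query that pulls all triples about SKOS concepts from the remote endpoint could look like this:

```sparql
# Hypothetical example query; adjust the graph pattern to your source data.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
  ?concept ?p ?o .
}
WHERE {
  ?concept a skos:Concept ;
           ?p ?o .
}
```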
You can also retrieve the vocabulary by POSTing to an endpoint, using a JSON body.
```yaml
config:
  refresh: Yes
  refreshInterval: 1 # Hours
  type: file
  location: vocabulary.config
source:
  type: post
  location: https://example.com/some-export-endpoint
  format: trig # Used to specify the file extension when it's not clear from the URL
  body: {
    "config-a": "some value",
    "config-b": "some other value"
  }
  headers:
    Authorization: Bearer your-token-here
```
When you wish to append data to an externally loaded dataset without modifying the original data, for instance to add a label to a property so it displays in Skosmos, you can add a `tweaks` entry to the YAML file. This supports the same types of files as the `source`, and any triples in this location will be appended to the source dataset after it is loaded.
```yaml
config:
  refresh: Yes
  refreshInterval: 1 # Hours
  type: fetch
  location: https://example.com/path/to/vocabulary.config
source:
  type: fetch
  location: https://example.com/path/to/vocabulary.ttl
  headers:
    X-Custom-Header: Set header values here
tweaks:
  type: file
  location: vocabulary-tweaks.ttl
```
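For example, a hypothetical `vocabulary-tweaks.ttl` that adds a label to a property from the external dataset (so Skosmos can display it) might look like this; the property IRI is made up:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical property from the external vocabulary that lacks a label.
<http://example.org/ontology/hasRegion> rdfs:label "Region"@en .
```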
It is possible to set up a fallback data file or mirror URL for when the primary source of the data is not available. When the primary source fails to load (when using an external source, like a fetch or SPARQL query), this dataset will be loaded instead. An example configuration can be seen below; it adds a local file for when the external fetch fails:
```yaml
config:
  refresh: Yes
  refreshInterval: 1 # Hours
  type: fetch
  location: https://example.com/path/to/vocabulary.config
source:
  type: fetch
  location: https://example.com/path/to/vocabulary.ttl
  headers:
    X-Custom-Header: Set header values here
fallback:
  type: file
  location: vocabulary.ttl
```
The `fallback` configuration supports all types you can use for `source`.
By default, this Skosmos setup supports two triple store types: Fuseki and GraphDB. However, it is possible to extend this by adding a custom database connector to the `src/database_connectors` folder (see below). The type of database can be set using the `DATABASE_TYPE` environment variable of the `sd-skosmos-loader` container, adjusting the `SPARQL_ENDPOINT` of both the `sd-skosmos` and `sd-skosmos-loader` containers accordingly (see the container configuration options above). Some database types may require additional configuration. The two options supported out of the box are explained below.
To use GraphDB, set the following environment variable for the `sd-skosmos` container:

```
SPARQL_ENDPOINT=http://graphdb-location:7200/repositories/skosmos
```
And these for the `sd-skosmos-loader`:

```
DATABASE_TYPE=graphdb
SPARQL_ENDPOINT=http://graphdb-location:7200/repositories/skosmos
ADMIN_USERNAME=admin-username
ADMIN_PASSWORD=super-secret-password
```
In both cases, `skosmos` can be replaced with whatever you wish to call your repository.
To use Fuseki, set the following environment variable for the `sd-skosmos` container:

```
SPARQL_ENDPOINT=http://fuseki-location:3030/skosmos/sparql
```
And these for the `sd-skosmos-loader`:

```
DATABASE_TYPE=fuseki
SPARQL_ENDPOINT=http://fuseki-location:3030/skosmos/sparql
STORE_BASE=http://fuseki-location:3030/skosmos
ADMIN_USERNAME=admin-username
ADMIN_PASSWORD=super-secret-password
```
In both cases, `skosmos` can be replaced with whatever you wish to call your repository.
Note that the loader's `SPARQL_ENDPOINT` should be the same as the one used for the `sd-skosmos` container, as it is used for generating the Skosmos config; it is included for consistency with the other container.
Database connectors can be found in `src/database_connectors`. The value for `DATABASE_TYPE` corresponds to one of the Python files in this directory. To add support for another kind of database, you can create a new file in this directory. It should have a function called `create_connector` which returns an instance of a class that extends the `src.database.DatabaseConnector` abstract base class. The other connectors (`graphdb.py` and `fuseki.py`) can be used for reference.
The base class deals with the SPARQL operations used for keeping track of which vocabularies are loaded and when they were last updated. These methods should not need to be changed if the database type correctly implements SPARQL. You will need to create a `setup` method for creating the Skosmos repository if it doesn't exist yet, and an `add_vocabulary` method which deals with importing vocabularies from a file. There is a helper method `sparql_http_update` which can be used if the database supports the SPARQL HTTP API.