This repository contains the the builder and configuration files for our instances of Korp. We build one image for the backend, and one for the frontend, for each of the 18 languages we have corpora for.
File listing of this repository:
github.com/giellatekno/korp (this repository)
README.md - this file
tasks.py - script to build and push images, and run them locally
gtweb2_korp_settings/
translations/ - translations needed by the frontend,
will be built into the image (same for all instances)
front/
config-LANG.yam - frontend config, will be built into the image
(one file per LANG)
corpus_config/ - backend config (one folder per LANG)
LANG/
attributes/
structural/ - contains metadata fields of the corpora,
stuff like title, one file per property
date, isbn, etc
positional/ - data per word, such as pos, deprel, etc,
one file per column (per property)
corpora/ - one .yaml file per cwb corpora,
e.g. "sme_admin_20211118.yaml",
modes/ - each "sub-corp". So we could have one for the
default korp for that language. one for historical
corpora, one for parallell, etc. språkbanken
uses modes for separating them corpora like that.
selector-page/
index.html - simple page with links to the various korp sites
visible at https://gtweb-02.uit.no/korp/
deployed as a static site in /var/www/html
The Dockerfiles to build the images is embedded in the source code of tasks.py. On the server, running them is done with the gtweb-service-script. It knows where this repo is located, so it can mount in the configuration files from this repo correctly.
The frontend image contains the built version of korp-frontend, with the
lang it hosts built into its configuration. Which backend the image uses
is configured by setting the environment variable BACKEND
at image runtime.
Korp-frontend requires the backend to be set in the yaml config file at
built time (when yarn build
is run). However, the built site contains this
definition in only a single javascript file. The image has a custom entrypoint
script that will will use sed
to in-place modify the built javascript file
when the image starts up, to set backend to whatever the env var BACKEND
is
set to. This way, we can set the korp_backend_url
at image runtime, instead
of at image built time, drastically reducing the number of images we have to
maintain (usually, the alternative would imply 2 different images for running
it locally, and on the server, and additional variants of the image for each
backend).
Building the frontend image is done with the command:
python tasks.py build front <lang>
The backend reads config files at startup to determine all of its settings, so there is just one built image. Running a different instance is just a matter of mounting in different config files. Build the backend using
python tasks.py build back
While testing
t to build images, and to push them to the ACR (Azure Container Registry) that we have on Azure (which the server then pulls images from). It can also be used to run the images locally, but you do need to have the compiled cwb files for the backend to actually have any data.
All required configuration files and paths are mapped into the container at runtime, using bind mounts. The configuration files
(argument -v OUTSIDE_PATH:INSIDE_PATH
passed to
podman run
, where OUTSIDE_PATH
means a path on the system that runs the
container, and INSIDE_PATH
means where this will be visible inside the container).
These are:
-v /home/anders/korp/config.py:/korp/korp-backend/instance/config.py
. This is the "root" configuration file for the backend. It defines where the other paths are. (Note: local path will change!)-v /corpora/gt_cwb:/corpora/gt_cwb
. This entire folder is mapped in to the same path in the container. It's where the CWB files are. We map it to this specific path, because that's where theconfig.py
tells Korp to look.-v /home/anders/korp/corpus_configs/INSTANCE:/opt/korp-backend/corpus_config
(replaceINSTANCE
withsmi
,nsu
orother
). This is where Korp reads about the corpuses, and how to present them, what options are set, etc. We run three instances of Korp, and each gets their owncorpus_config
. We map it into/opt/korp-backend/corpus_config
inside the container, because ourconfig.py
file (mapped earlier) defines that this is where to look.
In addition to the bind mounts for configuration, we have to set the port
binding, using -p OUTSIDE_PORT:1234
, where we replace OUTSIDE_PORT
with
1341
for smi
, 1343
for nsu
and 1345
for other
.
Finally, there are some flags we must give for the container to function
properly, namely --privileged
and --security-opt label=disable
.
The full command is therefore:
podman run --privileged -d -p OUTSIDE_PORT:1234 -v /corpora/gt_cwb:/corpora/gt_cwb -v /home/anders/korp/corpus_configs/INSTANCE:/opt/korp-backend/corpus_config -v /home/anders/korp/config.py:/korp/korp-backend/instance/config.py --security-opt label=disable --name korp-backend-INSTANCE korp-backend
On the server, we set up systemd units that runs each container for us. We first
create the container, using the podman run
commands, with a given --name
,
so that we can can then use podman generate systemd NAME > /etc/systemd/user/NAME.service
to generate the systemd unit file for that service.
The usual CorpusTools
approach (steps 1-4), followed by more "server-administrative" tasks as the final steps (steps 5-8):
- import (
add_files_to_corpus
) - adds the files under some specified category, writes corresponding meta-file (xsl)- fill in meta data
- extract content from sources (
convert2xml
) - analyse using analyser (
analyse_corpus
) (requires analyser for that language built with correct ./configure params) - make korp-mono ready (
korp_mono
) - extracts from analysis, converts the format to CWB-input format, and concatenates all files under a subcategory to one file (e.g. "sme_admin", admin is the category, so "sme_admin" will be one "corpus" (as defined by CWB)) - pack it up (
pack_cwbmono_files
) (which really just makes a tarball) - send it to the server (scp) (note: HARD TO AUTOMATE)
- [on server] unpack and update registry-files paths (
unpack_cwbmono_files
) - Make sure to update the
corpus_config
. Change dates of corpora files, etc. (needs more explanation) - [on server] Restart the Korp instance for it to take effect (
podman restart korp-backend-<group>
,group
beingsmi
(sami),nsu
(non-sami uralic),other
[??])- Subject to change if deployed differently (but it should ideally just be to restart some pod somewhere, or a similarly simple command)
There is no system for this. Anything that is in the corpus_config must be present in the CWB files. CWB files that are not referenced in the corpus_config are ignored.
The frontend is simple, and should just be cloning, yarn
to install dependencies,
and a yarn dev
to start the development server. The problem is that it needs
a backend, otherwise it will just show an error screen.
The backend is a bit more work. It needs a locally setup corpus_config
, CWB
tools installed, as well as the config.py
itself. It's possible to just copy
over some of the corpora that one wants to test against, but it's a little
bit of work.
Another issue is testing something against the full dataset. The entire dataset of just the CWB files themselves are 5.9 GB.
We need it up and running. Probably just a mariadb container, running with a bind mount to a database folder on the host.
This must be reachable from the backend. We could use a podman network, or just expose a port locally and connect to that through the host.
We need to create the tables, the schema for them are in the documentation, so that's fine. However, grepping through the source reveals 0 "INSERT" statements, meaning we must fill the database ourselves. How do we do that? -- I assume it's a one-time thing, but we still have to do it.
Later problems: How do we make sure that there aren't problems if we remove, add, or rename (update) some corpora?