KonText is an advanced corpus query interface and corpus data integration platform built around the Manatee-open corpus search engine. It is written in Python 3 and TypeScript and runs on any major Linux distribution. It is developed and maintained by the Institute of the Czech National Corpus.
- fully editable query chain
  - any operation in a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, after which the whole sequence is re-executed
- multiple search modes:
  - concordance
  - paradigmatic query
  - word list
  - keyword analysis
- simple and advanced query types
- advanced CQL editor with syntax highlighting and attribute recognition
- interactive PoS tag composing tool for positional and key-value tagsets
- customizable query suggestions and refinement of simple-type queries (e.g. for homonym disambiguation)
- support for spoken corpora
  - defined text segments can be played back as audio
  - KWIC detail with easily distinguishable speeches
- rich concordance view options and tools
  - any positional attribute can be set as primary
  - multiple ways of displaying other attributes
  - user-defined line groups (filtering, reviewing group ratios)
  - tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
- rich subcorpus-related functionality
  - any subcorpus is accessible to other users who obtain its URL (otherwise the subcorpus is not discoverable by default)
  - once a public description is set, the subcorpus can be discovered on the "public subcorpora" page
  - text type metadata can be gradually narrowed down to a specific subcorpus ("which publishers are there if only fiction is selected?")
  - a custom text type ratio can be defined ("give me 20% fiction and 80% journalism")
  - unused subcorpora can be archived (URLs with the subcorpus remain valid) or completely removed (URLs become invalid)
  - searching within a subcorpus can be further refined with ad-hoc text type selection
  - a subcorpus can be created with respect to aligned corpora ("give me fiction in Czech but only if there is an English translation for it")
- frequency distribution
  - univariate
    - positional attributes (including tuples of multiple attributes per token)
    - structural attributes
  - multivariate distribution (2 dimensions) for both positional and structural attributes
- collocation analysis
- persistent URLs - any result page can be easily shared even if the original query is megabytes long
- access to previous queries, named queries
- convenient corpus access
  - finding a corpus by keyword (tag), size, or description
  - adding a corpus to favorites (incl. subcorpora, aligned corpora)
- saving results to Excel, CSV, XML, JSONL, TXT
- HTTP API access
- modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
- server-side written using the Sanic framework with fully decoupled background concordance/frequency/collocation calculation (using an integrated Rq worker server)
- modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
- integrability with existing information systems
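
The "fully editable query chain" feature can be illustrated with a small sketch. This is not KonText's actual implementation: the `Step` class, `run_chain`, `replace_step`, and the toy in-memory "corpus" are all hypothetical names introduced here. The sketch only shows the idea that changing one operation causes the whole sequence to be re-executed from the start.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Line = str
Operation = Callable[[Sequence[Line]], List[Line]]


@dataclass
class Step:
    """One operation in a chain (hypothetical; not a KonText class)."""
    name: str
    apply: Operation


def run_chain(lines: Sequence[Line], steps: Sequence[Step]) -> List[Line]:
    """Execute every step in order, feeding each result into the next step."""
    result = list(lines)
    for step in steps:
        result = step.apply(result)
    return result


def replace_step(steps: Sequence[Step], index: int, new_step: Step) -> List[Step]:
    """Return a copy of the chain with a single step swapped out."""
    updated = list(steps)
    updated[index] = new_step
    return updated


# Toy data standing in for a corpus (an assumption for illustration only).
corpus_lines = ["the cat sat", "a dog ran", "the dog sat", "a cat ran"]

chain = [
    Step("query", lambda ls: [l for l in ls if "sat" in l or "ran" in l]),
    Step("filter: dog", lambda ls: [l for l in ls if "dog" in l]),
    Step("sort", lambda ls: sorted(ls)),
]

first = run_chain(corpus_lines, chain)

# Edit the middle operation, then re-execute the whole sequence.
edited = replace_step(
    chain, 1, Step("filter: cat", lambda ls: [l for l in ls if "cat" in l])
)
second = run_chain(corpus_lines, edited)

print(first)
print(second)
```

Keeping each step as a self-contained function is what makes the re-execution simple here: the edited chain is just run again on the original input.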
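
The custom text type ratio ("give me 20% fiction and 80% journalism") comes down to a small piece of arithmetic, sketched below. KonText's real subcorpus composition works on actual corpus documents and may use a different method; the function name `max_subcorpus_sizes` and the token counts are assumptions made for illustration. The sketch computes the largest subcorpus that satisfies the requested percentages given how much text of each type is available.

```python
from typing import Dict


def max_subcorpus_sizes(
    available: Dict[str, int], ratio_percent: Dict[str, int]
) -> Dict[str, int]:
    """Return per-type sizes of the largest subcorpus matching the ratio.

    `available` maps a text type to the number of tokens on hand;
    `ratio_percent` maps a text type to its requested share in percent.
    """
    if sum(ratio_percent.values()) != 100:
        raise ValueError("ratios must sum to 100 percent")
    # Largest total T with T * pct / 100 <= available[t] for every type t.
    total = min(
        available[t] * 100 // pct for t, pct in ratio_percent.items() if pct > 0
    )
    return {t: total * pct // 100 for t, pct in ratio_percent.items()}


# Hypothetical token counts per text type.
available_tokens = {"fiction": 500, "journalism": 1000}
requested = {"fiction": 20, "journalism": 80}

sizes = max_subcorpus_sizes(available_tokens, requested)
print(sizes)
```

In this example journalism is the limiting text type: all 1000 of its tokens are used, so fiction is capped at 250 tokens to keep the 20:80 ratio.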
Running KonText as a set of Docker containers is the most convenient and flexible way to deploy it. To run an instance with a basic configuration (i.e. no MySQL/MariaDB server, no WebSocket server), use:
docker-compose up
To run a production-grade instance:
docker-compose -f docker-compose.yml -f docker-compose.mysql.yml --env-file .env.mysql up
(the .env.mysql file allows configuring custom MySQL/MariaDB credentials and the KonText configuration file)
- Python 3.6 (or newer)
- Manatee corpus search engine - version 2.167.8 and onwards (for KonText v0.17, Manatee v2.2xx is recommended)
- a key-value storage
- a task queue - Rq
- HTTP proxy server
Ubuntu users are advised to use the install script, which should perform most of the steps needed to install and run KonText. For other Linux distributions, we recommend running KonText within a container or a virtual machine. Please refer to the doc/INSTALL.md file for details.
Please refer to our Wiki.
- Institute of the Czech National Corpus
- LINDAT/CLARIAH-CZ
- CLARIN-PL
- CLARIN-SI
- Інститут української
- Serbski Institut (API version of KonText)
Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface
@inproceedings{machalek-2020-kontext,
title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
pages = "7003--7008",
language = "English",
ISBN = "979-10-95546-34-4",
}