An advanced, extensible web front-end for the Manatee-open corpus search engine

CLARIN-PL/clarin-kontext

 
 

Introduction

KonText is an advanced corpus query interface and corpus data integration platform built around the Manatee-open corpus search engine. It is written in Python 3 and TypeScript and runs on any major Linux distribution. Development is maintained by the Institute of the Czech National Corpus.

Features

  • fully editable query chain
    • any operation from a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, and the whole sequence is then re-executed.
  • multiple search modes:
    • concordance
    • paradigmatic query
    • word list
    • keyword analysis
  • simple and advanced query types
    • advanced CQL editor with syntax highlighting and attribute recognition
    • interactive PoS tag composing tool for positional and key-value tagsets
    • customizable query suggestions and simple type query refinement (e.g. for homonym disambiguation)
  • support for spoken corpora
    • defined text segments can be played back as audio
    • KWIC detail with easily distinguishable speeches
  • rich concordance view options and tools
    • any positional attribute can be set as primary
    • multiple ways to display other attributes
    • user-defined line groups - filtering, reviewing group ratios
    • tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • rich subcorpus-related functionality
    • any subcorpus is accessible to other users (provided they obtain its URL; otherwise the subcorpus is not discoverable by default)
      • once a public description is set, the subcorpus can be discovered on the "public subcorpora" page
    • text types metadata can be gradually refined to a specific subcorpus ("which publishers are there in case only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • unused subcorpora can be archived (URLs with the subcorpus are still valid) or completely removed (URLs will become invalid)
    • searching within a subcorpus can be further refined with ad-hoc text type selection
    • a subcorpus can be created with respect to aligned corpora ("give me fiction in Czech but only if there is an English translation for it")
  • frequency distribution
    • univariate
      • positional attributes (including tuples of multiple attributes per token)
      • structural attributes
    • multivariate distribution (2 dimensions) for both positional and structural attributes
  • collocation analysis
  • persistent URLs - any result page can be easily shared even if the original query is megabytes long
  • access to previous queries, named queries
  • convenient corpus access
    • finding a corpus by keyword (tag), size, or description
    • adding a corpus to favorites (incl. subcorpora, aligned corpora)
  • saving results to Excel, CSV, XML, JSONL, TXT
  • HTTP API access
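
To illustrate the "fully editable query chain" feature listed above, here is a minimal Python sketch. It is not KonText code: the names `Operation`, `run_chain` and `replace_operation` are hypothetical, and a concordance is reduced to a plain list of strings. It only shows the idea that changing one step leads to the whole sequence being re-executed.

```python
from dataclasses import dataclass
from typing import Callable, List

# Simplification for this sketch: a "concordance" is just a list of lines.
Lines = List[str]


@dataclass
class Operation:
    name: str                         # label of the step, e.g. "filter"
    apply: Callable[[Lines], Lines]   # transforms the previous step's output


def run_chain(source: Lines, chain: List[Operation]) -> Lines:
    """Execute every operation in order, starting from the source lines."""
    result = source
    for op in chain:
        result = op.apply(result)
    return result


def replace_operation(chain: List[Operation], index: int,
                      new_op: Operation) -> List[Operation]:
    """Return a copy of the chain with the step at `index` swapped out."""
    return chain[:index] + [new_op] + chain[index + 1:]


corpus = ["the cat sat", "a dog barked", "the dog slept", "a cat purred"]
chain = [
    Operation("query", lambda ls: [l for l in ls if "dog" in l or "cat" in l]),
    Operation("filter", lambda ls: [l for l in ls if l.startswith("the")]),
    Operation("sort", lambda ls: sorted(ls)),
]
first = run_chain(corpus, chain)

# Edit the middle step, then re-execute the whole sequence from the start.
edited = replace_operation(
    chain, 1, Operation("filter", lambda ls: [l for l in ls if l.startswith("a")]))
second = run_chain(corpus, edited)
```

Keeping the chain as data (rather than storing only the final result) is what makes any step editable after the fact; the cost is that later steps have to be recomputed.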

Internal features

  • modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
  • server-side written using the Sanic framework with fully decoupled background concordance/frequency/collocation calculation (using an integrated Rq worker server)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
    • integrability with existing information systems
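
As a rough illustration of dynamically loadable plug-ins, the following Python sketch imports a module by a dotted path taken from configuration and asks it for an instance. The `load_plugin` helper, the `DummyAuth` class, the `demo_auth_plugin` module name and the convention that a module exposes a `create_instance(conf)` factory are all assumptions made for this example; refer to the KonText sources and Wiki for the actual plug-in interface.

```python
import importlib
import sys
import types
from typing import Any, Dict


def load_plugin(module_path: str, conf: Dict[str, Any]) -> Any:
    """Import a module by its dotted path and build a plug-in from it.

    Assumption of this sketch: each plug-in module exposes a
    `create_instance(conf)` factory function.
    """
    module = importlib.import_module(module_path)
    factory = getattr(module, "create_instance", None)
    if not callable(factory):
        raise ImportError(f"{module_path} does not define create_instance()")
    return factory(conf)


# --- demo only: register a fake plug-in module so the example is runnable ---
class DummyAuth:
    """Hypothetical authentication plug-in."""

    def __init__(self, conf: Dict[str, Any]) -> None:
        self.anonymous_user = conf.get("anonymous_user", "anonymous")

    def validate(self, username: str, password: str) -> bool:
        # Accept any non-anonymous user with a non-empty password.
        return username != self.anonymous_user and bool(password)


demo_module = types.ModuleType("demo_auth_plugin")
demo_module.create_instance = lambda conf: DummyAuth(conf)
sys.modules["demo_auth_plugin"] = demo_module

# In a real deployment the module path would come from a configuration file.
auth = load_plugin("demo_auth_plugin", {"anonymous_user": "guest"})
```

Because the application only depends on the module path and the factory convention, a site can swap in its own authentication or database adapter without touching the core code.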

Installation

Docker

Running KonText as a set of Docker containers is the most convenient and flexible way to deploy it. To run an instance with a basic configuration (i.e. no MySQL/MariaDB server, no WebSocket server), use:

docker-compose up

To run a production-grade instance:

docker-compose -f docker-compose.yml -f docker-compose.mysql.yml --env-file .env.mysql up

(the .env.mysql file allows configuring custom MySQL/MariaDB credentials and the KonText configuration file)

Manual installation

Key requirements

  • Python 3.6 (or newer)
  • Manatee corpus search engine - version 2.167.8 and onwards (for KonText v0.17, Manatee v2.2xx is recommended)
  • a key-value storage
    • Redis (recommended), SQLite (supported), custom implementations possible
  • a task queue - Rq
  • HTTP proxy server
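
The key-value storage requirement above mentions that custom implementations are possible. The sketch below shows, in Python, what a small storage abstraction with a SQLite backend could look like. The `KeyValueStorage` interface and the `SQLiteStorage` class are hypothetical and chosen for this example only; they are not KonText's actual plug-in API.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class KeyValueStorage(ABC):
    """Hypothetical minimal interface; not KonText's actual plug-in API."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        ...

    @abstractmethod
    def set(self, key: str, value: str) -> None:
        ...


class SQLiteStorage(KeyValueStorage):
    """Stores string values in a single two-column SQLite table."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)")

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        # INSERT OR REPLACE overwrites the value of an existing key.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, value))
        self._conn.commit()


# Usage: an in-memory database keeps the example free of side effects.
store = SQLiteStorage()
store.set("user:1:lang", "cs")
store.set("user:1:lang", "en")   # overwrites the previous value
lang = store.get("user:1:lang")
missing = store.get("user:2:lang")
```

A Redis-backed class could implement the same two methods, which is the point of hiding the backend behind one interface.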

For Ubuntu users, we recommend the install script, which should perform most of the actions necessary to install and run KonText. For other Linux distributions, we recommend running KonText within a container or a virtual machine. Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

Notable users

How to cite KonText

Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface

@inproceedings{machalek-2020-kontext,
    title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
    author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
    pages = "7003--7008",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
