The Solr Dictionary Annotator (SoDA) is a Dictionary-based Annotator (or Gazetteer) that supports exact as well as fuzzy lookups across multiple lexicons.
SoDA is backed by a Solr index which holds entity names (primary and alternate names), as well as an identifier for each entity. Multiple copies of these entity names, stemmed by a set of stemming algorithms of various strengths, are created and stored in the index. During annotation, the input text is stemmed in the same way, and its spans are matched against the similarly stemmed entity names in the index. Fast, FST-based span lookup is done using the SolrTextTagger project.
SoDA supports multiple dictionaries (lexicons) within the same Solr index. The matching modes currently supported are exact, lower (case insensitive), stop (English stopwords removed), and three levels of stemming (stem1, stem2, stem3), implemented using Solr's Minimal English Stemmer, KStem Stemmer, and Porter Stemmer respectively.
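The idea of storing several normalized copies of each entity name and matching similarly normalized text against them can be illustrated with a small in-memory sketch. This is not SoDA's implementation: SoDA uses Solr analyzers and SolrTextTagger, whereas the code below uses a plain Python dictionary. The stopword list, the `toy_stem` function (a crude plural stripper standing in for a real stemmer), the single `stem` mode, the assumption that each mode builds on the previous one, and all entity names and identifiers are hypothetical and chosen only for illustration.

```python
import re

# Tiny illustrative stopword list (assumption); real English lists are longer.
STOPWORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}
MODES = ("exact", "lower", "stop", "stem")


def toy_stem(token):
    """Crude plural stripper; a stand-in for a real stemmer, not Solr's."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text, mode):
    """Tokenize text and normalize it for the given matching mode."""
    if mode not in MODES:
        raise ValueError("unknown matching mode: " + mode)
    tokens = re.findall(r"\w+", text)
    if mode == "exact":
        return tokens
    tokens = [t.lower() for t in tokens]
    if mode == "lower":
        return tokens
    tokens = [t for t in tokens if t not in STOPWORDS]
    if mode == "stop":
        return tokens
    return [toy_stem(t) for t in tokens]


def build_index(entities, modes=MODES):
    """Store one normalized copy of every entity name per matching mode."""
    index = {}
    for entity_id, names in entities.items():
        for name in names:
            for mode in modes:
                index[(mode, tuple(normalize(name, mode)))] = entity_id
    return index


def annotate(text, index, mode, max_len=5):
    """Return (entity_id, matched_span) pairs found in the normalized text."""
    tokens = normalize(text, mode)
    hits = []
    for start in range(len(tokens)):
        stop = min(start + max_len, len(tokens))
        for end in range(start + 1, stop + 1):
            span = tuple(tokens[start:end])
            entity_id = index.get((mode, span))
            if entity_id is not None:
                hits.append((entity_id, " ".join(span)))
    return hits


# Hypothetical lexicon with one entity and two names.
entities = {"C001": ["Heart Attack", "Myocardial Infarction"]}
index = build_index(entities)
print(annotate("Patient had a heart attack", index, "lower"))
print(annotate("History of two heart attacks", index, "stem"))
```

In this sketch the looser modes trade precision for recall: the `exact` mode would miss the lowercase mention in the first sentence, and only the stemmed mode finds the plural mention in the second.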
SoDA provides a JSON over HTTP interface. Requests are submitted to SoDA as JSON documents over HTTP POST, and SoDA responds with JSON documents. This form of API allows SoDA to be language agnostic and cross platform. In addition, SoDA provides a Scala client and a Python client, both of which expose a programmatic interface to SoDA.
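A minimal sketch of what a JSON-over-HTTP call might look like from Python, using only the standard library. The endpoint URL and the field names `lexicon`, `text`, and `matching` are assumptions made for illustration; the API documentation linked below is the authority on the actual endpoints and request schema.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL of your own SoDA deployment.
SODA_URL = "http://localhost:8080/soda/annot.json"


def build_annotation_request(text, lexicon, matching, url=SODA_URL):
    """Build (but do not send) an HTTP POST request carrying a JSON body."""
    # Field names are assumptions, not taken from the SoDA API documentation.
    payload = {"lexicon": lexicon, "text": text, "matching": matching}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON document in the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_annotation_request("Patient had a heart attack", "demo", "lower")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send(request)` requires a running server at the assumed URL, so the example only builds and prints the request. Because the exchange is plain JSON over HTTP, the same call can be made from any language with an HTTP client.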
Because of this language and platform independence, SoDA can also be accessed from other environments, such as Apache Spark jobs or Databricks notebooks, using Python or Scala.
The architecture of the SoDA system is shown in Fig 1. SoDA itself is a fairly lightweight application and, although this is not required, it can generally co-exist with Solr on the same machine.
Fig 1: SoDA architecture
- Changes in this release
- SoDA Installation and Configuration
- SoDA Application Programming Interface (API)
- Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP by Sujit Pal at Spark Summit Europe 2015.
The following links describe SoDA v1, which used an older version of Solr (5.0.0) and SolrTextTagger (2.1-SNAPSHOT). The latest release of SoDA v1 can be retrieved using the tag "v1.1". The major difference between v1 and v2 is that the OpenNLP phrase-based fuzzy matching has been replaced with multiple levels of stemmed matching; see Issue#12 for the discussion. Regrettably, I don't have the bandwidth to support v1, so please consider moving to the latest version.
- Running SoDA from Docker (v1)
- SoDA Installation and Configuration (v1)
- SoDA Application Programming Interface (v1)
If you need to cite SoDA in your work, please use the following DOI:
Pal, Sujit (2015). Solr Dictionary Annotator [Computer Software]; https://github.com/elsevierlabs-os/soda