GitHub - elsevierlabs-os/soda: Solr Dictionary Annotator (Microservice for Spark)

Solr Dictionary Annotator

Introduction

The Solr Dictionary Annotator (SoDA) is a Dictionary-based Annotator (or Gazetteer) that supports exact as well as fuzzy lookups across multiple lexicons.

SoDA is backed by a Solr index that holds entity names (primary and alternate names), as well as an identifier for each entity. Multiple copies of these entity names, each stemmed by a stemming algorithm of a different strength, are created and stored in the index. During annotation, the text to be annotated is stemmed in the same way, and its spans are matched against the similarly stemmed entity names in the index. Fast (FST-based) span lookup is done using the SolrTextTagger project.

SoDA supports multiple dictionaries (lexicons) within the same Solr index. Matching modes currently supported are exact, lower (case-insensitive), stop (English stopwords removed), and three levels of stemming (stem1, stem2, stem3), implemented using Solr's Minimal English Stemmer, KStem Stemmer, and Porter Stemmer respectively.
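The idea of normalizing both entity names and input text in the same way, then matching spans, can be sketched in a few lines of Python. This is only an illustration and not SoDA's implementation: SoDA delegates the lookup to Solr and SolrTextTagger, whereas the sketch uses a plain in-memory dictionary. The function names, the tiny stopword list, and the assumption that the stop mode also lowercases are all hypothetical, and the three stemming modes are left out because they rely on Solr's stemmers.

```python
import re

# Tiny illustrative stopword list (an assumption); Solr's English set is larger.
STOPWORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def normalize(text, mode):
    """Return the tokens of `text` as a tuple, normalized for a matching mode."""
    tokens = re.findall(r"\w+", text)
    if mode == "exact":
        return tuple(tokens)
    lowered = [t.lower() for t in tokens]
    if mode == "lower":
        return tuple(lowered)
    if mode == "stop":
        # Assumed here: stopword removal is applied on top of lowercasing.
        return tuple(t for t in lowered if t not in STOPWORDS)
    raise ValueError(f"unsupported matching mode in this sketch: {mode}")


def build_index(entries, mode):
    """Map each normalized entity name to the set of entity ids that carry it."""
    index = {}
    for entity_id, name in entries:
        key = normalize(name, mode)
        if key:
            index.setdefault(key, set()).add(entity_id)
    return index


def find_matches(text, index, mode, max_len=5):
    """Return (matched span, entity id) pairs for spans of up to max_len tokens."""
    tokens = normalize(text, mode)
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            span = tokens[start:end]
            for entity_id in sorted(index.get(span, ())):
                matches.append((" ".join(span), entity_id))
    return matches


if __name__ == "__main__":
    entries = [("C1", "United States of America"), ("C2", "Heart Attack")]
    index = build_index(entries, "lower")
    print(find_matches("Signs of a heart attack", index, "lower"))
```

A stricter mode misses variants that a looser mode catches: in this sketch, "heart attack" is not found in exact mode because the dictionary entry is capitalized, but it is found in lower mode.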

Usage

SoDA provides a JSON over HTTP interface. Requests are submitted to SoDA as JSON documents over HTTP POST, and SoDA responds with JSON documents. This form of API allows us to be language agnostic and cross platform. In addition, SoDA also provides a Scala and a Python client, both of which expose a programmatic interface to SoDA.

Because of this language and platform independence, SoDA can also be accessed from other environments, such as Apache Spark or Databricks notebooks, using Python or Scala.
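As a rough illustration of the JSON-over-HTTP style described above, the following Python sketch builds and sends a POST request using only the standard library. The base URL, the `annot.json` path, and the `lexicon`, `text`, and `matching` field names are assumptions made for this example; consult the SoDA API documentation for the actual endpoints and request format. The bundled Python and Scala clients are likely a better choice in practice.

```python
import json
import urllib.request


def build_annot_request(base_url, text, lexicon, matching="exact"):
    """Build (but do not send) a JSON POST request for an annotation call.

    The path and field names are placeholders assumed for this sketch.
    """
    payload = {"lexicon": lexicon, "text": text, "matching": matching}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/annot.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_annotation(base_url, text, lexicon, matching="exact", timeout=10):
    """Send the request and decode the JSON document in the response."""
    request = build_annot_request(base_url, text, lexicon, matching)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical host and lexicon name; requires a running SoDA instance.
    result = post_annotation(
        "http://localhost:8080/soda",
        "Heart attacks are common in the United States.",
        lexicon="countries",
        matching="lower",
    )
    print(result)
```

Keeping request construction separate from the network call makes the payload easy to inspect, and the same function can be reused inside a Spark `mapPartitions` step if each partition posts its own documents.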

Architecture

The figure below shows the architecture of the SoDA system. SoDA itself is a fairly lightweight application and, although this is not required, it can generally co-exist with Solr on the same machine.

Fig 1: SoDA architecture

More Information

Presentations

Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP by Sujit Pal at Spark Summit Europe 2015.

SoDA v1.0 links

The following links describe SoDA v1, which used an older version of Solr (5.0.0) and SolrTextTagger (2.1-SNAPSHOT). The latest release of SoDA v1 can be retrieved using the tag "v1.1". The major difference between v1 and v2 is that the OpenNLP phrase-based fuzzy matching has been replaced with multiple levels of stemmed matching; see Issue#12 for the discussion. Regrettably, I don't have the bandwidth to support v1, so please consider moving to the latest version.

Citing

If you need to cite SoDA in your work, please use the following DOI:

Pal, Sujit (2015). Solr Dictionary Annotator [Computer Software]; https://github.com/elsevierlabs-os/soda
