Solr Dictionary Annotator (Microservice for Spark)

zentiment/soda

 
 


## Solr Dictionary Annotator

### Introduction

The Solr Dictionary Annotator (SoDA) is a Dictionary-based Annotator (or Gazetteer) that supports exact as well as fuzzy lookups across multiple lexicons.

SoDA is backed by a Solr index that stores entity names (and their synonyms) together with the identifier of each entity. Fast, FST-based span lookup is done using the SolrTextTagger project. Additional fuzzy lookup features are supported using OpenNLP and a mix of normalization strategies.
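To make the data model concrete, here is a minimal in-memory sketch of a dictionary annotator doing exact lookup. It is an illustration only and not how SoDA is implemented: the entries, field names (`id`, `lexicon`, `names`), and function names are hypothetical, and a naive substring scan stands in for the Solr index and the FST-based SolrTextTagger.

```python
from typing import Dict, List, Tuple

# Hypothetical entries: each entity has an identifier, a lexicon it belongs to,
# and one or more surface forms (its name and synonyms).
ENTRIES = [
    {"id": "C001", "lexicon": "countries", "names": ["United States", "USA"]},
    {"id": "C002", "lexicon": "countries", "names": ["France"]},
]


def build_index(entries: List[dict]) -> Dict[str, Dict[str, str]]:
    """Map lexicon -> surface form -> entity identifier."""
    index: Dict[str, Dict[str, str]] = {}
    for entry in entries:
        names = index.setdefault(entry["lexicon"], {})
        for name in entry["names"]:
            names[name] = entry["id"]
    return index


def annotate_exact(
    text: str, lexicon: str, index: Dict[str, Dict[str, str]]
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched text, entity id) for every exact match.

    This is a naive scan that also matches inside longer words; a real
    tagger works on token boundaries.
    """
    spans = []
    for name, entity_id in index.get(lexicon, {}).items():
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), name, entity_id))
            start = text.find(name, start + 1)
    return sorted(spans)
```

Calling `annotate_exact("France and the USA", "countries", build_index(ENTRIES))` yields one span per matched name, each carrying the identifier of the entity it resolves to, so that a synonym such as `USA` maps to the same identifier as `United States`.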

### Usage

SoDA provides a JSON-over-HTTP interface. Requests are submitted to SoDA as JSON documents over HTTP POST, and SoDA responds with JSON documents. This style of API keeps SoDA language-agnostic and cross-platform: it can be accessed from individual clients, Spark standalone applications, and the Databricks Notebook environment using Python and Scala. Details of SoDA's REST API can be found here.
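As a rough sketch of what a Python client could look like, the snippet below builds a JSON POST request using only the standard library. The endpoint path (`annot.json`), the base URL, and the payload fields (`lexicon`, `text`, `matching`) are assumptions made for illustration; check the REST API documentation for the actual names.

```python
import json
import urllib.request


def build_annot_request(base_url: str, lexicon: str, text: str,
                        matching: str = "exact") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request.

    The path and field names below are assumed, not taken from the SoDA docs.
    """
    payload = {"lexicon": lexicon, "text": text, "matching": matching}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/annot.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def annotate(base_url: str, lexicon: str, text: str,
             matching: str = "exact", timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    request = build_annot_request(base_url, lexicon, text, matching)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a SoDA instance running, a call might look like `annotate("http://localhost:8080/soda", "countries", "France and the USA")`, where the host, port, and lexicon name are placeholders. Keeping request construction separate from sending makes the client easy to test without a live server.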

### Architecture

In terms of architecture, the SoDA system looks something like this.

[Figure: SoDA architecture diagram]

Callers invoke the annotate (TBD) function with the necessary parameters, which results in a JSON/HTTP call to the SoDA webapp. Some calls, such as exact and lowercase lookups, are passed directly to the SolrTextTagger. Other calls, such as punctuation-normalized, unordered, or fuzzy lookups, need the input string to be tokenized first, and the appropriate query is then made to Solr instead. For example, punctuation-normalized lookups require sentence normalization to ensure we don't match across sentence boundaries, and unordered or fuzzy lookups require extracting phrases and matching them.
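The routing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not SoDA's code: the mode labels (`exact`, `lowercase`, `punctuation`, `unordered`, `fuzzy`) are hypothetical names for the lookup types mentioned in the text, and a regex-based splitter stands in for the OpenNLP sentence detector.

```python
import re
from typing import List

# Hypothetical mode labels for the lookup types described in the text.
TAGGER_MODES = {"exact", "lowercase"}
QUERY_MODES = {"punctuation", "unordered", "fuzzy"}


def route(mode: str) -> str:
    """Decide whether a lookup goes straight to the tagger or needs preprocessing."""
    if mode in TAGGER_MODES:
        return "tagger"
    if mode in QUERY_MODES:
        return "tokenize_then_query"
    raise ValueError(f"unknown matching mode: {mode}")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption: stands in for OpenNLP)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize_punctuation(sentence: str) -> str:
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", sentence).split())


def prepare(text: str, mode: str) -> List[str]:
    """Return the strings that would be sent on for lookup."""
    if route(mode) == "tagger":
        return [text]  # passed through unchanged
    # Split first so that normalization cannot merge adjacent sentences.
    return [normalize_punctuation(s) for s in split_sentences(text)]
```

Splitting into sentences before stripping punctuation is the key ordering here: once the sentence-ending periods are gone, a phrase could otherwise match across what used to be a sentence boundary.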

### More Information

Citing

If you need to cite SoDA in your work, please use the following DOI:

DOI

Pal, Sujit (2015). Solr Dictionary Annotator [Computer Software]; https://github.com/elsevierlabs-os/soda
