Querying text annotations at scale with Spark

An experiment using Elsevier Labs' Annotation Query library to query annotations of PubMed articles. The code is written in Scala and leverages Spark for processing.

The annotations for the articles come from PubTator and Stanford CoreNLP.

Installation

Use build.sbt to install dependencies.

You need to compile (mvn package) and add the jar for the two following libraries:

The code is tightly bound to the following workflow:

ParsePubtatorXML: This app reads the list of PubMed article IDs of interest (stored in ./data/keys), queries PubTator for each article, and stores the XML response in the ./data/xml folder. From each XML file it then extracts the text, the original document markup and the PubTator annotations:
- ./data/str contains the text content of the document, stripped of all annotations (all annotation offsets refer to this text)
- ./data/pubtator contains the PubTator annotations, including Gene, Disease, Chemical, Mutation, Species and CellLine
- ./data/om contains the original markup of the document, including Document, Title and Abstract
AnnotateSCNLP: This app uses Stanford CoreNLP to annotate the sentences contained in each article text. The annotations are then stored in ./data/scnlp.
BuildParquet: This app stores each annotation set (om, pubtator and scnlp) in a Parquet file, in the format specified by Annotation Query.
Query: This app runs several scenarios that query the annotations using logical relations, such as "give me all annotations of genes and cell lines that co-occur in the same sentence".
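
To make the co-occurrence scenario concrete, here is a minimal sketch of the underlying offset logic using plain Scala collections. It does not use Spark or the Annotation Query API: the `Annotation` case class, its field names, the `containedIn` helper, the "sentence" type label and the sample offsets are all assumptions made up for illustration, not the library's actual schema or functions.

```scala
// Hypothetical stand-in for an annotation record; the field names are
// assumptions and do not reproduce the Annotation Query schema.
final case class Annotation(
    docId: String,
    annotSet: String,   // e.g. "om", "pubtator" or "scnlp"
    annotType: String,  // e.g. "Gene", "CellLine" or "sentence"
    startOffset: Int,   // offsets refer to the stripped document text
    endOffset: Int
)

object CoOccurrenceSketch {
  // True when `inner` lies entirely within `outer` in the same document.
  def containedIn(inner: Annotation, outer: Annotation): Boolean =
    inner.docId == outer.docId &&
      inner.startOffset >= outer.startOffset &&
      inner.endOffset <= outer.endOffset

  // Returns the Gene and CellLine annotations of every sentence that
  // contains at least one annotation of each type.
  def genesAndCellLinesInSameSentence(annotations: Seq[Annotation]): Seq[Annotation] = {
    val sentences = annotations.filter(_.annotType == "sentence")
    val genes     = annotations.filter(_.annotType == "Gene")
    val cellLines = annotations.filter(_.annotType == "CellLine")

    sentences.flatMap { sentence =>
      val genesHere     = genes.filter(containedIn(_, sentence))
      val cellLinesHere = cellLines.filter(containedIn(_, sentence))
      if (genesHere.nonEmpty && cellLinesHere.nonEmpty) genesHere ++ cellLinesHere
      else Seq.empty
    }.distinct
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: two sentences, only the first holds both types.
    val annotations = Seq(
      Annotation("doc1", "scnlp", "sentence", 0, 50),
      Annotation("doc1", "scnlp", "sentence", 51, 100),
      Annotation("doc1", "pubtator", "Gene", 4, 9),
      Annotation("doc1", "pubtator", "CellLine", 20, 24),
      Annotation("doc1", "pubtator", "Gene", 60, 65)
    )
    genesAndCellLinesInSameSentence(annotations).foreach(println)
  }
}
```

With this sample data only the gene at offsets 4 to 9 and the cell line at offsets 20 to 24 are returned, because the second sentence contains a gene but no cell line. The nested filtering is quadratic and only meant to show the containment test; running such queries over Parquet files with Spark is what makes the approach practical at scale.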
