pyvandenbussche/AnnotationQuery-pubmed

Querying text annotations at scale with SPARK

An experiment using Elsevier Labs' Annotation Query library to query annotations of PubMed articles. The code is written in Scala and uses Spark for processing.

The annotations for the articles are extracted from PubTator and Stanford CoreNLP.

Installation

Use build.sbt to install dependencies.

You need to compile (mvn package) and add the jar for the two following libraries:

Running the code

The code is tightly bound to the following workflow:

  1. ParsePubtatorXML: This app reads the list of PubMed article IDs of interest (stored in ./data/keys), queries PubTator for each article, and stores the XML response in the ./data/xml folder. Then, from each XML file, it extracts the text, the original document markup and the PubTator annotations:

    • ./data/str contains the plain text of the document, stripped of all annotations (all annotation offsets refer to this text)
    • ./data/pubtator contains the PubTator annotations, including Gene, Disease, Chemical, Mutation, Species and CellLine
    • ./data/om contains the original markup of the document, including Document, Title and Abstract.
  2. AnnotateSCNLP: This app uses Stanford CoreNLP to annotate the sentences contained in each article text. The annotations are then stored in ./data/scnlp

  3. BuildParquet: This app stores each annotation set (om, pubtator and scnlp) in a Parquet file, in the format specified by Annotation Query.

  4. Query: This app runs several scenarios querying the annotations with logical relations such as: "give me all annotations of genes and cell lines that co-occur in the same sentence"
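To make the kind of query in step 4 concrete, here is a minimal sketch of the "genes and cell lines in the same sentence" scenario using plain Scala collections. It does not use the Annotation Query API or Spark. The `Annotation` case class, its field names, the `sentence` type label and the sample offsets are all assumptions made for illustration, not the schema the library defines.

```scala
// Hypothetical stand-off annotation record. The field names are assumptions,
// not the schema defined by Annotation Query.
final case class Annotation(docId: String, set: String, kind: String, start: Int, end: Int)

object CoOccurrence {
  // True when `inner` lies entirely within the character span of `outer`
  // and both belong to the same document.
  def containedIn(inner: Annotation, outer: Annotation): Boolean =
    inner.docId == outer.docId && inner.start >= outer.start && inner.end <= outer.end

  // For every sentence, pair each gene with each cell line found inside it.
  def genesWithCellLines(annotations: Seq[Annotation]): Seq[(Annotation, Annotation)] = {
    val sentences = annotations.filter(a => a.set == "scnlp" && a.kind == "sentence")
    val genes     = annotations.filter(a => a.set == "pubtator" && a.kind == "Gene")
    val cellLines = annotations.filter(a => a.set == "pubtator" && a.kind == "CellLine")
    for {
      s <- sentences
      g <- genes if containedIn(g, s)
      c <- cellLines if containedIn(c, s)
    } yield (g, c)
  }

  // Made-up sample data: only the first sentence of document "1" holds
  // both a gene and a cell line.
  val sample: Seq[Annotation] = Seq(
    Annotation("1", "scnlp", "sentence", 0, 50),
    Annotation("1", "scnlp", "sentence", 51, 120),
    Annotation("1", "pubtator", "Gene", 10, 15),
    Annotation("1", "pubtator", "CellLine", 30, 34),
    Annotation("1", "pubtator", "Gene", 60, 65),
    Annotation("2", "scnlp", "sentence", 0, 40),
    Annotation("2", "pubtator", "CellLine", 5, 9)
  )
}
```

Calling `CoOccurrence.genesWithCellLines(CoOccurrence.sample)` pairs the gene at offsets 10-15 with the cell line at offsets 30-34. The nested loop is only suitable for small inputs. At scale the same containment test would presumably be expressed as a join over the Parquet annotation sets.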
