Merge pull request #5 from jenojp/develop
Updated for issue #4, allow users to specify own negation dictionaries.
jenojp authored Aug 18, 2019
2 parents fed33e3 + 62af13b commit 2856b8e
Showing 11 changed files with 233 additions and 59 deletions.
3 changes: 2 additions & 1 deletion CONTRIBUTING.md
@@ -3,4 +3,5 @@
:tada: Thanks for your interest in this project :tada:

* Please submit an issue request for any bugs, feature requests, or questions.
* Feel free to fork the repo and submit a pull request.
* Feel free to fork the repo and submit a pull request.
* Please use [Black](https://github.com/ambv/black) to format code before submitting.
24 changes: 23 additions & 1 deletion README.md
@@ -2,7 +2,7 @@

# negspacy: negation for spaCy

[![Build Status](https://travis-ci.org/jenojp/negspacy.svg?branch=master)](https://travis-ci.org/jenojp/negspacy) [![Built with spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io) [![pypi Version](https://img.shields.io/pypi/v/negspacy.svg?style=flat-square)](https://pypi.org/project/negspacy/)
[![Build Status](https://travis-ci.org/jenojp/negspacy.svg?branch=master)](https://travis-ci.org/jenojp/negspacy) [![Built with spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io) [![pypi Version](https://img.shields.io/pypi/v/negspacy.svg?style=flat-square)](https://pypi.org/project/negspacy/) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)

spaCy pipeline object for negating concepts in text. Based on the NegEx algorithm.

@@ -41,6 +41,28 @@ Steve Jobs True
Apple False
```
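
For context, a minimal sketch of a pipeline that produces output like the above; the sentence, the `ent_types` filter, and the `from negspacy.negation import Negex` import path are illustrative assumptions rather than text from this commit.

```python
import spacy
from negspacy.negation import Negex  # assumed import path

nlp = spacy.load("en_core_web_sm")
negex = Negex(nlp, ent_types=["PERSON", "ORG"])  # assumption: only negate these entity types
nlp.add_pipe(negex, last=True)

# "not" is a preceding negation, so "Steve Jobs" is negated;
# "but" terminates the scope, so "Apple" is left un-negated.
doc = nlp("She does not like Steve Jobs but likes Apple products.")
for e in doc.ents:
    print(e.text, e._.negex)
```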

Consider pairing with [scispacy](https://allenai.github.io/scispacy/) to find UMLS concepts in text and process negations.
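
A hedged sketch of that pairing, assuming a scispacy model such as `en_core_sci_sm` is installed; the sentence and expected flags come from this commit's test data.

```python
import spacy
from negspacy.negation import Negex  # assumed import path

nlp = spacy.load("en_core_sci_sm")  # assumption: scispacy biomedical model is installed
negex = Negex(nlp)  # no ent_types filter: every entity the model finds is checked
nlp.add_pipe(negex, last=True)

doc = nlp("Patient denies cardiovascular disease but has headaches.")
for e in doc.ents:
    print(e.text, e._.negex)  # per the tests: "cardiovascular disease" True, "headaches" False
```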

## NegEx Patterns

* **psuedo_negations** - phrases that are false triggers, ambiguous negations, or double negatives
* **preceeding_negations** - negation phrases that precede an entity
* **following_negations** - negation phrases that follow an entity
* **termination** - phrases that split a sentence into parts for the purposes of negation detection (e.g., "but"); see the sketch below
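
To make the categories concrete, a small sketch using sentences from this commit's test data; it assumes a model (e.g. scispacy's `en_core_sci_sm`) that recognizes these terms as entities, and the import path is an assumption.

```python
import spacy
from negspacy.negation import Negex  # assumed import path

nlp = spacy.load("en_core_sci_sm")  # assumption: a model that tags these terms as entities
negex = Negex(nlp)
nlp.add_pipe(negex, last=True)

# "unlikely" is a following negation, so "Alcoholism" is flagged as negated;
# "not ruled out" is a pseudo-negation (a false trigger), so "Smoking" is not.
doc = nlp("Alcoholism unlikely. Smoking not ruled out.")
for e in doc.ents:
    print(e.text, e._.negex)  # per the tests: Alcoholism True, Smoking False
```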

### Use your own patterns or view the patterns in use

Use your own patterns:
```python
nlp = spacy.load("en_core_web_sm")
negex = Negex(nlp, termination=["but", "however", "nevertheless", "except"])
```

View the patterns in use:
```python
patterns_dict = negex.get_patterns()
```
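
A brief follow-up sketch showing one way to inspect the returned dictionary, which maps each pattern type to its list of phrase patterns (per the `get_patterns` docstring added in this commit); the import path is an assumption.

```python
import spacy
from negspacy.negation import Negex  # assumed import path

nlp = spacy.load("en_core_web_sm")
negex = Negex(nlp)

for pattern_type, patterns in negex.get_patterns().items():
    print(pattern_type, len(patterns))
# keys: psuedo_patterns, preceeding_patterns, following_patterns, termination_patterns
```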

## Contributing
See the [contributing guidelines](https://github.com/jenojp/negspacy/blob/master/CONTRIBUTING.md).

Binary file modified docs/build/doctrees/environment.pickle
Binary file modified docs/build/doctrees/negspacy.doctree
11 changes: 10 additions & 1 deletion docs/build/html/genindex.html
@@ -38,11 +38,20 @@ <h3>Navigation</h3>
<h1 id="index">Index</h1>

<div class="genindex-jumpbox">
<a href="#N"><strong>N</strong></a>
<a href="#G"><strong>G</strong></a>
| <a href="#N"><strong>N</strong></a>
| <a href="#P"><strong>P</strong></a>
| <a href="#T"><strong>T</strong></a>

</div>
<h2 id="G">G</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="negspacy.html#negspacy.negation.Negex.get_patterns">get_patterns() (negspacy.negation.Negex method)</a>
</li>
</ul></td>
</tr></table>

<h2 id="N">N</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
20 changes: 19 additions & 1 deletion docs/build/html/negspacy.html
@@ -42,7 +42,7 @@ <h2>Submodules<a class="headerlink" href="#submodules" title="Permalink to this
<span id="negspacy-negation-module"></span><h2>negspacy.negation module<a class="headerlink" href="#module-negspacy.negation" title="Permalink to this headline"></a></h2>
<dl class="class">
<dt id="negspacy.negation.Negex">
<em class="property">class </em><code class="sig-prename descclassname">negspacy.negation.</code><code class="sig-name descname">Negex</code><span class="sig-paren">(</span><em class="sig-param">nlp</em>, <em class="sig-param">ent_types=[]</em><span class="sig-paren">)</span><a class="headerlink" href="#negspacy.negation.Negex" title="Permalink to this definition"></a></dt>
<em class="property">class </em><code class="sig-prename descclassname">negspacy.negation.</code><code class="sig-name descname">Negex</code><span class="sig-paren">(</span><em class="sig-param">nlp</em>, <em class="sig-param">ent_types=[]</em>, <em class="sig-param">psuedo_negations=[]</em>, <em class="sig-param">preceeding_negations=[]</em>, <em class="sig-param">following_negations=[]</em>, <em class="sig-param">termination=[]</em><span class="sig-paren">)</span><a class="headerlink" href="#negspacy.negation.Negex" title="Permalink to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<blockquote>
<div><p>A spaCy pipeline component which identifies negated tokens in text.</p>
@@ -54,9 +54,27 @@ <h2>Submodules<a class="headerlink" href="#submodules" title="Permalink to this
<dd class="field-odd"><ul class="simple">
<li><p><strong>nlp</strong> (<em>object</em>) – spaCy language object</p></li>
<li><p><strong>ent_types</strong> (<em>list</em>) – list of entity types to negate</p></li>
<li><p><strong>psuedo_negations</strong> (<em>list</em>) – list of phrases that cancel out a negation, if empty, defaults are used</p></li>
<li><p><strong>preceeding_negations</strong> (<em>list</em>) – negations that appear before an entity, if empty, defaults are used</p></li>
<li><p><strong>following_negations</strong> (<em>list</em>) – negations that appear after an entity, if empty, defaults are used</p></li>
<li><p><strong>termination</strong> (<em>list</em>) – phrases that “terminate” a sentence for processing purposes such as “but”. If empty, defaults are used</p></li>
</ul>
</dd>
</dl>
<dl class="method">
<dt id="negspacy.negation.Negex.get_patterns">
<code class="sig-name descname">get_patterns</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#negspacy.negation.Negex.get_patterns" title="Permalink to this definition"></a></dt>
<dd><p>returns phrase patterns used for various negation dictionaries</p>
<dl class="field-list simple">
<dt class="field-odd">Returns</dt>
<dd class="field-odd"><p><strong>patterns</strong> – pattern_type: [patterns]</p>
</dd>
<dt class="field-even">Return type</dt>
<dd class="field-even"><p>dict</p>
</dd>
</dl>
</dd></dl>

<dl class="method">
<dt id="negspacy.negation.Negex.negex">
<code class="sig-name descname">negex</code><span class="sig-paren">(</span><em class="sig-param">doc</em><span class="sig-paren">)</span><a class="headerlink" href="#negspacy.negation.Negex.negex" title="Permalink to this definition"></a></dt>
Binary file modified docs/build/html/objects.inv
2 changes: 1 addition & 1 deletion docs/build/html/searchindex.js

Some generated files are not rendered by default.

191 changes: 139 additions & 52 deletions negspacy/negation.py
@@ -16,69 +16,151 @@ class Negex:
spaCy language object
ent_types: list
list of entity types to negate
psuedo_negations: list
list of phrases that cancel out a negation; if empty, defaults are used
preceeding_negations: list
negations that appear before an entity; if empty, defaults are used
following_negations: list
negations that appear after an entity; if empty, defaults are used
termination: list
phrases that "terminate" a sentence for processing purposes, such as "but"; if empty, defaults are used
"""

def __init__(self, nlp, ent_types=[]):
def __init__(
self,
nlp,
ent_types=list(),
psuedo_negations=list(),
preceeding_negations=list(),
following_negations=list(),
termination=list(),
):
if not Span.has_extension("negex"):
Span.set_extension("negex", default=False, force=True)
psuedo_negations = [
"gram negative",
"no further",
"not able to be",
"not certain if",
"not certain whether",
"not necessarily",
"not rule out",
"not ruled out",
"not been ruled out",
"without any further",
"without difficulty",
"without further",
]
preceeding_negations = [
"absence of",
"declined",
"denied",
"denies",
"denying",
"did not exhibit",
"no sign of",
"no signs of",
"not",
"not demonstrate",
"patient was not",
"rules out",
"doubt",
"negative for",
"no",
"no cause of",
"no complaints of",
"no evidence of",
"versus",
"without",
"without indication of",
"without sign of",
"without signs of",
"ruled out",
]
following_negations = ["declined", "unlikely"]
termination = ["but", "however"]
if not psuedo_negations:
psuedo_negations = [
"gram negative",
"no further",
"not able to be",
"not certain if",
"not certain whether",
"not necessarily",
"not rule out",
"not ruled out",
"not been ruled out",
"without any further",
"without difficulty",
"without further",
]
if not preceeding_negations:
preceeding_negations = [
"absence of",
"declined",
"denied",
"denies",
"denying",
"did not exhibit",
"no sign of",
"no signs of",
"not",
"not demonstrate",
"patient was not",
"rules out",
"doubt",
"negative for",
"no",
"no cause of",
"no complaints of",
"no evidence of",
"versus",
"without",
"without indication of",
"without sign of",
"without signs of",
"ruled out",
]
if not following_negations:
following_negations = [
"declined",
"unlikely",
"was ruled out",
"were ruled out",
"was not",
"were not",
]
if not termination:
termination = [
"although",
"apart from",
"as there are",
"aside from",
"but",
"cause for",
"cause of",
"causes for",
"causes of",
"etiology for",
"etiology of",
"except",
"however",
"involving",
"nevertheless",
"origin for",
"origin of",
"origins for",
"origins of",
"other possibilities of",
"reason for",
"reason of",
"reasons for",
"reasons of",
"secondary to",
"source for",
"source of",
"sources for",
"sources of",
"still",
"though",
"trigger event for",
"which",
"yet",
]

# efficiently build spaCy matcher patterns
psuedo_patterns = list(nlp.tokenizer.pipe(psuedo_negations))
preceeding_patterns = list(nlp.tokenizer.pipe(preceeding_negations))
following_patterns = list(nlp.tokenizer.pipe(following_negations))
termination_patterns = list(nlp.tokenizer.pipe(termination))
self.psuedo_patterns = list(nlp.tokenizer.pipe(psuedo_negations))
self.preceeding_patterns = list(nlp.tokenizer.pipe(preceeding_negations))
self.following_patterns = list(nlp.tokenizer.pipe(following_negations))
self.termination_patterns = list(nlp.tokenizer.pipe(termination))

self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
self.matcher.add("Psuedo", None, *psuedo_patterns)
self.matcher.add("Preceeding", None, *preceeding_patterns)
self.matcher.add("Following", None, *following_patterns)
self.matcher.add("Termination", None, *termination_patterns)
self.matcher.add("Psuedo", None, *self.psuedo_patterns)
self.matcher.add("Preceeding", None, *self.preceeding_patterns)
self.matcher.add("Following", None, *self.following_patterns)
self.matcher.add("Termination", None, *self.termination_patterns)
self.keys = [k for k in self.matcher._docs.keys()]
self.ent_types = ent_types

def get_patterns(self):
"""
returns phrase patterns used for various negation dictionaries
Returns
-------
patterns: dict
pattern_type: [patterns]
"""
patterns = {
"psuedo_patterns": self.psuedo_patterns,
"preceeding_patterns": self.preceeding_patterns,
"following_patterns": self.following_patterns,
"termination_patterns": self.termination_patterns,
}
for pattern in patterns:
logging.info(pattern)
return patterns

def process_negations(self, doc):
"""
Find negations in doc and clean candidate negations to remove pseudo negations
@@ -98,7 +98,12 @@ def process_negations(self, doc):
list of tuples of terminating phrases
"""

if not doc.is_nered:
raise ValueError(
"Negations are evaluated for Named Entities found in text. "
"Your SpaCy pipeline does not included Named Entity resolution. "
"Please ensure it is enabled or choose a different language model that includes it."
)
preceeding = list()
following = list()
terminating = list()
39 changes: 38 additions & 1 deletion negspacy/test.py
@@ -1,3 +1,4 @@
import pytest
import spacy
from negation import Negex

@@ -30,13 +31,15 @@ def build_med_docs():
docs = list()
docs.append(
(
"Patient denies cardiovascular disease but has headaches. No history of smoking.",
"Patient denies cardiovascular disease but has headaches. No history of smoking. Alcoholism unlikely. Smoking not ruled out.",
[
("Patient", False),
("denies", False),
("cardiovascular disease", True),
("headaches", False),
("smoking", True),
("Alcoholism", True),
("Smoking", False),
],
)
)
@@ -53,6 +56,13 @@ def build_med_docs():
],
)
)

docs.append(
(
"Alcoholism was not the cause of liver disease.",
[("Alcoholism", True), ("liver disease", False)],
)
)
return docs


@@ -78,6 +88,33 @@ def test_umls():
assert (e.text, e._.negex) == d[1][i]


def test_no_ner():
nlp = spacy.load("en_core_web_sm", disable=["ner"])
negex = Negex(nlp)
nlp.add_pipe(negex, last=True)
with pytest.raises(ValueError):
doc = nlp("this doc has not been NERed")


def test_own_terminology():
nlp = spacy.load("en_core_web_sm")
negex = Negex(nlp, termination=["whatever"])
nlp.add_pipe(negex, last=True)
doc = nlp("He does not like Steve Jobs whatever he says about Barack Obama.")
assert doc.ents[1]._.negex == False


def test_get_patterns():
nlp = spacy.load("en_core_web_sm")
negex = Negex(nlp)
patterns = negex.get_patterns()
assert type(patterns) == dict
assert len(patterns) == 4


if __name__ == "__main__":
test()
test_umls()
test_bad_beharor()
test_own_terminology()
test_get_patterns()