Better Define De-id Behavior for Overlapping Annotations #25

boyleconnor · 2020-12-10T23:07:33Z

The first draft of the de-identifier endpoint doesn't specify how it handles overlapping annotations. An annotation can also end up getting applied multiple times across multiple deidentificationConfigs, which is not desired behavior.


boyleconnor · 2020-12-19T01:03:43Z

Got a working-ish solution for now (see PR #27); all characters that should be de-identified are de-identified, but the visible output may not be good when annotations overlap (i.e. annotations can end up "swallowing" each other if they're processed first-to-last).

tschaffter · 2021-01-07T22:54:18Z

@cascadianblue Can you post an update here regarding this ticket?

tschaffter · 2021-01-25T16:09:02Z

@cascadianblue Can you post an update here regarding this ticket?

Is this (still?) a breaking issue, or could it lead to situations that may confuse the user? If yes, please list a specific example, e.g.

Person name and physical address overlap

Input: "The patient is treated at James' Hospital"
Annotations:

"James" (person name)
"James' Hospital" (physical address)
Deidentifier output: ...

boyleconnor · 2021-01-27T22:44:28Z

@tschaffter I don't know if I would say there is a "breaking issue", since the deidentifier always does something reasonable, but I'm not sure that it's exactly the reasonable thing that we want it to do.

In order to stay sane while de-identifying these notes, I have come up with a paradigm where every de-identification involves shifting character addresses (i.e. start and length values) in annotations to make up for the change in note text length. E.g. suppose our note says "James Smith is in Seattle" and the name annotations are [{content: "James", start: 0, length: 5}, {content: "Smith", start: 6, length: 5}] while the address annotations are [{content: "Seattle", start: 18, length: 7}].

If we apply the annotationType de-id method, like so:

{
    "note": {
        "text": "James Smith is in Seattle",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_person_name", "text_physical_address"]
        }
    ]
}

the de-identifier needs to make sure it updates the character address ranges for those annotations as it goes. After de-identifying the first person name annotation, the note text will look like this:

"[TEXT_PERSON_NAME] Smith is in Seattle"

and the annotations will now look like this:

Names: [{content: "James", start: 0, length: 18}, {content: "Smith", start: 19, length: 5}]
Addresses: [{content: "Seattle", start: 31, length: 7}]

The character addresses for every annotation will have to be shifted each time we de-identify one of our three total annotations. Hopefully this is all making sense so far.
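The non-overlapping case above can be sketched as follows. This is only an illustration of the shifting paradigm, not the de-identifier's actual code: the `Annotation` class and the `replace_and_shift` function are hypothetical names, and the sketch assumes the target annotation overlaps no other annotation.

```python
from dataclasses import dataclass


@dataclass
class Annotation:  # hypothetical container, not the project's schema class
    content: str
    start: int
    length: int
    annotation_type: str


def replace_and_shift(text, annotations, target):
    """Replace `target` with its [ANNOTATION_TYPE] tag and update all addresses.

    Assumes `target` overlaps no other annotation.
    """
    tag = f"[{target.annotation_type.upper()}]"
    end = target.start + target.length
    delta = len(tag) - target.length  # change in note text length
    new_text = text[:target.start] + tag + text[end:]
    for ann in annotations:
        if ann is target:
            ann.length = len(tag)  # the annotation now covers the tag
        elif ann.start >= end:
            ann.start += delta  # everything after the replacement moves by delta
    return new_text


james = Annotation("James", 0, 5, "text_person_name")
smith = Annotation("Smith", 6, 5, "text_person_name")
seattle = Annotation("Seattle", 18, 7, "text_physical_address")
annotations = [james, smith, seattle]

text = replace_and_shift("James Smith is in Seattle", annotations, james)
print(text)  # [TEXT_PERSON_NAME] Smith is in Seattle
print([(a.start, a.length) for a in annotations])  # [(0, 18), (19, 5), (31, 7)]
```

Calling `replace_and_shift` again for `smith` and then `seattle` would repeat the same shift for the remaining two annotations.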

Overlapping Annotations

This of course gets tricky when we have overlapping annotations. Here is an example:

Note text: "The patient was on Martin James Street"
Name annotations: [{content: "Martin James", start: 19, length: 12}]
Address annotations: [{content: "James Street", start: 26, length: 12}]

Here is a visual display of the original text and its annotations:

1:                          >|          |<
2:                   >|          |<
  "The patient was on Martin James Street"

Suppose we de-identify this note with the following request:

{
    "note": {
        "text": "The patient was on Martin James Street",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_physical_address", "text_person_name"]
        }
    ]
}

Now, here is a visual representation of the note and its annotations after we de-identify the first annotation, but before updating the character addresses on the annotations:

1:                          >|          |<
2:                   >|          |<
  "The patient was on Martin [TEXT_PHYSICAL_ADDRESS]"

I think it is pretty uncontroversial that the right-bound of the address annotation (annotation 1) needs to be shifted to the right 11 characters. But what about the right-bound of the name annotation (annotation 2)? It can't stay where it is, since it would now point into the middle of the inserted [TEXT_PHYSICAL_ADDRESS] tag (let me know if you disagree), but which way should it be moved?

The current policy is to shift all addresses left if they are inside a region that is being de-identified. Therefore, after the first de-identification and updating the character addresses, the note text and annotation address ranges would look like this:

1:                          >|                     |<
2:                   >|     |<
  "The patient was on Martin [TEXT_PHYSICAL_ADDRESS]"
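A minimal sketch of this "shift left" rule, applied to a single character address. The function name `shift_left_policy` and the use of exclusive end bounds (`start + length`) are assumptions made for illustration, not taken from the actual implementation.

```python
def shift_left_policy(pos, region_start, region_end, delta):
    """Map one character address across a replacement of [region_start, region_end).

    Addresses strictly inside the replaced region collapse to its start
    ("shift left"); addresses at or after its end move by `delta`.
    """
    if pos <= region_start:
        return pos  # before the replaced region: unchanged
    if pos < region_end:
        return region_start  # inside the replaced region: shift left
    return pos + delta  # after the replaced region: follow the text


# "James Street" occupies [26, 38) and becomes a 23-character tag.
delta = len("[TEXT_PHYSICAL_ADDRESS]") - 12
name_bounds = (19, 31)  # "Martin James" as (start, exclusive end)
new_bounds = tuple(shift_left_policy(p, 26, 38, delta) for p in name_bounds)
print(new_bounds)  # (19, 26)
```

The name annotation shrinks to start 19, length 7 (covering "Martin "), which matches the diagram.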

The second annotation (the person name) would then get applied, and the annotation addresses would get shifted again (without controversy this time, since there's now no overlap), and the note and annotation addresses would now look like this:

1:                                     >|                     |<
2:                   >|                |<
  "The patient was on [TEXT_PERSON_NAME][TEXT_PHYSICAL_ADDRESS]"

This seems like very reasonable behavior to me. But something weird happens if you apply the same policy but just reverse the order of annotations:

Overlapping Annotations in a different Order

{
    "note": {
        "text": "The patient was on Martin James Street",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_person_name", "text_physical_address"]
        }
    ]
}

Here is the original note and annotation address ranges (note I switched 1 & 2):

1:                   >|          |<
2:                          >|          |<
  "The patient was on Martin James Street"

After de-identifying one annotation (notice how addresses that point inside the de-identified area get shifted left!):

1:                   >|                |<
2:                   >|                       |<
  "The patient was on [TEXT_PERSON_NAME] Street"

Now we apply the second annotation (the physical address) and shift the annotation address ranges appropriately:

1:                   ><
2:                   >|                     |<
  "The patient was on [TEXT_PHYSICAL_ADDRESS]"

This behavior seems pretty reasonable, too, but the result also looks very different from the one we had before, for reasons that don't seem super intuitive to the user. Recall how just reversing the order of the de-identifications (see earlier section) got us a result which had two bracketed [ANNOTATION_TYPE]'s.
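The order dependence can be reproduced with a small sketch. Everything here is hypothetical (the `deidentify` function, the dict layout of the annotations, and the exclusive-end convention); it only mirrors the "shift left" policy as described in this comment.

```python
def shift_left(pos, region_start, region_end, delta):
    # Addresses inside the replaced region collapse to its start.
    if pos <= region_start:
        return pos
    if pos < region_end:
        return region_start
    return pos + delta


def deidentify(text, annotations, type_order):
    """Apply [ANNOTATION_TYPE] tags type by type, in the order given."""
    anns = [dict(a) for a in annotations]  # work on copies
    for ann_type in type_order:
        for ann in anns:
            if ann["type"] != ann_type:
                continue
            start, end = ann["start"], ann["start"] + ann["length"]
            tag = f"[{ann_type.upper()}]"
            delta = len(tag) - (end - start)
            text = text[:start] + tag + text[end:]
            for other in anns:
                if other is ann:
                    continue
                o_start = shift_left(other["start"], start, end, delta)
                o_end = shift_left(other["start"] + other["length"], start, end, delta)
                other["start"], other["length"] = o_start, o_end - o_start
            ann["length"] = len(tag)
    return text


note = "The patient was on Martin James Street"
annotations = [
    {"type": "text_person_name", "start": 19, "length": 12},
    {"type": "text_physical_address", "start": 26, "length": 12},
]

address_first = deidentify(note, annotations, ["text_physical_address", "text_person_name"])
name_first = deidentify(note, annotations, ["text_person_name", "text_physical_address"])
print(address_first)  # The patient was on [TEXT_PERSON_NAME][TEXT_PHYSICAL_ADDRESS]
print(name_first)     # The patient was on [TEXT_PHYSICAL_ADDRESS]
```

The same note and annotations give two different outputs depending only on the order of `annotationTypes`.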

An alternative policy

We could try another policy where character addresses get shifted left if they are starts, or right if they are ends (i.e. start + length). This comes with its own weird behavior, though, and before I get into it, I'd like to hear back from you on my explanation above of why overlapping annotations are tricky. Let me know if it makes sense. Maybe I am missing some obvious solution.
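One possible reading of this alternative policy, as a sketch only: inside a replaced region, a start address snaps to the region's start and an end address snaps to the region's new end. The function name `shift_bound_alternative` and its parameters are hypothetical, and this may not be exactly what is meant above.

```python
def shift_bound_alternative(pos, is_start, region_start, region_end, delta):
    """Hypothetical alternative: starts shift left, ends shift right."""
    if pos <= region_start:
        return pos  # before the replaced region: unchanged
    if pos >= region_end:
        return pos + delta  # after the replaced region: follow the text
    # Inside the replaced region: direction depends on the kind of bound.
    return region_start if is_start else region_end + delta


# "James Street" occupies [26, 38) and becomes a 23-character tag.
delta = len("[TEXT_PHYSICAL_ADDRESS]") - 12
name_start = shift_bound_alternative(19, True, 26, 38, delta)
name_end = shift_bound_alternative(31, False, 26, 38, delta)
print((name_start, name_end))  # (19, 49)
```

Under this reading, the name annotation would span "Martin [TEXT_PHYSICAL_ADDRESS]" after the first step, so applying it next would replace the address tag as well.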

tschaffter · 2021-01-28T04:44:15Z

@cascadianblue Thanks for the detailed answer!

Just to double-check, there is no issue regarding overlapping annotations when using only a masking character, right?

Solving this is not required for the release. This is typically a problem that we can keep open for discussion with the users of the NLP Sandbox.

boyleconnor mentioned this issue Dec 10, 2020

Implement support for different deid strategy for each annotation type (server) #13

Closed

boyleconnor added the Bug label Dec 10, 2020

tschaffter added the Priority: High label Jan 7, 2021

boyleconnor changed the title ~~Need More Tests and Specifications for Overlapping Annotations~~ Better Define De-id Behavior for Overlapping Annotations Jan 27, 2021

tschaffter added the Priority: Low and Enhancement labels and removed the Priority: High and Bug labels Jan 28, 2021

boyleconnor self-assigned this Feb 26, 2021
