Better Define De-id Behavior for Overlapping Annotations #25

Open
boyleconnor opened this issue Dec 10, 2020 · 5 comments
Labels
Enhancement New feature or request Priority: Low

Comments

@boyleconnor
Collaborator

The first draft of the de-identifier endpoint doesn't specify how it handles overlapping annotations. An annotation can also end up getting applied multiple times across multiple deidentificationConfigs, which is not desired behavior.

@boyleconnor
Collaborator Author

Got a working-ish solution for now (see PR #27); all characters that should be de-identified are de-identified, but the visible output may not be good when annotations overlap (i.e. annotations can end up "swallowing" each other if they're processed first-to-last).

@tschaffter
Member

@cascadianblue Can you post an update here regarding this ticket?

Is this (still?) a breaking issue, or could it lead to situations that may confuse the user? If yes, please list specific examples, e.g.

Person name and physical address overlap

Input: "The patient is treated at James' Hospital"
Annotations:

  • "James" (person name)
  • "James' Hospital" (physical address)

Deidentifier output: ...

@boyleconnor changed the title from "Need More Tests and Specifications for Overlapping Annotations" to "Better Define De-id Behavior for Overlapping Annotations" on Jan 27, 2021
@boyleconnor
Collaborator Author

boyleconnor commented Jan 27, 2021

@tschaffter I don't know if I would say there is a "breaking issue", since the deidentifier always does something reasonable, but I'm not sure that it's exactly the reasonable thing that we want it to do.

In order to stay sane while de-identifying these notes, I have come up with a paradigm where every de-identification involves shifting character addresses (i.e. start and length values) in annotations to make up for the change in note text length. E.g. suppose our note says "James Smith is in Seattle" and the name annotations are [{content: "James", start: 0, length: 5}, {content: "Smith", start: 6, length: 5}] while the address annotations are [{content: "Seattle", start: 18, length: 7}].

If we apply the annotationType de-id method, like so:

{
    "note": {
        "text": "James Smith is in Seattle",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_person_name", "text_physical_address"]
        }
    ]
}

the de-identifier needs to make sure it updates the character address ranges for those annotations as it goes. After de-identifying the first person name annotation, the note text will look like this:

"[TEXT_PERSON_NAME] Smith is in Seattle"

and the annotations will now look like this:

Names: [{content: "James", start: 0, length: 18}, {content: "Smith", start: 19, length: 5}]
Addresses: [{content: "Seattle", start: 31, length: 7}]

The character addresses for every annotation will have to be shifted each time we de-identify one of our three total annotations. Hopefully this is all making sense so far.
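
For reference, here is a minimal Python sketch of that bookkeeping for the non-overlapping case only. The function name `deidentify_one` and the flat list with a `type` field are made up for illustration (they are not the de-identifier's actual code or schema), and it assumes every other annotation lies entirely before or after the one being replaced.

```python
def deidentify_one(text, annotations, index):
    """Replace annotation `index` with its [TYPE] tag and shift the others.

    Hypothetical helper: assumes no other annotation overlaps the target.
    """
    target = annotations[index]
    start = target["start"]
    end = start + target["length"]
    tag = "[" + target["type"].upper() + "]"
    delta = len(tag) - target["length"]  # change in note text length

    new_text = text[:start] + tag + text[end:]
    shifted = []
    for i, ann in enumerate(annotations):
        ann = dict(ann)  # copy so the caller's list is left untouched
        if i == index:
            ann["length"] = len(tag)
        elif ann["start"] >= end:
            ann["start"] += delta  # everything after the tag moves by delta
        shifted.append(ann)
    return new_text, shifted


text = "James Smith is in Seattle"
annotations = [
    {"type": "text_person_name", "start": 0, "length": 5},
    {"type": "text_person_name", "start": 6, "length": 5},
    {"type": "text_physical_address", "start": 18, "length": 7},
]

text, annotations = deidentify_one(text, annotations, 0)
print(text)  # [TEXT_PERSON_NAME] Smith is in Seattle
print([(a["start"], a["length"]) for a in annotations])  # [(0, 18), (19, 5), (31, 7)]
```

The tag is 18 characters and replaces 5, so every later start moves right by 13, which matches the numbers above.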

Overlapping Annotations

This of course gets tricky when we have overlapping annotations. Here is an example:

Note text: "The patient was on Martin James Street"
Name annotations: [{content: "Martin James", start: 19, length: 12}]
Address annotations: [{content: "James Street", start: 26, length: 12}]
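
As a quick sanity check on those offsets, here is a small Python sketch (the helper names `span`, `covered` and `overlap` are hypothetical) that slices the note with each annotation and reports the region the two share:

```python
text = "The patient was on Martin James Street"
name = {"start": 19, "length": 12}
address = {"start": 26, "length": 12}


def span(ann):
    """Return the half-open [start, end) character range of an annotation."""
    return ann["start"], ann["start"] + ann["length"]


def covered(text, ann):
    """Return the substring of the note that the annotation points at."""
    start, end = span(ann)
    return text[start:end]


def overlap(a, b):
    """Return the shared [start, end) range of two annotations, or None."""
    (a_start, a_end), (b_start, b_end) = span(a), span(b)
    lo, hi = max(a_start, b_start), min(a_end, b_end)
    return (lo, hi) if lo < hi else None


print(covered(text, name))     # Martin James
print(covered(text, address))  # James Street
print(overlap(name, address))  # (26, 31), i.e. "James"
```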

Here is a visual display of the original text and its annotations:

1:                          >|          |<
2:                   >|          |<
  "The patient was on Martin James Street"

Suppose we de-identify this note with the following request:

{
    "note": {
        "text": "The patient was on Martin James Street",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_physical_address", "text_person_name"]
        }
    ]
}

Now, here is a visual representation of the note and its annotations after we de-identify the first annotation, but before updating the character addresses on the annotations:

1:                          >|          |<
2:                   >|          |<
  "The patient was on Martin [TEXT_PHYSICAL_ADDRESS]"

I think it is pretty uncontroversial that the right-bound of the address annotation (annotation 1) needs to be shifted to the right 11 characters. But what about the right-bound of the name annotation (annotation 2)? It can't stay where it is (let me know if you disagree or don't know why I'm saying this), but which way should it be moved?

The current policy is to shift all addresses left if they are inside a region that is being de-identified. Therefore, after the first de-identification and updating the character addresses, the note text and annotation address ranges would look like this:

1:                          >|                     |<
2:                   >|     |<
  "The patient was on Martin [TEXT_PHYSICAL_ADDRESS]"

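The shift-left rule for this single step can be sketched in Python as follows. This is only an illustration of the policy as described here, not the actual implementation; `shift_boundary` is a hypothetical name, and boundaries are treated as half-open [start, end) positions.

```python
def shift_boundary(pos, start, end, delta):
    """Map one boundary of another annotation across a replacement.

    The region [start, end) is being replaced and the text grows by `delta`.
    """
    if pos <= start:
        return pos          # before the region: unchanged
    if pos >= end:
        return pos + delta  # after the region: moves with the text
    return start            # inside the region: shifted left to its start


text = "The patient was on Martin James Street"
tag = "[TEXT_PHYSICAL_ADDRESS]"
start, end = 26, 38                # the address annotation
delta = len(tag) - (end - start)   # 23 - 12 = 11

new_text = text[:start] + tag + text[end:]
name_start = shift_boundary(19, start, end, delta)
name_end = shift_boundary(31, start, end, delta)

print(new_text)                       # The patient was on Martin [TEXT_PHYSICAL_ADDRESS]
print(name_start, name_end)           # 19 26
print(repr(new_text[name_start:name_end]))  # 'Martin '
```

Note that the name annotation now covers "Martin " including the trailing space, which is why the space disappears once the name is de-identified.
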
The second annotation (the person name) would then get applied, and the annotation addresses would get shifted again (without controversy this time, since there's now no overlap), and the note and annotation addresses would now look like this:

1:                                     >|                     |<
2:                   >|                |<
  "The patient was on [TEXT_PERSON_NAME][TEXT_PHYSICAL_ADDRESS]"

This seems like very reasonable behavior to me. But something weird happens if you apply the same policy but just reverse the order of annotations:

Overlapping Annotations in a different Order

{
    "note": {
        "text": "The patient was on Martin James Street",
        ...
    },
    "deidentificationConfigurations": [
        {
            "deidentificationStrategy": {"annotationTypeConfig": {}},
            "annotationTypes": ["text_person_name", "text_physical_address"]
        }
    ]
}

Here is the original note and annotation address ranges (note I switched 1 & 2):

1:                   >|          |<
2:                          >|          |<
  "The patient was on Martin James Street"

After de-identifying one annotation (notice how addresses that point inside the de-identified area get shifted left!):

1:                   >|                |<
2:                   >|                       |<
  "The patient was on [TEXT_PERSON_NAME] Street"

Now we apply the second annotation (the physical address) and shift the annotation address ranges appropriately:

1:                   ><
2:                   >|                     |<
  "The patient was on [TEXT_PHYSICAL_ADDRESS]"

This behavior seems pretty reasonable, too, but the result also looks very different from the one we had before, for reasons that are unlikely to be intuitive to the user. Recall that with the de-identifications in the opposite order (see earlier section), the result had two bracketed [ANNOTATION_TYPE] tags.
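
To make the order dependence easy to reproduce, here is a rough Python sketch of the shift-left policy applied to a whole list of annotations. It is an illustration under assumptions, not the real de-identifier: the function `deidentify` is hypothetical, annotations are processed in plain list order (standing in for the order of annotationTypes in the request), and ranges are half-open.

```python
def deidentify(text, annotations):
    """Apply annotations in list order using the shift-left policy."""
    spans = [[a["start"], a["start"] + a["length"]] for a in annotations]
    for i, ann in enumerate(annotations):
        start, end = spans[i]
        tag = "[" + ann["type"].upper() + "]"
        delta = len(tag) - (end - start)
        text = text[:start] + tag + text[end:]
        for j, span in enumerate(spans):
            if j == i:
                span[1] = start + len(tag)  # the annotation now covers its tag
                continue
            for k in (0, 1):  # k=0 is the start boundary, k=1 the end boundary
                if span[k] >= end:
                    span[k] += delta   # after the region: move with the text
                elif span[k] > start:
                    span[k] = start    # inside the region: shift left
    return text


note = "The patient was on Martin James Street"
name = {"type": "text_person_name", "start": 19, "length": 12}
address = {"type": "text_physical_address", "start": 26, "length": 12}

address_first = deidentify(note, [address, name])
name_first = deidentify(note, [name, address])

print(address_first)  # The patient was on [TEXT_PERSON_NAME][TEXT_PHYSICAL_ADDRESS]
print(name_first)     # The patient was on [TEXT_PHYSICAL_ADDRESS]
```

Same note, same annotations, same policy; only the processing order differs, and the second output loses the person-name tag entirely.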

An alternative policy

We could try another policy where character addresses get shifted left if they are starts, or right if they are ends (i.e. start + length). This comes with its own weird behavior, though, and before I get into it, I'd like to hear back from you on my explanation above of the overlapping case. Let me know if it makes sense. Maybe I am missing some obvious solution.

@tschaffter
Member

@cascadianblue Thanks for the detailed answer!

Just to double-check, there is no issue regarding overlapping annotations when using only a masking character, right?
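
One way to check this is the small Python sketch below. It assumes that masking replaces every annotated character one-for-one with the masking character, so the note length never changes and no offsets need shifting; whether the sandbox's masking strategy works exactly this way is an assumption here, and `mask` is a hypothetical helper.

```python
def mask(text, annotations, mask_char="*"):
    """Overwrite each annotated character with `mask_char` (length-preserving)."""
    chars = list(text)
    for ann in annotations:
        for i in range(ann["start"], ann["start"] + ann["length"]):
            chars[i] = mask_char
    return "".join(chars)


note = "The patient was on Martin James Street"
name = {"start": 19, "length": 12}
address = {"start": 26, "length": 12}

masked_a = mask(note, [address, name])
masked_b = mask(note, [name, address])

print(masked_a)              # The patient was on *******************
print(masked_a == masked_b)  # True
```

Under that assumption the result is the same in either order, because the overlapping characters are simply masked twice.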

Solving this is not required for the release. This is typically a problem that we can keep open for discussion with the users of the NLP Sandbox.

@tschaffter added the "Priority: Low" and "Enhancement" (New feature or request) labels and removed the "Priority: High" and "Bug" (Something isn't working) labels on Jan 28, 2021
@boyleconnor boyleconnor self-assigned this Feb 26, 2021
2 participants