Better Define De-id Behavior for Overlapping Annotations #25
Comments
Got a working-ish solution for now (see PR #27); all characters that should be de-identified are de-identified, but there might not be a good visible output for this if they are overlapping (i.e. annotations can end up "swallowing" each other if they're processed first-to-last).
@cascadianblue Can you post an update here regarding this ticket? Is this (still?) a breaking issue, or could it lead to situations that may confuse the user? If yes, please list specific examples, e.g. person name and physical address overlap. Input: "The patient is treated at James' Hospital"
@tschaffter I don't know if I would say there is a "breaking issue", since the deidentifier always does something reasonable, but I'm not sure that it's exactly the reasonable thing that we want it to do. In order to stay sane while de-identifying these notes, I have come up with a paradigm where every de-identification involves shifting character addresses (i.e. If we apply the {
"note": {
"text": "James Smith is in Seattle",
...
},
"deidentificationConfigurations": [
{
"deidentificationStrategy": {"annotationTypeConfig": {}},
"annotationTypes": ["text_person_name", "text_physical_address"]
}
]
} the de-identifier needs to make sure it updates the character address ranges for those annotations as it goes. After de-identifying the first person name note, the note text will look like this:
and the annotations will now look like this:
Names:
The character addresses for every annotation will have to be shifted each time we de-identify one of our three total annotations. Hopefully this is all making sense so far.
Overlapping Annotations
This of course gets tricky when we have overlapping annotations. Here is an example:
Note text:
Here is a visual display of the original text and its annotations:
Suppose we de-identify this note with the following request: {
"note": {
"text": "The patient was on Martin James Street",
...
},
"deidentificationConfigurations": [
{
"deidentificationStrategy": {"annotationTypeConfig": {}},
"annotationTypes": ["text_physical_address", "text_person_name"]
}
]
} Now, here is a visual representation of the note and its annotations after we de-identify the first annotation, but before updating the character addresses on the annotations:
I think it is pretty uncontroversial that the right-bound of the address annotation (annotation 1) needs to be shifted to the right 11 characters. But what about the right-bound of the name annotation (annotation 2)? It can't stay where it is (let me know if you disagree or don't know why I'm saying this), but which way should it be moved? The current policy is to shift all addresses left if they are inside a region that is being de-identified. Therefore, after the first de-identification and updating the character addresses, the note text and annotation address ranges would look like this:
The second annotation (the person name) would then get applied, and the annotation addresses would get shifted again (without controversy this time, since there's now no overlap), and the note and annotation addresses would now look like this:
This seems like very reasonable behavior to me. But something weird happens if you apply the same policy but just reverse the order of annotations:
Overlapping Annotations in a Different Order
{
"note": {
"text": "The patient was on Martin James Street",
...
},
"deidentificationConfigurations": [
{
"deidentificationStrategy": {"annotationTypeConfig": {}},
"annotationTypes": ["text_person_name", "text_physical_address"]
}
]
} Here is the original note and annotation address ranges (note I switched 1 & 2):
After de-identifying one annotation (notice how addresses that point inside the de-identified area get shifted left!):
Now we apply the second annotation (the physical address) and shift the annotation address ranges appropriately:
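For reference, the left-shift policy walked through above can be sketched in a few lines of Python. This is only an illustration, not the code in PR #27: the `Span` class, the `shift` and `deidentify` helpers, the half-open character offsets, and the `[ANNOTATION_TYPE]` replacement format are all assumptions made for the example. Spans are applied in list order, and after each replacement every bound of the still-pending spans is either left alone (at or before the replaced region), moved left to the region start (inside it), or shifted by the length difference (at or after its end).

```python
from dataclasses import dataclass


@dataclass
class Span:
    label: str  # annotation type, e.g. "text_person_name"
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)


def shift(bound: int, region_start: int, region_end: int, delta: int) -> int:
    """Move one bound after the region [region_start, region_end) was replaced."""
    if bound <= region_start:
        return bound              # before the region: unaffected
    if bound < region_end:
        return region_start       # inside the region: shifted left
    return bound + delta          # after the region: follows the text


def deidentify(text: str, spans: list[Span]) -> str:
    """Apply spans in list order, re-addressing the pending ones each time."""
    pending = [Span(s.label, s.start, s.end) for s in spans]  # do not mutate input
    for i, span in enumerate(pending):
        # Hypothetical replacement format: the bracketed, upper-cased type name.
        replacement = f"[{span.label.upper()}]"
        region_start, region_end = span.start, span.end
        delta = len(replacement) - (region_end - region_start)
        text = text[:region_start] + replacement + text[region_end:]
        for other in pending[i + 1:]:
            other.start = shift(other.start, region_start, region_end, delta)
            other.end = shift(other.end, region_start, region_end, delta)
    return text


note = "The patient was on Martin James Street"
name = Span("text_person_name", 19, 31)          # assumed to cover "Martin James"
address = Span("text_physical_address", 26, 38)  # assumed to cover "James Street"

print(deidentify(note, [address, name]))
# The patient was on [TEXT_PERSON_NAME][TEXT_PHYSICAL_ADDRESS]
print(deidentify(note, [name, address]))
# The patient was on [TEXT_PHYSICAL_ADDRESS]
```

Under these assumed spans, applying the address first leaves two bracketed tokens, while applying the name first lets the address annotation swallow it, so the output depends only on processing order.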
This behavior seems pretty reasonable, too, but the result also looks very different from the one we had before, for reasons that don't seem super intuitive to the user. Recall how just reversing the order of the de-identifications (see earlier section) got us a result which had two bracketed
An alternative policy
We could try another policy where character addresses get shifted left if they are
@cascadianblue Thanks for the detailed answer! Just to double-check, there is no issue regarding overlapping annotations when using only a masking character, right? Solving this is not required for the release. This is typically a problem that we can keep open for discussion with the users of the NLP Sandbox.
The first draft of the de-identifier endpoint doesn't specify how it handles overlapping annotations. An annotation can also end up getting applied multiple times across multiple deidentificationConfigs, which is not desired behavior.