The project revolves around a unique natural language processing challenge where the primary goal is to recognize specific named entities, especially in the context of medical conditions and diagnoses. The project used Python 3.11.6.
A key issue of this project is gathering and annotating data, considering there's a scarcity of publicly available datasets online for this particular problem. Nonetheless, existing medical lexicons, such as Snomed-CT and ICD-10, could potentially enhance the model's vocabulary, although I do not know about the feasibility of this idea.
While there are pre-trained models on Huggingface suitable for similar NER tasks, they haven't publicly disclosed their training datasets.
The main idea is free-text processing for extracting diagnoses and diseases from medical notes. This type of named entity recognition might be useful for e.g. converting unstructured data to specially constructed standards, suitable for deployment in Hospital Information Systems.
For this kind of NER project, BERT models are anticipated to be effective. As such, the plan is to harness BERT base models and adapt them for this specialized task.
ESSENTIALS
The primary dataset originates from the TREC CT topics, publicly accessible here: http://trec-cds.org/
Each topic has a similar structure, including several diagnoses in free text format. The topics represent admission notes - notes with the most important patient details, which a doctor takes as soon as a person is admitted to a hospital. This includes personal information and demographics, such as gender and age, but also and most importantly the current medical conditions, personal medical history and family medical history. For simplification purposes, the focus lies on detecting diseases/diagnoses present in the text, covering conditions such as diabetes mellitus or high blood pressure.
This dataset makes a total of 255 entries (topics). This includes:
- topics2016.xml - valuable information in note, description and summary. 30 topics in total. The fields could be processed individually, though, creating a total of 90 topics.
- topics2021.xml - 75 topics in total.
- topics2022.xml - 50 topics in total.
- topics2023.xml - preprocessed to free text in admission note style via LLM - 40 topics in total.
ADDITIONALS
Should the topic-dataset not be enough in case of inference (e.g. error metrics too high), I will include more data from the ClinicalTrials.gov database. It contains information on clinical trials, including free text descriptions on said trials. This may be useful to further enhance the model's performance - given the complexity of annotating this kind of data, I would consider this only if the model's vocabulary does not suffice.
Since vocabulary in the medical world is complex and diverse, it might be incredibly useful to enhance the model's vocabulary with already existing medical thesauri. Some of which (such as ICD-10) are publicly available and continuously updated by medical professionals. However, I am yet uncertain on how to incorporate the thesaurus in a BERT's model vocabulary.
Language: All data (text) being used in this project will be in English.
- requirements engineering Goal: Study and collect requirements for both the project and BERT architecture. Possibly find tools on how to efficiently annotate data. Find error metrics suitable for this task. Time: 5h Deadline: 29th Oct.
- capturing and annotating data Goal: Collect all necessary data and make it ready for training the model. Time: 25h Deadline: 12th Nov.
- describing data Goal: Describe data, make plots for data visualization (e.g. wordclouds) for better understanding of the data we are working with. Time: 5h Deadline: 14th Nov.
- implementing BERT Goal: Implement and train a working BERT model. Time: 15h Deadline: 5th Dec.
- tuning BERT Goal: Perform Hyperparameter Tuning and detect possible defects. Time: 10h Deadline: 19th Dec.
- report, presentation and application Goal: Write a finished report and make a visually appealing presentation. Include a small Angular webapp. Time: 10h Deadline: 16th Jan.
- Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature (https://oa.upm.es/67933/)
This led to the conception of BioBERT, a model refined for disease recognition in biomedical texts. Available at: https://huggingface.co/alvaroalon2/biobert_diseases_ner
- Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python (https://arxiv.org/abs/2106.07799)
The result was a Python package used by medical experts for processing of biomedical texts. Interestingly enough, this tool can capture many entities and fields that are derived via rudimentary regular expressions as seen in its source code. Available at: https://github.com/medspacy/medspacy
Data Labelling has been done via doccano.
I have encountered several interesting issues while labelling data as 'medical conditions', since the definiton of a medical condition is not clear and is subject to interpretation. For instance, it is uncertain whether 'fever' should be classified as a medical condition (i.e. disease) or a symptom. The same counts for fracture of bone etc. For the purpose of this exercise, I have looked up several medical ontologies and websites, in order to see how specific medical lingo is classified. As an example, 'fever' or 'dyspnea' have, in fact, not been listed as medical conditions, but rather as symptoms.
Since doctors use many abbreviations for admission notes (e.g. 'CAD' for 'Coronary Artery Disease'), this website was very helpful: https://www.allacronyms.com/
I have decided for wordclouds and word-frequency tables to get an overlook over the text data as well as the classified data.
Wordcloud for free-text admission notes
Word-Freqency Table for free-text admission notes
Wordcloud for medical conditions
Word-Frequency Table for medical conditions
For this task (named entity recognition), we want both high precision and high recall. The precision measures the percentage of the model's predicted entities that are correct. High precision indicates a low false positive rate, which is important in medical application, as a means to avoid false alarms. On the other hand, recall measures the percentage of actual entities that the model correcly identified. High recall is important in medical NER, to ensure that no critical information is missed.
The logical conclusion is using the harmoic mean of both, which is the f1 score. It provides a balance between precision and recall. An ideal model will have both high precision and high recall, leading to a high f1-score - this is often times the primary metric for NER tasks, and this is the metric we will focus on during hyperparameter-tuning.
Using ambiguous parameters with SGD already provided great results, with an f1-score of ~0.85. I believe that with proper hyperparameter tuning we can exceed 0.9. The goal would be an f1-score of at least 0.92.
- A higher learning rate leads to worse results. (lr: 0.1, f1: ~0.3)
- The Adam optimizer performs poorly in comparison to SGD with Momentum
- The f1-score does no longer significantly change/gets worse after ~5 epochs.
- Too little learning rate leads to worse results. (lr: 0.0001, f1: ~0.3)
- Slightly higher rate reads to better results (lr: 0.001, f1: ~0.5)
- Higher batch size leads to slightly worse f1 score ({batch size: 16, f1: 0.92} -> {batch size: 32, f1: 0.83})
Currently best parameters:
{'batch_size': 8, 'learning_rate': 0.01, 'epochs': 5, 'optimizer': 'SGD', 'max_tokens': 128} -> 0.93 (with base BERT)
- requirements engineering Time Planned: 5h Time Spent: 4h Notes: Thanks to a colleague, finding appropriate tools for data annotation was easy.
- capturing and annotating data Time Planned: 25h Time Spent: 35h Notes: Data Annotation was way harder than I anticipated, vastly due to the very complicated medical lingo and unclearly defined differences between symptoms and medical conditions.
- describing data Time Planned: 5h Time Spent: 3.5h Notes: Insights were incredibly useful for finding a good value for maximum tokens.
- implementing BERT Time Planned: 15h Time Spent: 22h Notes: Unexpected difficulties utilizing BioBERT for transfer-learning.
- tuning BERT Time Planned: 10h Time Spent: 19h (active) Notes: Setting up the environment for hyperparameter-tuning was not as hard as expected, but the tuning itself was way more computationally intense than anticipated. Furthermore, there initially was weird behaviour with the error metrics (explosive error rate and rapid forgetting) - fixing the bug was rather expensive.