This directory contains information on retrieving data and creating models. All details regarding creating, building and running the NLP model are stored here.
- The data directory stores textual content. Methods for retrieving data should be stored in the retrieve_data folder.
- The MedCAT models directory holds models.
Step 1: Create the model
The model components are found here: this directory contains everything required to initialise a model pack.
All models should be stored here.
Step 2: Perform training
- Step 2.1: Unsupervised training
  The unsupervised training steps can be found within the unsupervised_training folder.
- Step 2.2: Supervised training
  After providing supervised labels with MedCATtrainer, follow the supervised training steps within the supervised_training folder.
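Unsupervised training generally consumes a stream of raw document texts. The sketch below shows one way to produce such a stream without loading the whole corpus into memory. The function name `iter_documents`, the `*.txt` file pattern and the UTF-8 encoding are assumptions made for illustration; they are not taken from this repository.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_documents(data_dir: Union[str, Path], pattern: str = "*.txt") -> Iterator[str]:
    """Yield the text of each matching file in data_dir, one document at a time.

    Hypothetical helper: assumes one document per UTF-8 text file.
    Files are visited in sorted order so that runs are reproducible.
    """
    for path in sorted(Path(data_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files, which carry no training signal
            yield text
```

Because this is a generator, it can be passed to any training routine that accepts an iterable of strings, and each file is read only when it is needed.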
Step 3: Run model
Run the model on your corpus of documents and write the output to a CSV file or SQL database. Instructions on how to do this can be found within run_model.
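As a rough illustration of this step, the sketch below applies an annotator to each document and writes one row per annotation to a CSV file or an SQLite database. Everything named here is an assumption for the example: the column layout in `FIELDS`, the `annotations` table, and `toy_annotate`, which is a keyword-matching stand-in rather than a real model. In practice the `annotate` callable would wrap the loaded model pack.

```python
import csv
import sqlite3
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical output layout: one row per annotation.
FIELDS = ["doc_id", "cui", "start_char", "end_char", "source_value"]


def toy_annotate(text: str) -> List[Dict]:
    """Stand-in annotator (NOT a real model): tags every mention of 'diabetes'."""
    term = "diabetes"
    lowered = text.lower()
    entities = []
    pos = lowered.find(term)
    while pos != -1:
        end = pos + len(term)
        entities.append({
            "cui": "EXAMPLE_CUI",  # placeholder, not a real concept identifier
            "start_char": pos,
            "end_char": end,
            "source_value": text[pos:end],
        })
        pos = lowered.find(term, end)
    return entities


def annotate_corpus(docs: Iterable[Tuple[str, str]],
                    annotate: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run annotate over (doc_id, text) pairs and flatten the results into rows."""
    rows = []
    for doc_id, text in docs:
        for ent in annotate(text):
            rows.append({"doc_id": doc_id, **ent})
    return rows


def write_csv(rows: List[Dict], path: str) -> None:
    """Write annotation rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def write_sqlite(rows: List[Dict], db_path: str) -> None:
    """Append annotation rows to an 'annotations' table in an SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS annotations ("
                "doc_id TEXT, cui TEXT, start_char INTEGER, "
                "end_char INTEGER, source_value TEXT)"
            )
            conn.executemany(
                "INSERT INTO annotations VALUES "
                "(:doc_id, :cui, :start_char, :end_char, :source_value)",
                rows,
            )
    finally:
        conn.close()
```

For example, `write_csv(annotate_corpus(docs, toy_annotate), "annotations.csv")` would produce a file with one line per detected mention. Keeping character offsets alongside the concept identifier makes it possible to trace each row back to the source text during later review.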
1. Establish your aims, hypothesis and scope.
2. Define your cohort/dataset. How will you identify your cohort and relevant documents?
3. Select the standardised clinical terminology and version most suitable for your use case.
4. Select an existing model or create your own.
5. Produce annotation guidelines and create a “gold standard”: manually label a sample of your dataset through annotations. This sample must be as representative as possible to ensure optimal model performance.
6. Train the model and compare it to your “gold standard”. These annotations can be used for supervised training or for benchmarking model performance.
7. Calculate performance metrics against the annotation sample.
8. Run the model over your entire dataset.
9. Review performance on a random stratified subsample.
10. (Optional, generalisability) Test the model at an external site/dataset to validate steps 8 and 9.
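The metric calculation and the stratified subsample review in the workflow above can be sketched as follows. This is a minimal illustration under stated assumptions, not this project's evaluation code: each annotation is represented as a hashable tuple such as `(doc_id, start, end, cui)`, a prediction counts as correct only on an exact match with the gold standard, and the subsample uses equal allocation (the same number of items drawn from every stratum) rather than proportional allocation. The function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, Tuple


def precision_recall_f1(gold: Iterable[Hashable],
                        predicted: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 of predicted annotations against gold."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # annotations found in both sets
    fp = len(pred_set - gold_set)   # predicted but absent from the gold standard
    fn = len(gold_set - pred_set)   # in the gold standard but missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def stratified_sample(items: Iterable, stratum_of: Callable, per_stratum: int,
                      seed: int = 0) -> List:
    """Draw up to per_stratum items at random from each stratum (equal allocation)."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)
    sample = []
    for key in sorted(groups):  # stratum keys are assumed to be sortable
        members = groups[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

A stratum could be, for instance, the concept identifier or the document type, so that rare categories still appear in the manual review. Exact-match scoring is strict; a project that accepts partial span overlaps would need a different matching rule.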