
Automatic Assignment of ICD codes

Introduction

This repo contains code for assigning ICD codes to medical/clinical text. The data used is the MIMIC-III dataset. Several models have been tried, ranging from linear machine learning models to the state-of-the-art pretrained NLP model BERT.

Structure of the project

At the root of the project, you will have:

  • main.py: used for training and testing the different models
  • requirements.txt: contains the minimum dependencies for running the project
  • w2vmodel.model: gensim word2vec model trained on MIMIC-III discharge summaries (see the loading sketch after this list)
  • src: a folder that contains:
    • bert: utilities and files for the pretrained BERT model
    • cnn: utilities and files for the CNN model
    • hybrid: utilities and files for the hybrid (LSTM+CNN) model
    • rnn: utilities and files for the LSTM and GRU models
    • ovr: utilities and files for different machine learning models (e.g. LR, SVM, Naive Bayes)
    • fit.py: training code shared by the LSTM and CNN models
    • test_results.py: inference code for trained models, used by both the LSTM and CNN models
    • utils.py: general utility code used by all the models
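
As a reference for the w2vmodel.model entry above, here is a minimal sketch of loading it with gensim. It assumes gensim's standard Word2Vec.load API and an example token; the repo's own code may load or use the model differently.

from gensim.models import Word2Vec

# Load the pretrained word2vec model shipped at the repo root (assumed path).
w2v = Word2Vec.load("w2vmodel.model")

# Look up the embedding for a token, if it exists in the vocabulary ("patient" is only an example).
vector = w2v.wv["patient"]
print(vector.shape)  # dimensionality of the trained embeddings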

Dependencies

The dependencies are listed in the requirements.txt file and can be installed with:

pip install -r requirements.txt
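
Optionally, and not required by the repo, the install can be isolated in a virtual environment first:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt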

How to use the code

Launch main.py with the following arguments (a sketch of how they might be parsed follows the list):

  • train_path: path to the training data
  • test_path: path to the test data
  • model_name: one of the 6 implemented models ['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']. Defaults to 'bert'
  • icd_type: type of ICD labels to train on, one of ['icd9cat', 'icd9code', 'icd10cat', 'icd10code']. Defaults to 'icd9cat'
  • epochs: number of epochs
  • batch_size: batch size, defaults to 16 (for the BERT model)
  • val_split: validation split of the training data, defaults to 2/7 (train:val:test = 5:2:3)
  • learning_rate: defaults to 2e-5 (for the BERT model)
  • w2vmodel: path to the pretrained gensim word2vec model
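
The snippet below is a hypothetical sketch of how main.py might parse these arguments using Python's argparse; the flag names, defaults, and the use of argparse itself are assumptions drawn only from the list above, not from the repo's actual code.

import argparse

# Hypothetical parser mirroring the documented options (not the repo's actual code).
parser = argparse.ArgumentParser(description="Train/test ICD code assignment models")
parser.add_argument("--train_path", required=True, help="path to the training data")
parser.add_argument("--test_path", required=True, help="path to the test data")
parser.add_argument("--model_name", default="bert",
                    choices=["bert", "hybrid", "lstm", "gru", "cnn", "ovr"])
parser.add_argument("--icd_type", default="icd9cat",
                    choices=["icd9cat", "icd9code", "icd10cat", "icd10code"])
parser.add_argument("--epochs", type=int, default=5)        # default not stated in this README
parser.add_argument("--batch_size", type=int, default=16)   # default for the BERT model
parser.add_argument("--val_split", type=float, default=2 / 7)
parser.add_argument("--learning_rate", type=float, default=2e-5)
parser.add_argument("--w2vmodel", default="w2vmodel.model",
                    help="path to the pretrained gensim word2vec model")
args = parser.parse_args()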

Example

python main.py --train_path train.csv --test_path test.csv --model_name cnn
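
Assuming the remaining options follow the flag names listed above (only the three flags shown are confirmed by the original example), a fuller invocation might look like:

python main.py --train_path train.csv --test_path test.csv --model_name lstm --icd_type icd9code --epochs 10 --batch_size 16 --w2vmodel w2vmodel.model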

Data

The data used for training can be downloaded from: