The goal of this project was to work towards a natural language interface that parses a user question or statement, transforms it into a structured criteria representation, and produces an executable clinical data query, expressed as an SQL query conforming to an EHR Common Data Model.
- Clone NL2SQL Repo.
git clone https://github.com/umar1997/NL2SQL.git
- Create and Activate Environment
pip install virtualenv
virtualenv nlp2sqlEnv
cp ./NL2SQL/activateEnv.sh .
source activateEnv.sh
- Install Dependencies
pip install -r requirements.txt
- Download Chia Dataset
source download.sh
mv ./Raw_Data/* ./Data
rmdir ./Raw_Data/
Download the data and model files from here, then place them according to the file structure shown in the following section.
Note: The following folders can be ignored (they were used for personal learning):
- Hugging Face Tutorial/
- MLM/
- NER/Extra/
- SQL_GEN/Extra/
- To create Chia_w_scope_data.csv and Chia_wo_scope_data.csv run:
python ./Data_Processing/data_processing.py
- To train NER model run:
cd Models/NER/
python main.py \
--model_type dmis-lab/biobert-v1.1 \
--tokenizer_type dmis-lab/biobert-v1.1 \
--data_dir ./../../Data/Chia_w_scope_data.csv \
--max_seq_length 80 \
--batch_size 16 \
--learning_rate 5e-5 \
--num_epochs 5 \
--val_split 0.30 \
--seed 42 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--optimizer AdamW \
--scheduler LinearWarmup \
--log_folder ./Log_Files/ \
--log_file biobert_.log
- To train SQL Generation model run:
cd Models/SQL_GEN/
python main.py \
--model_name mrm8488/t5-base-finetuned-wikiSQL \
--tokenizer_name mrm8488/t5-base-finetuned-wikiSQL \
--data_dir ./../../Data/PreparedText2SQL \
--max_input_length 256 \
--max_output_length 512 \
--learning_rate 1e-3 \
--seed 42 \
--adam_epsilon 1e-8 \
--weight_decay 0.01 \
--num_epochs 5 \
--train_batch_size 8 \
--eval_batch_size 8 \
--max_grad_norm 1.0 \
--optimizer AdamW \
--scheduler CosineAnnealingLR \
--log_folder ./Log_Files/ \
--log_file finetuned_t5_.log
- To run the entire pipeline:
python pipeline.py \
--input 'Count of patients with paracetamol and brufen'
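The NER training flags listed above can be mirrored in a small argument parser. The following is only a sketch: whether `Models/NER/main.py` actually uses `argparse` is an assumption, the function name `build_ner_parser` is hypothetical, and the defaults are simply copied from the example command rather than taken from the repository.

```python
import argparse


def build_ner_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the NER flags from the example command."""
    parser = argparse.ArgumentParser(description="NER training options (sketch)")
    # Model and data locations
    parser.add_argument("--model_type", type=str, default="dmis-lab/biobert-v1.1")
    parser.add_argument("--tokenizer_type", type=str, default="dmis-lab/biobert-v1.1")
    parser.add_argument("--data_dir", type=str, default="./../../Data/Chia_w_scope_data.csv")
    # Training hyperparameters (defaults assumed from the example command)
    parser.add_argument("--max_seq_length", type=int, default=80)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--num_epochs", type=int, default=5)
    parser.add_argument("--val_split", type=float, default=0.30)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--adam_epsilon", type=float, default=1e-8)
    parser.add_argument("--max_grad_norm", type=float, default=1.0)
    parser.add_argument("--optimizer", type=str, default="AdamW")
    parser.add_argument("--scheduler", type=str, default="LinearWarmup")
    # Logging
    parser.add_argument("--log_folder", type=str, default="./Log_Files/")
    parser.add_argument("--log_file", type=str, default="biobert_.log")
    return parser
```

Calling `build_ner_parser().parse_args(["--num_epochs", "3"])` would override a single value and keep the remaining defaults.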
/
├── .gitignore
├── activateEnv.sh
├── Data/
│   ├── Chia_w_scope_data.csv
│   ├── Chia_wo_scope_data.csv
│   ├── PreparedText2SQL/
│   │   ├── test.csv
│   │   ├── train.csv
│   │   └── validation.csv
│   └── Text2SqlData/
│       ├── test.csv
│       ├── train.csv
│       └── validation.csv
├── Data_Processing/
│   ├── data_processing.py
│   ├── Data_Processing_Class.ipynb
│   └── Data_Processing_Functions.ipynb
├── download.sh
├── Exploratory_Data_Analysis/
│   └── EDA.ipynb
├── file_structure.py
├── Files/
│   ├── Documentation.docx
│   └── Trials.xlsx
├── Links.txt
├── log.py
├── Models/
│   ├── Hugging Face Tutorial/
│   │   ├── CS224N PyTorch Tutorial.ipynb
│   │   ├── Files_Created.txt
│   │   └── Hugging_Face_Transformers_Tutorial.ipynb
│   ├── MLM/
│   │   ├── clean.txt
│   │   ├── MLM_Basics.ipynb
│   │   └── output_files/
│   │       └── runs/
│   │           ├── Jun30_16-02-14_ws-l3-002/
│   │           │   ├── 1656590541.2214339/
│   │           │   │   └── events.out.tfevents.1656590541.ws-l3-002.2268523.1
│   │           │   └── events.out.tfevents.1656590541.ws-l3-002.2268523.0
│   │           └── Jun30_16-13-32_ws-l3-002/
│   │               ├── 1656591215.904018/
│   │               │   └── events.out.tfevents.1656591215.ws-l3-002.2315946.1
│   │               └── events.out.tfevents.1656591215.ws-l3-002.2315946.0
│   ├── Model_Files/
│   │   ├── ner_model.pt
│   │   ├── sql_gen_model.pt
│   │   ├── sql_gen_model_checkpoint.ckpt
│   │   ├── T5_tokenizer/
│   │   │   ├── added_tokens.json
│   │   │   ├── special_tokens_map.json
│   │   │   ├── spiece.model
│   │   │   └── tokenizer_config.json
│   │   └── tokenizer/
│   │       ├── special_tokens_map.json
│   │       ├── tokenizer.json
│   │       ├── tokenizer_config.json
│   │       └── vocab.txt
│   ├── NER/
│   │   ├── addedLayers.py
│   │   ├── dataPreparation.py
│   │   ├── dataProcessing.py
│   │   ├── domainClassification.py
│   │   ├── Evaluation_Metrics.ipynb
│   │   ├── evaluationTools.py
│   │   ├── Extra/
│   │   │   ├── BertEntityClassification.py
│   │   │   ├── BioBertNER_from_Scratch.ipynb
│   │   │   ├── Logging.ipynb
│   │   │   ├── make_Json.py
│   │   │   ├── Named Entity Recognition.ipynb
│   │   │   ├── randomLogger.py
│   │   │   ├── randomParameters.py
│   │   │   └── Train_Val_Test_Split.ipynb
│   │   ├── Log_Files/
│   │   │   └── biobert_.log
│   │   ├── main.py
│   │   ├── Model_Differences.ipynb
│   │   ├── Ner_Model.py
│   │   └── test_split.json
│   ├── PL_Model/
│   │   ├── datasetClass.py
│   │   ├── inferencerClass.py
│   │   └── T5PL_Model.py
│   └── SQL_GEN/
│       ├── dataPreparation.py
│       ├── Extra/
│       │   ├── Data_Preparation.ipynb
│       │   └── Examples.ipynb
│       ├── get_dataset.py
│       ├── Log_Files/
│       │   └── finetuned_t5_.log
│       ├── main.py
│       ├── T5_Model.py
│       ├── trainModel.py
│       └── WikiSQL.ipynb
├── pipeline.py
├── README.md
└── requirements.txt
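After downloading the data and model files, a quick check that they landed in the layout above can save a failed run. The sketch below is not part of the repository: the helper name `missing_paths` is hypothetical, and the list of required paths is an assumption drawn from the tree above (the pipeline may need more or fewer files).

```python
from pathlib import Path
from typing import Iterable, List, Union

# Assumed minimal set of files and folders, taken from the file structure above.
EXPECTED_PATHS = [
    "Data/Chia_w_scope_data.csv",
    "Data/Chia_wo_scope_data.csv",
    "Data/PreparedText2SQL/train.csv",
    "Models/Model_Files/ner_model.pt",
    "Models/Model_Files/sql_gen_model.pt",
    "Models/Model_Files/tokenizer",
    "Models/Model_Files/T5_tokenizer",
]


def missing_paths(root: Union[str, Path],
                  expected: Iterable[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Running `missing_paths(".")` from the repository root returns an empty list when every listed path is present.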
#######################
PIPELINE
#######################
NER PHASE
-----------------------
O Count
O of
O patients
O with
B-Drug paracetamol
O and
B-Drug brufen
-----------------------
PREPROCESSING PHASE
-----------------------
paracetamol
<ARG-DRUG><0>
brufen
<ARG-DRUG><1>
Count of patients with <ARG-DRUG><0> and <ARG-DRUG><1>
-----------------------
SQL GENERATION PHASE
-----------------------
SELECT COUNT( DISTINCT dr1.person_id) FROM ((<SCHEMA>.drug_exposure dr1 JOIN <DRUG-TEMPLATE><ARG-DRUG><0> ON dr1.drug_concept_id=concept_id) JOIN (<SCHEMA>.drug_exposure dr2 JOIN <DRUG-TEMPLATE><ARG-DRUG><1> ON dr2.drug_concept_id=concept_id) ON dr1.person_id=dr2.person_id);
#######################
Natural Language Input: Count of patients with paracetamol and brufen
SQL Query Generated: SELECT COUNT( DISTINCT dr1.person_id) FROM ((<SCHEMA>.drug_exposure dr1 JOIN <DRUG-TEMPLATE><ARG-DRUG><0> ON dr1.drug_concept_id=concept_id) JOIN (<SCHEMA>.drug_exposure dr2 JOIN <DRUG-TEMPLATE><ARG-DRUG><1> ON dr2.drug_concept_id=concept_id) ON dr1.person_id=dr2.person_id);
#######################
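The preprocessing phase in the output above replaces each recognised entity with an indexed placeholder before SQL generation. A minimal sketch of that step, assuming the NER phase yields (tag, token) pairs as printed: the function name `mask_entities` is hypothetical, and the handling of `I-` continuation tags is an assumption, since the sample output only contains `B-` tags.

```python
from typing import Dict, List, Optional, Tuple


def mask_entities(tagged: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Replace BIO-tagged entity spans with placeholders such as <ARG-DRUG><0>.

    Returns the masked sentence and a placeholder-to-entity mapping.
    """
    counters: Dict[str, int] = {}   # next index per entity label
    mapping: Dict[str, str] = {}    # placeholder -> original entity text
    masked: List[str] = []
    current: Optional[str] = None   # placeholder of the span being built

    for tag, token in tagged:
        if tag.startswith("B-"):
            label = tag[2:].upper()
            index = counters.get(label, 0)
            counters[label] = index + 1
            current = f"<ARG-{label}><{index}>"
            mapping[current] = token
            masked.append(current)
        elif tag.startswith("I-") and current is not None:
            # Assumed behaviour: continuation tokens extend the open span.
            mapping[current] += " " + token
        else:
            current = None
            masked.append(token)

    return " ".join(masked), mapping


tagged = [
    ("O", "Count"), ("O", "of"), ("O", "patients"), ("O", "with"),
    ("B-Drug", "paracetamol"), ("O", "and"), ("B-Drug", "brufen"),
]
sentence, entities = mask_entities(tagged)
print(sentence)  # Count of patients with <ARG-DRUG><0> and <ARG-DRUG><1>
```

Keeping the mapping alongside the masked sentence makes it possible to look up the original drug names once the generated SQL references `<ARG-DRUG><0>` and `<ARG-DRUG><1>`.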