NL2SQL

Natural Language to Structured Language Query

Goal

The goal of the project was to work towards a natural language interface that parses a user question or statement, transforms it into a structured criteria representation, and produces an executable clinical data query, expressed as SQL that conforms to an EHR Common Data Model.

Set Up

Clone NL2SQL Repo.

git clone https://github.com/umar1997/NL2SQL.git

Create and Activate Environment

pip install virtualenv
virtualenv nlp2sqlEnv
cp ./NL2SQL/activateEnv.sh .
source activateEnv.sh

Install Dependencies

pip install -r requirements.txt

Download Chia Dataset

source download.sh
mv ./Raw_Data/* ./Data
rmdir ./Raw_Data/

Files

Download the data and model files from here, then place them according to the layout shown in the File Structure section.

Note: The following folders should be ignored (they were used for personal learning):

Hugging Face Tutorial/

MLM/

NER/Extra/

SQL_GEN/Extra/
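
Once the files are in place, a quick check can confirm that nothing is missing before training or running the pipeline. The sketch below is not part of the repository: `find_missing_files` and `EXPECTED_FILES` are hypothetical names, and the list is only an illustrative selection of paths from the File Structure section. Which files actually come from the download is an assumption, so adjust the list as needed.

```python
from pathlib import Path

# Illustrative selection of paths from the File Structure section.
# Assumption: these are the files that must be added by hand.
EXPECTED_FILES = [
    "Data/PreparedText2SQL/train.csv",
    "Data/PreparedText2SQL/validation.csv",
    "Data/PreparedText2SQL/test.csv",
    "Models/Model_Files/ner_model.pt",
    "Models/Model_Files/sql_gen_model.pt",
]


def find_missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected data and model files are present.")
```

Run it from the repository root; an empty result means every listed file was found.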

Code Files

To create Chia_w_scope_data.csv and Chia_wo_scope_data.csv run:

python ./Data_Processing/data_processing.py

To train NER model run:

cd Models/NER/

python main.py \
    --model_type dmis-lab/biobert-v1.1 \
    --tokenizer_type dmis-lab/biobert-v1.1 \
    --data_dir ./../../Data/Chia_w_scope_data.csv \
    --max_seq_length 80 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --num_epochs 5 \
    --val_split 0.30 \
    --seed 42 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --optimizer AdamW \
    --scheduler LinearWarmup \
    --log_folder ./Log_Files/ \
    --log_file biobert_.log
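
To illustrate what `--val_split 0.30` and `--seed 42` likely control, here is a minimal sketch of a seeded train/validation split. It assumes `--val_split` is the fraction of examples held out for validation and that `--seed` fixes the shuffle; the split logic in the repository's `main.py` may differ, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_examples, val_split=0.30, seed=42):
    """Shuffle example indices with a fixed seed and hold out a validation share."""
    if not 0.0 < val_split < 1.0:
        raise ValueError("val_split must be strictly between 0 and 1")
    indices = list(range(num_examples))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_examples * val_split))
    # Returns (train indices, validation indices).
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))  # 700 300
```

Calling the function twice with the same seed yields the same split, which is what makes runs comparable across experiments.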

To train SQL Generation model run:

cd Models/SQL_GEN/

python main.py \
    --model_name mrm8488/t5-base-finetuned-wikiSQL \
    --tokenizer_name mrm8488/t5-base-finetuned-wikiSQL \
    --data_dir ./../../Data/PreparedText2SQL \
    --max_input_length 256 \
    --max_output_length 512 \
    --learning_rate 1e-3 \
    --seed 42 \
    --adam_epsilon 1e-8 \
    --weight_decay 0.01 \
    --num_epochs 5 \
    --train_batch_size 8 \
    --eval_batch_size 8 \
    --max_grad_norm 1.0 \
    --optimizer AdamW \
    --scheduler CosineAnnealingLR \
    --log_folder ./Log_Files/ \
    --log_file finetuned_t5_.log

To run the entire pipeline:

python pipeline.py \
    --input 'Count of patients with paracetamol and brufen'

File Structure

/
├── .gitignore
├── activateEnv.sh
├── Data/
│   ├── Chia_w_scope_data.csv
│   ├── Chia_wo_scope_data.csv
│   ├── PreparedText2SQL/
│   │   ├── test.csv
│   │   ├── train.csv
│   │   └── validation.csv
│   └── Text2SqlData/
│       ├── test.csv
│       ├── train.csv
│       └── validation.csv
├── Data_Processing/
│   ├── data_processing.py
│   ├── Data_Processing_Class.ipynb
│   └── Data_Processing_Functions.ipynb
├── download.sh
├── Exploratory_Data_Analysis/
│   └── EDA.ipynb
├── file_structure.py
├── Files/
│   ├── Documentation.docx
│   └── Trials.xlsx
├── Links.txt
├── log.py
├── Models/
│   ├── Hugging Face Tutorial/
│   │   ├── CS224N PyTorch Tutorial.ipynb
│   │   ├── Files_Created.txt
│   │   └── Hugging_Face_Transformers_Tutorial.ipynb
│   ├── MLM/
│   │   ├── clean.txt
│   │   ├── MLM_Basics.ipynb
│   │   └── output_files/
│   │       └── runs/
│   │           ├── Jun30_16-02-14_ws-l3-002/
│   │           │   ├── 1656590541.2214339/
│   │           │   │   └── events.out.tfevents.1656590541.ws-l3-002.2268523.1
│   │           │   └── events.out.tfevents.1656590541.ws-l3-002.2268523.0
│   │           └── Jun30_16-13-32_ws-l3-002/
│   │               ├── 1656591215.904018/
│   │               │   └── events.out.tfevents.1656591215.ws-l3-002.2315946.1
│   │               └── events.out.tfevents.1656591215.ws-l3-002.2315946.0
│   ├── Model_Files/
│   │   ├── ner_model.pt
│   │   ├── sql_gen_model.pt
│   │   ├── sql_gen_model_checkpoint.ckpt
│   │   ├── T5_tokenizer/
│   │   │   ├── added_tokens.json
│   │   │   ├── special_tokens_map.json
│   │   │   ├── spiece.model
│   │   │   └── tokenizer_config.json
│   │   └── tokenizer/
│   │       ├── special_tokens_map.json
│   │       ├── tokenizer.json
│   │       ├── tokenizer_config.json
│   │       └── vocab.txt
│   ├── NER/
│   │   ├── addedLayers.py
│   │   ├── dataPreparation.py
│   │   ├── dataProcessing.py
│   │   ├── domainClassification.py
│   │   ├── Evaluation_Metrics.ipynb
│   │   ├── evaluationTools.py
│   │   ├── Extra/
│   │   │   ├── BertEntityClassification.py
│   │   │   ├── BioBertNER_from_Scratch.ipynb
│   │   │   ├── Logging.ipynb
│   │   │   ├── make_Json.py
│   │   │   ├── Named Entity Recognition.ipynb
│   │   │   ├── randomLogger.py
│   │   │   ├── randomParameters.py
│   │   │   └── Train_Val_Test_Split.ipynb
│   │   ├── Log_Files/
│   │   │   └── biobert_.log
│   │   ├── main.py
│   │   ├── Model_Differences.ipynb
│   │   ├── Ner_Model.py
│   │   └── test_split.json
│   ├── PL_Model/
│   │   ├── datasetClass.py
│   │   ├── inferencerClass.py
│   │   └── T5PL_Model.py
│   └── SQL_GEN/
│       ├── dataPreparation.py
│       ├── Extra/
│       │   ├── Data_Preparation.ipynb
│       │   └── Examples.ipynb
│       ├── get_dataset.py
│       ├── Log_Files/
│       │   └── finetuned_t5_.log
│       ├── main.py
│       ├── T5_Model.py
│       ├── trainModel.py
│       └── WikiSQL.ipynb
├── pipeline.py
├── README.md
└── requirements.txt

Example Output

#######################
PIPELINE
#######################
NER PHASE
-----------------------
O       Count
O       of
O       patients
O       with
B-Drug  paracetamol
O       and
B-Drug  brufen
-----------------------
PREPROCESSING PHASE
-----------------------
paracetamol
        <ARG-DRUG><0>
brufen
        <ARG-DRUG><1>

Count of patients with <ARG-DRUG><0> and <ARG-DRUG><1>
-----------------------
SQL GENERATION PHASE
-----------------------
SELECT COUNT( DISTINCT dr1.person_id) FROM ((<SCHEMA>.drug_exposure dr1 JOIN <DRUG-TEMPLATE><ARG-DRUG><0> ON dr1.drug_concept_id=concept_id) JOIN (<SCHEMA>.drug_exposure dr2 JOIN <DRUG-TEMPLATE><ARG-DRUG><1> ON dr2.drug_concept_id=concept_id) ON dr1.person_id=dr2.person_id);
#######################
Natural Language Input: Count of patients with paracetamol and brufen


SQL Query Generated: SELECT COUNT( DISTINCT dr1.person_id) FROM ((<SCHEMA>.drug_exposure dr1 JOIN <DRUG-TEMPLATE><ARG-DRUG><0> ON dr1.drug_concept_id=concept_id) JOIN (<SCHEMA>.drug_exposure dr2 JOIN <DRUG-TEMPLATE><ARG-DRUG><1> ON dr2.drug_concept_id=concept_id) ON dr1.person_id=dr2.person_id);
#######################
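
The preprocessing phase shown above replaces each recognised entity with an indexed placeholder before the sentence is passed to the SQL generation model. The following is a minimal sketch of that step, not the repository's implementation: `mask_entities` is a hypothetical helper, and it assumes BIO-style tags where `I-` tags continue the preceding entity span.

```python
def mask_entities(tokens, tags):
    """Replace tagged entity spans with <ARG-TYPE><n> placeholders.

    Returns the masked sentence and a placeholder -> original text mapping.
    """
    counters = {}   # next index per entity type, e.g. {"DRUG": 2}
    mapping = {}    # placeholder -> original entity text
    masked = []     # tokens of the masked sentence
    current = None  # placeholder of the span currently being extended
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: allocate the next index for its type.
            entity_type = tag[2:].upper()
            index = counters.get(entity_type, 0)
            counters[entity_type] = index + 1
            current = "<ARG-{}><{}>".format(entity_type, index)
            mapping[current] = token
            masked.append(current)
        elif tag.startswith("I-") and current is not None:
            # Assumption: an I- tag extends the most recent entity span.
            mapping[current] += " " + token
        else:
            current = None
            masked.append(token)
    return " ".join(masked), mapping


tokens = "Count of patients with paracetamol and brufen".split()
tags = ["O", "O", "O", "O", "B-Drug", "O", "B-Drug"]
sentence, mapping = mask_entities(tokens, tags)
print(sentence)  # Count of patients with <ARG-DRUG><0> and <ARG-DRUG><1>
```

The returned mapping keeps the original drug names, so the placeholders in the generated SQL can later be resolved back to concrete values.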
