Source code for RYANSQL, a text-to-SQL system for complex, cross-domain databases.
Reference Paper: Choi et al., RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases, 2020
The system was submitted to the SPIDER leaderboard. The system and its slightly improved version, RYANSQL v2, are ranked second and fourth (as of February 2020).
The system does NOT use any database records, which makes it better suited to real-world company applications.
Python3
Tensorflow 1.14
nltk
Download the BERT pretrained model. You only need to download the model, not the whole git repository. The system uses the BERT-large, uncased model with Whole Word Masking. Unzip the downloaded file.
Download the SPIDER dataset from https://yale-lily.github.io/spider. Unzip the downloaded file.
Run:
python src/trainer.py [BERT_DIR] [SPIDER_DATASET_DIR]
An example is:
python src/trainer.py ./wwm_uncased_L-24_H-1024_A-16 ./spider
Training takes about a day on a single Tesla V100 GPU. The dev set performance reported during training is the exact slot matching performance, including ordering; it ranges between 55 and 57% for the final model.
The required files of the SPIDER dataset are tables.json, train_spider.json, and train_others.json, plus dev.json for testing.
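Before starting a day-long training run, it can help to confirm that the dataset directory is complete. The following is a minimal sketch, not part of the RYANSQL code: the helper name missing_spider_files is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical
# and not part of the RYANSQL source code.
TRAINING_FILES = ("tables.json", "train_spider.json", "train_others.json", "dev.json")


def missing_spider_files(dataset_dir, required=TRAINING_FILES):
    """Return the required file names that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling missing_spider_files("./spider") returns an empty list when all four files are in place, and otherwise the names of the files that still need to be copied into the directory.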
Clone the Spider git repository (https://github.com/taoyds/spider), and add its local directory to Python's sys.path.
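One way to add the cloned directory to sys.path is sketched below. The helper name add_to_sys_path and the location ./spider_repo are assumptions for illustration; replace the path with wherever the Spider repository was actually cloned.

```python
import os
import sys


def add_to_sys_path(directory):
    """Put directory at the front of sys.path (only once) and return its absolute path."""
    path = os.path.abspath(os.path.expanduser(directory))
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Hypothetical location of the local Spider clone; adjust to your checkout.
add_to_sys_path("./spider_repo")
```

Setting the PYTHONPATH environment variable to the same directory before launching Python has the same effect without changing any code.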
Run:
python src/actual_test.py [MODEL_PATH] [BERT_DIR] [SPIDER_DATASET_DIR] [OUT_FILE]
to generate the SQL statements for the development set. The generated output file can then be evaluated using SPIDER's evaluation script.
With the final model, the score from the evaluation script ranges from 64 to 66%. This is higher than the dev set performance shown during training, since the ordering of conditions is not important for an actual SQL statement.
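The gap between the two numbers can be illustrated with a toy comparison. The sketch below is a simplified illustration only, assuming conditions are represented as (column, operator, value) tuples; it is neither the metric code used by the trainer nor SPIDER's evaluation script.

```python
from collections import Counter


def exact_match_ordered(pred_conds, gold_conds):
    """Order-sensitive match: the same conditions in the same positions."""
    return list(pred_conds) == list(gold_conds)


def exact_match_unordered(pred_conds, gold_conds):
    """Order-insensitive match: the same multiset of conditions."""
    return Counter(pred_conds) == Counter(gold_conds)


# Hypothetical WHERE conditions, written as (column, operator, value) tuples.
gold = [("age", ">", "30"), ("country", "=", "'France'")]
pred = [("country", "=", "'France'"), ("age", ">", "30")]

print(exact_match_ordered(pred, gold))    # False
print(exact_match_unordered(pred, gold))  # True
```

A prediction like pred counts as wrong under the order-sensitive check, yet it describes the same SQL query, which is why an order-insensitive evaluation reports a higher score.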
The required files of the SPIDER dataset are tables.json for database schema information and dev.json for the development dataset.