This document summarizes the performance of publicly available large language models (LLMs) evaluated on the Spider dev set. We hope it provides a point of reference for anyone using these models for Text-to-SQL tasks. We will keep publishing evaluation results for models we have tested or seen others use, and we welcome contributions that make the list more comprehensive.
The execution accuracy results below are based on the Spider database downloaded from the official Spider website (about 95 MB in size).
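For intuition, execution accuracy compares the rows returned by the predicted SQL with the rows returned by the gold SQL on the same database. The sketch below only illustrates the idea; it is not this project's evaluation code, and the `execution_match` helper is hypothetical:

```python
import sqlite3


def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if both queries execute and yield the same rows (order-insensitive)."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    finally:
        conn.close()
    # Compare result sets without caring about row order.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```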
Name | Execution Accuracy | Reference |
---|---|---|
GPT-4 | 0.762 | numbersstation-eval-res |
ChatGPT | 0.728 | numbersstation-eval-res |
CodeLlama-13b-Instruct-hf_lora | 0.789 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with LoRA SFT. The weights have been published. |
CodeLlama-13b-Instruct-hf_qlora | 0.825 | SFT trained by this project, using around 50 thousand text-to-SQL examples; evaluated the same way as in this project, with LoRA SFT. The training set was filtered to exclude the Spider eval dataset. |
CodeLlama-13b-Instruct-hf_qlora | 0.774 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with QLoRA (NF4, 4-bit) SFT. |
CodeLlama-7b-Instruct-hf_qlora | 0.623 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with QLoRA (NF4, 4-bit) SFT. |
wizardcoder | 0.610 | text-to-sql-wizardcoder |
CodeLlama-13b-Instruct-hf | 0.556 | Evaluated in this project with default parameters. |
Baichuan2-13B-Chat | 0.392 | Evaluated in this project with default parameters. |
llama2_13b_hf | 0.449 | numbersstation-eval-res |
llama2_13b_hf_lora_best | 0.744 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project. |
Note that our evaluation results are obtained with this project's relevant parameters. We strive to provide objective assessments, but because results are sensitive to settings such as the temperature, different people may obtain different numbers with different setups. These results should be taken as a reference only. We welcome more peers to contribute their evaluation results, along with the corresponding parameter values.
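As a purely illustrative example of what "corresponding parameter values" could look like (the names follow common Hugging Face `generate()` keyword arguments; the values are examples, not this project's defaults):

```python
# Illustrative decoding settings worth reporting alongside any result.
# These are example values, not the defaults used by this project.
generation_params = dict(
    do_sample=False,     # greedy decoding, so repeated runs give the same SQL
    temperature=1.0,     # only takes effect when do_sample=True
    max_new_tokens=512,  # cap on the length of the generated SQL
)
# e.g. outputs = model.generate(**inputs, **generation_params)
```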
If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.
The following execution accuracy results on Spider are based on the database downloaded from the Spider-based test-suite (about 1.27 GB in size), which differs from the roughly 95 MB database on the official Spider website.
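The five numeric columns in the table below are read here as execution accuracy broken down by the standard Spider hardness buckets (easy, medium, hard, extra) plus the overall score. This interpretation, and the bucket sizes used in the sanity check below, are assumptions rather than something the table states itself:

```python
# Sanity check, assuming the usual Spider dev-set hardness bucket sizes.
# The per-bucket numbers come from the CodeLlama-13B-Instruct LoRA row below.
bucket_sizes = {"easy": 248, "medium": 446, "hard": 174, "extra": 166}        # assumed counts
bucket_ex = {"easy": 0.940, "medium": 0.789, "hard": 0.684, "extra": 0.404}   # from the table

total = sum(bucket_sizes.values())
overall = sum(bucket_ex[k] * bucket_sizes[k] for k in bucket_sizes) / total
print(round(overall, 3))  # ~0.746, matching the EX (all) column of that row
```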
| Model | Method | EX (easy) | EX (medium) | EX (hard) | EX (extra) | EX (all) |
|---|---|---|---|---|---|---|
| Llama2-7B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.887 | 0.641 | 0.489 | 0.331 | 0.626 |
| | qlora | 0.847 | 0.623 | 0.466 | 0.361 | 0.608 |
| Llama2-13B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.907 | 0.729 | 0.552 | 0.343 | 0.680 |
| | qlora | 0.911 | 0.700 | 0.552 | 0.319 | 0.664 |
| CodeLlama-7B-Instruct | base | 0.214 | 0.177 | 0.092 | 0.036 | 0.149 |
| | lora | 0.923 | 0.756 | 0.586 | 0.349 | 0.702 |
| | qlora | 0.911 | 0.751 | 0.598 | 0.331 | 0.696 |
| CodeLlama-13B-Instruct | base | 0.698 | 0.601 | 0.408 | 0.271 | 0.539 |
| | lora | 0.940 | 0.789 | 0.684 | 0.404 | 0.746 |
| | qlora | 0.940 | 0.774 | 0.626 | 0.392 | 0.727 |
| Baichuan2-7B-Chat | base | 0.577 | 0.352 | 0.201 | 0.066 | 0.335 |
| | lora | 0.871 | 0.630 | 0.448 | 0.295 | 0.603 |
| | qlora | 0.891 | 0.637 | 0.489 | 0.331 | 0.624 |
| Baichuan2-13B-Chat | base | 0.581 | 0.413 | 0.264 | 0.187 | 0.392 |
| | lora | 0.903 | 0.702 | 0.569 | 0.392 | 0.678 |
| | qlora | 0.895 | 0.675 | 0.580 | 0.343 | 0.659 |
| Qwen-7B-Chat | base | 0.395 | 0.256 | 0.138 | 0.042 | 0.235 |
| | lora | 0.855 | 0.688 | 0.575 | 0.331 | 0.652 |
| | qlora | 0.911 | 0.675 | 0.575 | 0.343 | 0.662 |
| Qwen-14B-Chat | base | 0.871 | 0.632 | 0.368 | 0.181 | 0.573 |
| | lora | 0.895 | 0.702 | 0.552 | 0.331 | 0.663 |
| | qlora | 0.919 | 0.744 | 0.598 | 0.367 | 0.701 |
| ChatGLM3-6b | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.855 | 0.605 | 0.477 | 0.271 | 0.590 |
| | qlora | 0.843 | 0.603 | 0.506 | 0.211 | 0.581 |
1. The LoRA and QLoRA weights for all models were trained with default settings on the Spider training set.
2. All candidate models were evaluated with the same method and prompt, and the prompt explicitly requires the model to output only SQL. The base (non-fine-tuned) results for Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; on inspection, most failures occur because these models generate content other than SQL.
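A minimal sketch of the kind of post-processing this implies: keeping only the SQL from a chat model's reply. The `extract_sql` helper below is hypothetical and is not this project's code:

```python
import re


def extract_sql(reply: str) -> str:
    """Best-effort extraction of a single SQL statement from a chat model reply."""
    # Drop markdown code fences if the model wrapped its answer in one.
    reply = re.sub(r"`{3}(?:sql)?", "", reply, flags=re.IGNORECASE)
    # Keep the first SELECT ... statement, up to a semicolon or the end of the text.
    match = re.search(r"SELECT\b.*?(?:;|$)", reply, flags=re.DOTALL | re.IGNORECASE)
    return match.group(0).strip() if match else reply.strip()
```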
Thanks to the following open source projects.