# Evaluation LLM For Text-to-SQL

This doc summarizes the performance of publicly available large language models evaluated on the Spider dev dataset. We hope it provides a point of reference for anyone using these models for Text-to-SQL tasks. We will keep sharing evaluation results from models we have tested or seen others use, and we welcome any contributions to make this list more comprehensive.

## LLMs Text-to-SQL capability evaluation before 20231104

The following execution-accuracy results on Spider are based on the database downloaded from the Spider official website (only about 95 MB in size).
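For context, execution accuracy counts a prediction as correct when the predicted SQL executes successfully and returns the same result set as the gold SQL on the same database. Below is a minimal sketch of that metric; it is not this project's actual evaluation code, and the function names and order-insensitive comparison are illustrative assumptions:

```python
import sqlite3

def run_query(db_path: str, sql: str):
    """Run a query against a SQLite database; return rows, or None on error."""
    try:
        conn = sqlite3.connect(db_path)
        rows = conn.execute(sql).fetchall()
        conn.close()
        return rows
    except sqlite3.Error:
        return None

def execution_accuracy(examples):
    """examples: list of (db_path, gold_sql, predicted_sql) triples."""
    correct = 0
    for db_path, gold, pred in examples:
        gold_rows = run_query(db_path, gold)
        pred_rows = run_query(db_path, pred)
        # Correct when both queries execute and return the same result set
        # (compared order-insensitively via each row's repr).
        if (pred_rows is not None and gold_rows is not None
                and sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))):
            correct += 1
    return correct / len(examples)
```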

| name | Execution Accuracy | reference |
|------|--------------------|-----------|
| GPT-4 | 0.762 | numbersstation-eval-res |
| ChatGPT | 0.728 | numbersstation-eval-res |
| CodeLlama-13b-Instruct-hf_lora | 0.789 | SFT-trained by this project on only the Spider train dataset, evaluated with this project's method (LoRA SFT). The weights have been published. |
| CodeLlama-13b-Instruct-hf_qlora | 0.825 | SFT-trained by this project on around 50 thousand text-to-SQL examples, evaluated with this project's method (LoRA SFT); we made sure the training set was filtered to exclude the Spider eval dataset. |
| CodeLlama-13b-Instruct-hf_qlora | 0.774 | SFT-trained by this project on only the Spider train dataset, evaluated with this project's method (QLoRA with NF4, 4-bit SFT). |
| CodeLlama-7b-Instruct-hf_qlora | 0.623 | SFT-trained by this project on only the Spider train dataset, evaluated with this project's method (QLoRA with NF4, 4-bit SFT). |
| wizardcoder | 0.610 | text-to-sql-wizardcoder |
| CodeLlama-13b-Instruct-hf | 0.556 | evaluated in this project with default parameters |
| Baichuan2-13B-Chat | 0.392 | evaluated in this project with default parameters |
| llama2_13b_hf | 0.449 | numbersstation-eval-res |
| llama2_13b_hf_lora_best | 0.744 | SFT-trained by this project on only the Spider train dataset, evaluated with this project's method. |

It's important to note that our evaluation results are obtained with this project's current parameters. We strive to provide objective assessments, but because certain parameters, such as the temperature value, vary, different people using different methods may obtain different results; these numbers should be treated as a reference only. We welcome more peers to contribute their evaluation results, along with the corresponding parameter values. One way to pin the decoding parameters is sketched below.
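As an example of removing sampling variance when reporting results, greedy decoding makes generation deterministic so that temperature no longer matters. A minimal sketch using Hugging Face transformers; the specific values are assumptions, not this project's defaults:

```python
from transformers import GenerationConfig

# Greedy decoding: temperature only affects generation when do_sample=True,
# so disabling sampling makes runs deterministic and directly comparable.
gen_config = GenerationConfig(
    do_sample=False,    # greedy decoding, no sampling variance
    num_beams=1,        # plain greedy rather than beam search
    max_new_tokens=256, # enough budget for a typical Spider query
)
```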

If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.

## LLMs Text-to-SQL capability evaluation before 20231208

The following execution-accuracy results on Spider are based, this time, on the database downloaded from the Spider-based test-suite (1.27 GB in size), which differs from the 95 MB database on the Spider official website.

| Model | Method | EX (easy) | EX (medium) | EX (hard) | EX (extra) | EX (all) |
|-------|--------|-----------|-------------|-----------|------------|----------|
| Llama2-7B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.887 | 0.641 | 0.489 | 0.331 | 0.626 |
| | qlora | 0.847 | 0.623 | 0.466 | 0.361 | 0.608 |
| Llama2-13B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.907 | 0.729 | 0.552 | 0.343 | 0.68 |
| | qlora | 0.911 | 0.7 | 0.552 | 0.319 | 0.664 |
| CodeLlama-7B-Instruct | base | 0.214 | 0.177 | 0.092 | 0.036 | 0.149 |
| | lora | 0.923 | 0.756 | 0.586 | 0.349 | 0.702 |
| | qlora | 0.911 | 0.751 | 0.598 | 0.331 | 0.696 |
| CodeLlama-13B-Instruct | base | 0.698 | 0.601 | 0.408 | 0.271 | 0.539 |
| | lora | 0.94 | 0.789 | 0.684 | 0.404 | 0.746 |
| | qlora | 0.94 | 0.774 | 0.626 | 0.392 | 0.727 |
| Baichuan2-7B-Chat | base | 0.577 | 0.352 | 0.201 | 0.066 | 0.335 |
| | lora | 0.871 | 0.63 | 0.448 | 0.295 | 0.603 |
| | qlora | 0.891 | 0.637 | 0.489 | 0.331 | 0.624 |
| Baichuan2-13B-Chat | base | 0.581 | 0.413 | 0.264 | 0.187 | 0.392 |
| | lora | 0.903 | 0.702 | 0.569 | 0.392 | 0.678 |
| | qlora | 0.895 | 0.675 | 0.58 | 0.343 | 0.659 |
| Qwen-7B-Chat | base | 0.395 | 0.256 | 0.138 | 0.042 | 0.235 |
| | lora | 0.855 | 0.688 | 0.575 | 0.331 | 0.652 |
| | qlora | 0.911 | 0.675 | 0.575 | 0.343 | 0.662 |
| Qwen-14B-Chat | base | 0.871 | 0.632 | 0.368 | 0.181 | 0.573 |
| | lora | 0.895 | 0.702 | 0.552 | 0.331 | 0.663 |
| | qlora | 0.919 | 0.744 | 0.598 | 0.367 | 0.701 |
| ChatGLM3-6b | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.855 | 0.605 | 0.477 | 0.271 | 0.59 |
| | qlora | 0.843 | 0.603 | 0.506 | 0.211 | 0.581 |

1. All LoRA and QLoRA models are trained with the project defaults on the Spider training set.
2. All candidate models use the same evaluation method and prompt. The prompt explicitly requires the model to output only SQL (an illustrative prompt is sketched below). The base evaluation results of Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; our analysis shows that many errors occur because these models generate content other than SQL.
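For illustration, a prompt in the spirit of note 2 might look like the sketch below. The wording and template variables are hypothetical, not this project's actual prompt:

```python
# Hypothetical prompt template; the project's actual prompt may differ.
PROMPT_TEMPLATE = """You are a Text-to-SQL assistant.
Given the database schema below, write a SQL query that answers the question.
Output only the SQL query, with no explanation or extra text.

Schema:
{schema}

Question: {question}
SQL:"""

# Example instantiation with a toy Spider-style schema and question.
prompt = PROMPT_TEMPLATE.format(
    schema="CREATE TABLE singer (singer_id INT, name TEXT, age INT);",
    question="How many singers are there?",
)
```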

## Acknowledgements

Thanks to the following open source projects.