This document summarizes the performance of publicly available large language models (LLMs) evaluated on the Spider dev set. We hope it provides a point of reference for anyone using these models for Text-to-SQL tasks. We will keep publishing evaluation results for models we have tested or seen others use, and we welcome contributions that make the list more comprehensive.
The execution accuracy results below are based on the Spider database downloaded from the official Spider website (about 95 MB in size).
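For intuition, execution accuracy compares the rows returned by the predicted SQL with the rows returned by the gold SQL on the same database. The sketch below only illustrates the idea; it is not this project's evaluation code, and the `execution_match` helper is hypothetical:

```python
import sqlite3


def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if both queries execute and yield the same rows (order-insensitive)."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    finally:
        conn.close()
    # Compare result sets without caring about row order.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```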
Name | Execution Accuracy | Reference |
---|---|---|
GPT-4 | 0.762 | numbersstation-eval-res |
ChatGPT | 0.728 | numbersstation-eval-res |
CodeLlama-13b-Instruct-hf_lora | 0.789 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with LoRA SFT. The weights have been published. |
CodeLlama-13b-Instruct-hf_qlora | 0.825 | SFT trained by this project, using around 50 thousand text-to-SQL examples; evaluated the same way as in this project, with LoRA SFT. The training set was filtered to exclude the Spider eval dataset. |
CodeLlama-13b-Instruct-hf_qlora | 0.774 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with QLoRA (NF4, 4-bit) SFT. |
CodeLlama-7b-Instruct-hf_qlora | 0.623 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project, with QLoRA (NF4, 4-bit) SFT. |
wizardcoder | 0.610 | text-to-sql-wizardcoder |
CodeLlama-13b-Instruct-hf | 0.556 | Evaluated in this project with default parameters. |
Baichuan2-13B-Chat | 0.392 | Evaluated in this project with default parameters. |
llama2_13b_hf | 0.449 | numbersstation-eval-res |
llama2_13b_hf_lora_best | 0.744 | SFT trained by this project, using only the Spider train set; evaluated the same way as in this project. |
Note that our evaluation results are obtained with this project's relevant parameters. We strive to provide objective assessments, but because results are sensitive to settings such as the temperature, different people may obtain different numbers with different setups. These results should be taken as a reference only. We welcome more peers to contribute their evaluation results, along with the corresponding parameter values.
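As a purely illustrative example of what "corresponding parameter values" could look like (the names follow common Hugging Face `generate()` keyword arguments; the values are examples, not this project's defaults):

```python
# Illustrative decoding settings worth reporting alongside any result.
# These are example values, not the defaults used by this project.
generation_params = dict(
    do_sample=False,     # greedy decoding, so repeated runs give the same SQL
    temperature=1.0,     # only takes effect when do_sample=True
    max_new_tokens=512,  # cap on the length of the generated SQL
)
# e.g. outputs = model.generate(**inputs, **generation_params)
```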
If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.
The following execution accuracy results on Spider are based on the database downloaded from the Spider-based test-suite (about 1.27 GB in size), which differs from the roughly 95 MB database on the official Spider website.
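The five numeric columns in the table below are read here as execution accuracy broken down by the standard Spider hardness buckets (easy, medium, hard, extra) plus the overall score. This interpretation, and the bucket sizes used in the sanity check below, are assumptions rather than something the table states itself:

```python
# Sanity check, assuming the usual Spider dev-set hardness bucket sizes.
# The per-bucket numbers come from the CodeLlama-13B-Instruct LoRA row below.
bucket_sizes = {"easy": 248, "medium": 446, "hard": 174, "extra": 166}        # assumed counts
bucket_ex = {"easy": 0.940, "medium": 0.789, "hard": 0.684, "extra": 0.404}   # from the table

total = sum(bucket_sizes.values())
overall = sum(bucket_ex[k] * bucket_sizes[k] for k in bucket_sizes) / total
print(round(overall, 3))  # ~0.746, matching the EX (all) column of that row
```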
| Model | Method | EX (easy) | EX (medium) | EX (hard) | EX (extra) | EX (all) |
|---|---|---|---|---|---|---|
| Llama2-7B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.887 | 0.641 | 0.489 | 0.331 | 0.626 |
| | qlora | 0.847 | 0.623 | 0.466 | 0.361 | 0.608 |
| Llama2-13B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.907 | 0.729 | 0.552 | 0.343 | 0.680 |
| | qlora | 0.911 | 0.700 | 0.552 | 0.319 | 0.664 |
| CodeLlama-7B-Instruct | base | 0.214 | 0.177 | 0.092 | 0.036 | 0.149 |
| | lora | 0.923 | 0.756 | 0.586 | 0.349 | 0.702 |
| | qlora | 0.911 | 0.751 | 0.598 | 0.331 | 0.696 |
| CodeLlama-13B-Instruct | base | 0.698 | 0.601 | 0.408 | 0.271 | 0.539 |
| | lora | 0.940 | 0.789 | 0.684 | 0.404 | 0.746 |
| | qlora | 0.940 | 0.774 | 0.626 | 0.392 | 0.727 |
| Baichuan2-7B-Chat | base | 0.577 | 0.352 | 0.201 | 0.066 | 0.335 |
| | lora | 0.871 | 0.630 | 0.448 | 0.295 | 0.603 |
| | qlora | 0.891 | 0.637 | 0.489 | 0.331 | 0.624 |
| Baichuan2-13B-Chat | base | 0.581 | 0.413 | 0.264 | 0.187 | 0.392 |
| | lora | 0.903 | 0.702 | 0.569 | 0.392 | 0.678 |
| | qlora | 0.895 | 0.675 | 0.580 | 0.343 | 0.659 |
| Qwen-7B-Chat | base | 0.395 | 0.256 | 0.138 | 0.042 | 0.235 |
| | lora | 0.855 | 0.688 | 0.575 | 0.331 | 0.652 |
| | qlora | 0.911 | 0.675 | 0.575 | 0.343 | 0.662 |
| Qwen-14B-Chat | base | 0.871 | 0.632 | 0.368 | 0.181 | 0.573 |
| | lora | 0.895 | 0.702 | 0.552 | 0.331 | 0.663 |
| | qlora | 0.919 | 0.744 | 0.598 | 0.367 | 0.701 |
| ChatGLM3-6b | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.855 | 0.605 | 0.477 | 0.271 | 0.590 |
| | qlora | 0.843 | 0.603 | 0.506 | 0.211 | 0.581 |
1. The LoRA and QLoRA weights for all models were trained with default settings on the Spider training set.
2. All candidate models were evaluated with the same method and prompt, and the prompt explicitly requires the model to output only SQL. The base (non-fine-tuned) results for Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; on inspection, most failures occur because these models generate content other than SQL.
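A minimal sketch of the kind of post-processing this implies: keeping only the SQL from a chat model's reply. The `extract_sql` helper below is hypothetical and is not this project's code:

```python
import re


def extract_sql(reply: str) -> str:
    """Best-effort extraction of a single SQL statement from a chat model reply."""
    # Drop markdown code fences if the model wrapped its answer in one.
    reply = re.sub(r"`{3}(?:sql)?", "", reply, flags=re.IGNORECASE)
    # Keep the first SELECT ... statement, up to a semicolon or the end of the text.
    match = re.search(r"SELECT\b.*?(?:;|$)", reply, flags=re.DOTALL | re.IGNORECASE)
    return match.group(0).strip() if match else reply.strip()
```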
Thanks to the following open source projects.