Commit 23a833a

Docs: Add Vanna Chat2SQL tutorial (#170)
Signed-off-by: Zhao Yuanjie <qq157755587@gmail.com>
1 parent f3a2600 commit 23a833a

6 files changed: +330 -0 lines

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# Large Language Model (LLM) Retrieval-Augmented Generation (RAG) Development Guide

## Large Language Models (LLM) and Retrieval-Augmented Generation (RAG)

A Large Language Model (LLM) is an AI model designed to understand and generate human language.

The term LLM usually refers to language models with tens of billions of parameters or more, trained on vast amounts of text data to gain a deep understanding of language. Well-known LLMs developed outside China include GPT-3.5, GPT-4, PaLM, Claude, and LLaMA; well-known models developed in China include Wenxin Yiyan, Xunfei Spark, Tongyi Qianwen, ChatGLM, and Baichuan.

To explore the limits of performance, researchers have trained ever larger language models, such as GPT-3 with 175 billion parameters and PaLM with 540 billion parameters. Although these models use architectures and pre-training tasks similar to those of smaller models (such as BERT with 330 million parameters and GPT-2 with 1.5 billion parameters), they exhibit strikingly different capabilities, especially on complex tasks. These capabilities are known as "emergent abilities". For example, GPT-3 can solve few-shot tasks through in-context learning, while GPT-2 performs poorly at this. The research community therefore coined the name "Large Language Model (LLM)" for these models. A notable application of LLMs is ChatGPT, a bold attempt to apply the GPT series of LLMs to human-like conversation, with fluent and natural results.

LLMs are more capable than traditional language models, yet they may still fail to give accurate answers in some cases. To address the challenges LLMs face in text generation and to improve output quality, researchers proposed a new architecture: Retrieval-Augmented Generation (RAG). RAG retrieves relevant information from a large knowledge base and uses it to guide the LLM during generation, which improves the accuracy and depth of the answers. RAG has been applied successfully in several fields, including question answering, dialogue systems, document summarization, and document generation.
## Building RAG Applications in KDP

KDP makes it convenient to combine local or online large models with your own data to build RAG applications. The rest of this guide uses a Text to SQL scenario as an example.

SQL is widely used in data analysis. Although SQL is relatively close to natural language, it still poses barriers for business users:

- They must learn SQL syntax.
- They must understand the table structures and know exactly which business data lives in which tables.

Can a large model, combined with private table structure information, help with data analysis? For example, can we simply ask, "Who sold the most iPhones last month?"

The answer is yes.

To keep development simple, this guide uses [Vanna](https://github.com/vanna-ai/vanna) to implement Text to SQL. If you need more flexibility, consider other tools such as [LangChain](https://github.com/langchain-ai/langchain).
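To make the idea concrete, the sketch below shows roughly how a RAG-style Text to SQL request can be assembled: table definitions retrieved from a knowledge base are placed in the prompt next to the user's question. This is only an illustration, not Vanna's actual prompt. The function name `build_sql_prompt`, the prompt wording, and the `sales` table are all assumptions made for this example.

```python
from typing import List


def build_sql_prompt(question: str, ddl_statements: List[str], dialect: str = "Hive") -> str:
    """Combine retrieved table definitions and the user's question into one prompt."""
    schema_context = "\n\n".join(ddl.strip() for ddl in ddl_statements)
    return (
        f"You are a {dialect} SQL expert. Use only the tables defined below.\n\n"
        f"=== Table definitions ===\n{schema_context}\n\n"
        f"=== Question ===\n{question}\n\n"
        "Respond with a single SQL query and nothing else."
    )


# Hypothetical schema, used only for illustration.
retrieved_ddl = [
    "CREATE TABLE sales (seller string, product string, amount bigint, sold_at date)",
]
prompt = build_sql_prompt("Who sold the most iPhones last month?", retrieved_ddl)
print(prompt)
```

The model never sees the whole database, only the definitions that the retrieval step judged relevant, which is what keeps private schema information in the loop without retraining the model.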
### Component Dependencies

Install the following components in KDP:

- ollama
- milvus
- jupyterlab

Ollama runs large models locally, Milvus stores the vectorized data, and JupyterLab serves as the development environment.
### Running Large Models Locally

This guide uses [phi3](https://ollama.com/library/phi3) as the example model. Pull it inside the Ollama pod:

```shell
kubectl exec -it $(kubectl get pods -l app.kubernetes.io/name=ollama -n kdp-data -o jsonpath='{.items[0].metadata.name}') -n kdp-data -- ollama pull phi3:3.8b
```

Once the pull completes, the phi3 model can be accessed through the Ollama service at `http://ollama:11434`.
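To confirm that the model is available, one option is to query Ollama's REST API, whose `GET /api/tags` endpoint lists the models stored on the server. The sketch below is a minimal example under that assumption; the helper names `model_names` and `list_local_models` are made up here, and the sample payload is hand-written rather than real server output.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://ollama:11434"  # service address used in this guide


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON returned by Ollama's GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 10.0) -> List[str]:
    """Ask the Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Offline illustration with a hand-written payload (sample data, not real server output)
sample = {"models": [{"name": "phi3:3.8b"}]}
print("phi3:3.8b" in model_names(sample))
```

From a JupyterLab notebook inside the cluster, calling `list_local_models()` should return a list that contains `phi3:3.8b` if the pull succeeded.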
### Preparing the Development Environment

Run the following commands in the JupyterLab Terminal:

```bash
# Go to the current user's home directory, which should be mounted on a PV so the environment survives pod restarts
cd ~
# Create a Python virtual environment named vanna
python -m venv vanna
# Activate the virtual environment
source vanna/bin/activate
# Install the required pip packages
pip install -i https://mirrors.aliyun.com/pypi/simple/ 'vanna[milvus,ollama]' pyhive thrift ipykernel ipywidgets
# Add the virtual environment to the JupyterLab kernel list
python -m ipykernel install --user --name=vanna
```

After a short while, a kernel named `vanna` should appear in the JupyterLab Launcher.

![kernel](images/llm-rag-01.jpg)
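Before writing any Vanna code, it can help to check that the notebook really runs in the new kernel and can see the installed packages. The following sketch uses only the standard library; the helper name `missing_modules` and the list of module names are assumptions chosen to match the imports used later in this guide.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names that the notebook code in this guide relies on.
required = ["vanna", "pymilvus", "ollama", "pyhive", "numpy"]
print("Missing modules:", missing_modules(required) or "none")
```

If any module is reported as missing, the notebook is probably attached to a different kernel, or the `pip install` step ran outside the `vanna` virtual environment.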
### Building Text to SQL with Vanna

#### Extending BaseEmbeddingFunction

Because Vanna does not include a built-in embedding function for Milvus by default, we need to extend `BaseEmbeddingFunction`. Create a Notebook that uses the `vanna` kernel and enter the following code:

```python
from typing import List, Optional

import numpy as np
from milvus_model.base import BaseEmbeddingFunction
from ollama import Client
from pymilvus import MilvusClient
from vanna.milvus import Milvus_VectorStore
from vanna.ollama import Ollama


class OllamaEmbeddingFunction(BaseEmbeddingFunction):
    """Embedding function for Milvus that delegates to an Ollama server."""

    def __init__(self, model_name: str, host: Optional[str] = None):
        self.model_name = model_name
        self.client = Client(host=host)

    def __call__(self, texts: List[str]) -> List[np.ndarray]:
        return self._encode(texts)

    def encode_queries(self, queries: List[str]) -> List[np.ndarray]:
        return self._encode(queries)

    def encode_documents(self, documents: List[str]) -> List[np.ndarray]:
        return self._encode(documents)

    def _encode(self, texts: List[str]) -> List[np.ndarray]:
        # One embeddings request per input text
        return [
            np.array(self.client.embeddings(model=self.model_name, prompt=text)['embedding'])
            for text in texts
        ]


class MyVanna(Milvus_VectorStore, Ollama):
    def __init__(self, config: dict):
        # Build the embedding function and the Milvus client, then pass both to Vanna through config
        config['embedding_function'] = OllamaEmbeddingFunction(
            model_name=config['embedding_model'], host=config['ollama_host']
        )
        config['milvus_client'] = MilvusClient(uri=config['milvus_host'])
        Milvus_VectorStore.__init__(self, config=config)
        Ollama.__init__(self, config=config)


vn = MyVanna(config={
    'ollama_host': 'http://ollama:11434',
    'model': 'phi3:3.8b',
    'embedding_model': 'phi3:3.8b',
    'milvus_host': 'http://milvus:19530',
})
```
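The vectors returned by the embedding function are what the vector store compares when it looks for training data related to a question. The sketch below illustrates that comparison with cosine similarity in plain Python. It is a simplified stand-in: the metric Milvus actually applies depends on how the collection is configured, which this guide does not specify, and the three-dimensional vectors and helper names are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float], stored: Dict[str, Sequence[float]]) -> List[str]:
    """Return the stored keys ordered from most to least similar to the query."""
    return sorted(stored, key=lambda k: cosine_similarity(query_vec, stored[k]), reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (made-up numbers)
stored_ddl = {
    "CREATE TABLE orders (id bigint, amount bigint)": [0.9, 0.1, 0.0],
    "CREATE TABLE employees (id bigint, name string)": [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]
ranking = rank_by_similarity(question_vec, stored_ddl)
print(ranking[0])  # the orders table is the closest match
```

Real embeddings have hundreds or thousands of dimensions, but the principle is the same: the stored statement whose vector points in the most similar direction to the question's vector is retrieved first.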
#### Scenario with a Single Table

Assume Hive already contains a table created with the following statement:

```sql
CREATE TABLE IF NOT EXISTS test_table (id bigint, data string);
```

Continue in the Notebook with the following code:

```python
vn.connect_to_hive(host='hive-server2-0.hive-server2.kdp-data.svc.cluster.local',
                   dbname='default',
                   port=10000,
                   auth='NOSASL',
                   user='root')

# Train Vanna on the table structure
vn.train(ddl='CREATE TABLE IF NOT EXISTS test_table (id bigint, data string)')

# Ask a question
# The output should include SQL similar to this:
# SELECT id
# FROM test_table
# ORDER BY data DESC
# LIMIT 3
# along with a chart
vn.ask("What are the top 3 ids of test_table?")
```
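Because the SQL is produced by a model, it can be useful to look at the statement before relying on the result. The sketch below splits the question into two steps using Vanna's `generate_sql` and `run_sql` methods. The wrapper name `ask_step_by_step` is an assumption for this example, and the function expects an already connected Vanna object such as the `vn` created earlier.

```python
from typing import Any, Tuple


def ask_step_by_step(vn: Any, question: str) -> Tuple[str, Any]:
    """Generate SQL first, show it, then execute it and return both pieces."""
    sql = vn.generate_sql(question=question)
    print("Generated SQL:\n", sql)  # review before trusting the result
    result = vn.run_sql(sql)
    return sql, result
```

In the notebook this could be called as `sql, df = ask_step_by_step(vn, "What are the top 3 ids of test_table?")`. If the generated statement references the wrong table or column, adding more training data is usually a better fix than rephrasing the question.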
#### Scenario with Multiple Tables

Vanna provides a sample SQLite database, [Chinook.sqlite](https://vanna.ai/Chinook.sqlite). Download it and upload it to the same JupyterLab directory as the notebook, then run the following code:

```python
vn.connect_to_sqlite('Chinook.sqlite')

# Iterate over all DDL statements to train on the table structures
df_ddl = vn.run_sql("SELECT type, sql FROM sqlite_master WHERE sql IS NOT NULL")
for ddl in df_ddl['sql'].to_list():
    vn.train(ddl=ddl)

# Ask a question
vn.ask(question="What are the top 10 billing countries by total billing?", allow_llm_to_see_data=True)
```
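To see what the `sqlite_master` query feeds into training, the sketch below runs the same query against a small in-memory database using Python's built-in `sqlite3` module. The two tables and the index are made up for illustration and are not part of Chinook.

```python
import sqlite3

# In-memory database with a made-up schema (not Chinook), for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE album (album_id INTEGER PRIMARY KEY, title TEXT, artist_id INTEGER)")
conn.execute("CREATE INDEX idx_album_artist ON album (artist_id)")

# Same query as in the tutorial: one row per schema object that has a definition
rows = conn.execute("SELECT type, sql FROM sqlite_master WHERE sql IS NOT NULL").fetchall()
for object_type, ddl in rows:
    print(object_type, "->", ddl)
conn.close()
```

The result contains the `CREATE INDEX` statement as well as the two `CREATE TABLE` statements, so the training loop receives every stored definition, not only tables. The `WHERE sql IS NOT NULL` filter skips objects that SQLite creates internally, which have no stored definition.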
If you use a different database, adjust the code to train on that database's own CREATE TABLE statements.

For more examples, refer to the [official documentation](https://vanna.ai/docs/).
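As one illustration of adapting the training step to another database, the sketch below wraps Vanna's `train` method, which accepts `ddl`, `documentation`, and `question`/`sql` keyword arguments. The helper name `train_schema` and its parameters are assumptions for this example, and it expects a connected Vanna object such as `vn`.

```python
from typing import Any, Iterable, Optional, Tuple


def train_schema(
    vn: Any,
    ddl_statements: Iterable[str],
    documentation: Optional[str] = None,
    examples: Iterable[Tuple[str, str]] = (),
) -> None:
    """Feed table definitions, optional business notes, and example pairs to Vanna."""
    for ddl in ddl_statements:
        vn.train(ddl=ddl)
    if documentation:
        vn.train(documentation=documentation)
    for question, sql in examples:
        vn.train(question=question, sql=sql)
```

A call might look like `train_schema(vn, ["CREATE TABLE test_table (id bigint, data string)"], documentation="data holds the raw payload", examples=[("How many rows are there?", "SELECT COUNT(*) FROM test_table")])`, where the documentation string and the example pair are hypothetical. Business notes and example pairs tend to matter more as the number of tables grows, because DDL alone does not say what the columns mean.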

docs/en/user-tutorials/tutorials.md

Lines changed: 1 addition & 0 deletions
@@ -23,4 +23,5 @@ Users can refer to the following scenario tutorials to practice how to do data i
  * [Visualization of the maternal and infant shopping data on Taobao](./visualization-analysis-of-taobao's-maternal-and-infant-shopping-data.md)
  * [How to build a real-time data pipeline of sales' on KDP](./Real-time-incremental-data-analysis.md)
  * [Build up your data lake on KDP](./iceberg-quickstart.md)
+ * [Build up LLM RAG applications on KDP](./llm-rag-guide.md)
  * More...
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
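# Large Language Model (LLM) Retrieval-Augmented Generation (RAG) Development Guide

## Large Language Models (LLM) and Retrieval-Augmented Generation (RAG)

A Large Language Model (LLM) is an artificial intelligence model designed to understand and generate human language.

The term LLM usually refers to language models with tens of billions of parameters or more, trained on massive amounts of text data to gain a deep understanding of language. Well-known LLMs developed outside China include GPT-3.5, GPT-4, PaLM, Claude, and LLaMA; well-known models developed in China include Wenxin Yiyan, Xunfei Spark, Tongyi Qianwen, ChatGLM, and Baichuan.

To explore the limits of performance, researchers have trained ever larger language models, such as GPT-3 with 175 billion parameters and PaLM with 540 billion parameters. Although these models use architectures and pre-training tasks similar to those of smaller models (such as BERT with 330 million parameters and GPT-2 with 1.5 billion parameters), they exhibit strikingly different capabilities, especially on complex tasks. These capabilities are known as "emergent abilities". For example, GPT-3 can solve few-shot tasks through in-context learning, while GPT-2 performs poorly at this. The research community therefore coined the name "Large Language Model (LLM)" for these models. A notable application of LLMs is ChatGPT, a bold attempt to apply the GPT series of LLMs to human-like conversation, with fluent and natural results.

LLMs are more capable than traditional language models, yet they may still fail to give accurate answers in some cases. To address the challenges LLMs face in text generation and to improve output quality, researchers proposed a new architecture: Retrieval-Augmented Generation (RAG). RAG retrieves relevant information from a large knowledge base and uses it to guide the LLM during generation, which improves the accuracy and depth of the answers. RAG has been applied successfully in several fields, including question answering, dialogue systems, document summarization, and document generation.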
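## Building RAG Applications in KDP

KDP makes it convenient to combine local or online large models with your own data to build RAG applications. The rest of this guide uses a Text to SQL scenario as an example.

SQL is widely used in data analysis. Although SQL is relatively close to natural language, it still poses barriers for business users:

- They must learn SQL syntax.
- They must understand the table structures and know exactly which business data lives in which tables.

Can a large model, combined with private table structure information, help with data analysis? For example, can we simply ask, "Who sold the most iPhones last month?"

The answer is yes.

To keep development simple, this guide uses [Vanna](https://github.com/vanna-ai/vanna) to implement Text to SQL. If you need more flexibility, consider other tools such as [LangChain](https://github.com/langchain-ai/langchain).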
### Component Dependencies

Install the following components in KDP:

- ollama
- milvus
- jupyterlab

Ollama runs large models locally, Milvus stores the vectorized data, and JupyterLab serves as the development environment.
### Running Large Models Locally

This guide uses [phi3](https://ollama.com/library/phi3) as the example model. Pull it inside the Ollama pod:

```shell
kubectl exec -it $(kubectl get pods -l app.kubernetes.io/name=ollama -n kdp-data -o jsonpath='{.items[0].metadata.name}') -n kdp-data -- ollama pull phi3:3.8b
```

Once the pull completes, the phi3 model can be accessed through the Ollama service at `http://ollama:11434`.
### Preparing the Development Environment

Run the following commands in the JupyterLab Terminal:

```bash
# Go to the current user's home directory, which should be mounted on a PV so the environment survives pod restarts
cd ~
# Create a Python virtual environment named vanna
python -m venv vanna
# Activate the virtual environment
source vanna/bin/activate
# Install the required pip packages
pip install -i https://mirrors.aliyun.com/pypi/simple/ 'vanna[milvus,ollama]' pyhive thrift ipykernel ipywidgets
# Add the virtual environment to the JupyterLab kernel list
python -m ipykernel install --user --name=vanna
```

After a short while, a kernel named `vanna` should appear in the JupyterLab Launcher.

![kernel](images/llm-rag-01.jpg)
### Building Text to SQL with Vanna

#### Extending BaseEmbeddingFunction

Because Vanna does not include a built-in embedding function for Milvus by default, we need to extend `BaseEmbeddingFunction`. Create a Notebook that uses the `vanna` kernel and enter the following code:

```python
from typing import List, Optional

import numpy as np
from milvus_model.base import BaseEmbeddingFunction
from ollama import Client
from pymilvus import MilvusClient
from vanna.milvus import Milvus_VectorStore
from vanna.ollama import Ollama


class OllamaEmbeddingFunction(BaseEmbeddingFunction):
    """Embedding function for Milvus that delegates to an Ollama server."""

    def __init__(self, model_name: str, host: Optional[str] = None):
        self.model_name = model_name
        self.client = Client(host=host)

    def __call__(self, texts: List[str]) -> List[np.ndarray]:
        return self._encode(texts)

    def encode_queries(self, queries: List[str]) -> List[np.ndarray]:
        return self._encode(queries)

    def encode_documents(self, documents: List[str]) -> List[np.ndarray]:
        return self._encode(documents)

    def _encode(self, texts: List[str]) -> List[np.ndarray]:
        # One embeddings request per input text
        return [
            np.array(self.client.embeddings(model=self.model_name, prompt=text)['embedding'])
            for text in texts
        ]


class MyVanna(Milvus_VectorStore, Ollama):
    def __init__(self, config: dict):
        # Build the embedding function and the Milvus client, then pass both to Vanna through config
        config['embedding_function'] = OllamaEmbeddingFunction(
            model_name=config['embedding_model'], host=config['ollama_host']
        )
        config['milvus_client'] = MilvusClient(uri=config['milvus_host'])
        Milvus_VectorStore.__init__(self, config=config)
        Ollama.__init__(self, config=config)


vn = MyVanna(config={
    'ollama_host': 'http://ollama:11434',
    'model': 'phi3:3.8b',
    'embedding_model': 'phi3:3.8b',
    'milvus_host': 'http://milvus:19530',
})
```
#### Scenario with a Single Table

Assume Hive already contains a table created with the following statement:

```sql
CREATE TABLE IF NOT EXISTS test_table (id bigint, data string);
```

Continue in the Notebook with the following code:

```python
vn.connect_to_hive(host='hive-server2-0.hive-server2.kdp-data.svc.cluster.local',
                   dbname='default',
                   port=10000,
                   auth='NOSASL',
                   user='root')

# Train Vanna on the table structure
vn.train(ddl='CREATE TABLE IF NOT EXISTS test_table (id bigint, data string)')

# Ask a question
# The output should include SQL similar to this:
# SELECT id
# FROM test_table
# ORDER BY data DESC
# LIMIT 3
# along with a chart
vn.ask("What are the top 3 ids of test_table?")
```
#### Scenario with Multiple Tables

Vanna provides a sample SQLite database, [Chinook.sqlite](https://vanna.ai/Chinook.sqlite). Download it and upload it to the same JupyterLab directory as the notebook, then run the following code:

```python
vn.connect_to_sqlite('Chinook.sqlite')

# Iterate over all DDL statements to train on the table structures
df_ddl = vn.run_sql("SELECT type, sql FROM sqlite_master WHERE sql IS NOT NULL")
for ddl in df_ddl['sql'].to_list():
    vn.train(ddl=ddl)

# Ask a question
vn.ask(question="What are the top 10 billing countries by total billing?", allow_llm_to_see_data=True)
```

If you use a different database, adjust the code to train on that database's own CREATE TABLE statements.

For more examples, refer to the [official documentation](https://vanna.ai/docs/).

docs/zh/user-tutorials/tutorials.md

Lines changed: 1 addition & 0 deletions
@@ -19,4 +19,5 @@
  * [Visual analysis of Taobao maternal and infant shopping data](./visualization-analysis-of-taobao's-maternal-and-infant-shopping-data.md)
  * [How to build a sales-side real-time data pipeline on KDP](./Real-time-incremental-data-analysis.md)
  * [Build a data lake on KDP](./iceberg-quickstart.md)
+ * [Build LLM RAG applications on KDP](./llm-rag-guide.md)
  * More...
