> **Note:** This repository was archived by the owner on May 5, 2025, and is now read-only.
[Open-source](LICENSE) PostgreSQL Extension for Automated Data Warehouse Creation
From [@ryw](https://github.com/ryw) 4-18-24:
> This project attempts to implement an idea I can't shake - an auto-data-warehouse extension that uses an LLM to inspect operational Postgres schemas, and sets up automation to create a well-formed data warehouse (whether it's Data Vault, Kimball format, etc., I don't care - just something better than a dumb dev like me would build as a DW: a pile of ingested tables and ad-hoc derivative tables). I don't know if this project will work, but it's kind of fun to start something without certainty of success. And I have wanted this badly for years as a dev + data engineer.
## Project Vision
To create an open-source extension that automates the data warehouse build. We aim to do this within a structured environment that incorporates best practices and harnesses the capabilities of Large Language Model (LLM) technologies.
**Goals:** This extension will enable users to:
All these capabilities will be delivered through a [small set of intuitive functions](extension/docs/sql_functions/readme.md).
## Principles
* Build in public
* Public repo
* Call attention/scrutiny to the work - release every week or two with a blog post or tweet calling attention to the work
* Ship product + demo video + documentation
## Data Vault
We are starting with automation to facilitate a data vault implementation for our data warehouse. This will be a rudimentary raw vault setup, but we hope it will lead to substantial downstream business models.
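
For readers new to Data Vault modeling: a raw vault decomposes source tables into hubs (business keys), links (relationships between keys), and satellites (descriptive attributes over time). As a rough illustration only (the table and column names below are hypothetical, not the extension's actual output), a customer hub in a raw vault might look like:

```SQL
/* Hypothetical raw vault hub table - illustrative only, not the extension's actual DDL. */
CREATE TABLE IF NOT EXISTS hub_customer (
    hub_customer_hk CHAR(32)    PRIMARY KEY,  -- hash of the business key
    customer_bk     TEXT        NOT NULL,     -- business key from the source system
    load_date       TIMESTAMPTZ NOT NULL,     -- when the row entered the vault
    record_source   TEXT        NOT NULL      -- originating source system
);
```

Satellites and links would reference `hub_customer_hk`, keeping descriptive and relationship data separate from the immutable business keys.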
## Timeline
We're currently working on a timeline to define points of success and ensure the smooth integration of new contributors to our project. This includes creating milestones, contributor guidelines, and hosting activities such as webinars and meetups. Stay tuned!
## Installation
We are currently developing a new extension, starting with an initial set of defined [functions](extension/docs/sql_functions/readme.md) and implementing a subset of these functions in a mockup extension. This mockup version features skeletal implementations of some functions, designed just to demonstrate our envisioned capabilities as seen in the demo below. Our demo is divided into two parts: Act 1 and Act 2. If you follow along, I hope this will offer a glimpse of what to expect in the weeks ahead.
If you’re interested in exploring this preliminary version, please follow these steps:
3) Run this Codebase
## Demo: Act 1 - "1-Click Build"
> **Note:** Only use the code presented below; any deviations may cause errors. This demo is for illustrative purposes only and is currently tested with pgrx using the default PostgreSQL 13 instance.
We want to make building a data warehouse easy. If the source tables are well-structured and appropriately named, constructing a data warehouse can be achieved with a single call to the extension.
> **Note:** Installing this extension also installs a couple of sample source tables in the `public` schema, as well as the `pgcrypto` extension.
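
If you want to see what was installed, a standard catalog query will list the tables in the `public` schema. This uses PostgreSQL's built-in `information_schema`; the sample table names themselves are not assumed here:

```SQL
/* List base tables currently present in the public schema. */
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;
```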
2. **Build Data Warehouse**
```SQL
/* Build me a Data Warehouse for tables that are Ready to Deploy */
SELECT auto_dw.go();
```
> **Note:** This will provide a build ID and some helpful function tips. Do not implement these tips at this time. They are for illustrative purposes of future functionality.
3. **Data Warehouse Built**
```SQL
/* Data Warehouse Built - No More Code Required */
```
```mermaid
flowchart LR
    ext -- #10711; --> build["Build Data Warehouse\nauto_dw.go()"]
```

## Demo: Act 2
Sometimes it’s best to get a little push-back when creating a data warehouse, as this supports appropriate data governance. In this instance, a table was not ready to deploy because one of its columns may contain sensitive data and needs to be handled appropriately. Auto DW’s engine understands that the attribute is useful for analysis, but also recognizes that it may need to be treated as sensitive. In this script the user will:
1) **Identify a Skipped Table**
```SQL
/* Identify source tables that were skipped and not integrated into the data warehouse. */
SELECT schema, "table", status, status_response
FROM auto_dw.source_table()
WHERE status_code = 'SKIP';
```
> **Note:** Running this code will provide an understanding of which table was skipped along with a high level reason. You should see the following output from the status_response: “Source Table was skipped as column(s) need additional context. Please run the following SQL query for more information: SELECT schema, table, column, status, status_response FROM auto_dw.source_status_detail() WHERE schema = 'public' AND table = 'customers'.”
2) **Identify the Root Cause**
```SQL
/* Identify the source table column that caused the problem, understand the issue, and a potential solution. */
SELECT schema, "table", "column", status, status_response
FROM auto_dw.source_status_detail()
WHERE schema = 'public' AND "table" = 'customers';
```
> **Note:** Running this code will provide an understanding of which table column was skipped along with a reason in the status_response. You should see the following output: “Requires Attention: Column cannot be appropriately categorized as it may contain sensitive data. Specifically, if the zip is an extended zip it may be considered PII.”
3) **Decide to Institute Some Data Governance Best Practices**
```SQL
/* Altering the column length restricts the acceptance of extended ZIP codes. */
ALTER TABLE customer ALTER COLUMN zip TYPE VARCHAR(5);
```
> **Note:** Here the choice was up to the user to make a change that facilitated LLM understanding of data sensitivity. In this case, limiting the type to VARCHAR(5) will allow the LLM to understand that this column will not contain sensitive information in the future.
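
To confirm the change took effect, a standard `information_schema` query (built into PostgreSQL, not part of the extension) can show the column's new definition:

```SQL
/* Verify the zip column now accepts at most 5 characters. */
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'customer'
  AND column_name = 'zip';
```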
```mermaid
flowchart LR
    Start(("Start")) --> tbl["Identify a Skipped Table\nauto_dw.source_table()"]
    tbl --> col["Identify the Root Cause\nauto_dw.source_column()"]
    col --> DW[("Institute Data Governance\nBest Practices")]
```
**Auto DW Process Flow:** The script highlighted in Act 2 demonstrates that there are several approaches to successfully implementing a data warehouse when using this extension. Below is a BPMN diagram that illustrates these various paths.