# pg_auto_dw
<img src="https://tembo.io/_astro/graphs.CNZLRuSs_Z1YDvaO.webp" style="border-radius: 30px; width: 600px; height: auto;">

[Open-source](LICENSE) PostgreSQL Extension for Automated Data Warehouse Creation

From [@ryw](https://github.com/ryw) 4-18-24:
> This project attempts to implement an idea I can't shake - an auto-data warehouse extension that uses LLM to inspect operational Postgres schemas, and sets up automation to create a well-formed data warehouse (whether it's Data Vault, Kimball format, etc. I don't care - just something better than a dumb dev like me would build as a DW - a pile of ingested tables, and ad-hoc derivative tables). I don't know if this project will work, but kind of fun to start something without certainty of success. But I have wanted this badly for years as a dev + data engineer.
## Project Vision
To create an open-source extension that automates the data warehouse build. We aim to do this within a structured environment that incorporates best practices and harnesses the capabilities of large language model (LLM) technologies.

**Goals:** This extension will enable users to:

All these capabilities will be delivered through a [small set of intuitive functions](extension/docs/sql_functions/readme.md).
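
For a flavor of that interface, here is the call pattern in miniature, using the two functions demonstrated in the demos below (a sketch of intended usage, not a stable API):

```SQL
/* Illustrative call pattern only; see the demos below for context. */
SELECT auto_dw.go();                  -- one call to build the warehouse
SELECT * FROM auto_dw.source_table(); -- review how source tables were handled
```
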

## Principles
* Build in public
  * Public repo
  * Call attention/scrutiny to the work - release every week or two with a blog/tweet calling attention to your work
* Ship product + demo video + documentation

## Data Vault
We are starting with automation to facilitate a data vault implementation for our data warehouse. This will be a rudimentary raw vault setup, but we hope it will lead to substantial downstream business models.
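
For readers new to Data Vault, a raw vault decomposes source tables into hubs (business keys), links (relationships between keys), and satellites (descriptive attributes tracked over time). The sketch below shows the general shape of a hub table; it is generic, illustrative DDL, not this extension's actual output.

```SQL
/* Generic Data Vault hub (illustrative only; not this extension's actual DDL). */
CREATE TABLE hub_customer (
    hub_customer_hk CHAR(64)    NOT NULL PRIMARY KEY, -- hash of the business key
    customer_bk     TEXT        NOT NULL,             -- business key from the source system
    load_ts         TIMESTAMPTZ NOT NULL DEFAULT now(),
    record_source   TEXT        NOT NULL              -- lineage back to the source
);
```
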

## Timeline
We're currently working on a timeline to define points of success and ensure the smooth integration of new contributors to our project. This includes creating milestones and contributor guidelines, and hosting activities such as webinars and meetups. Stay tuned!

## Installation
We are currently developing a new extension, starting with an initial set of defined [functions](extension/docs/sql_functions/readme.md) and implementing a subset of them in a mockup extension. This mockup features skeletal implementations of some functions, designed to demonstrate our envisioned capabilities, as seen in the demo below. The demo is divided into two parts: Act 1 and Act 2. If you follow along, I hope it offers a glimpse of what to expect in the weeks ahead.

If you’re interested in exploring this preliminary version, please follow these steps:

3) Run this Codebase

## Demo: Act 1 - "1-Click Build"
> **Note:** Only use the code presented below. Any deviations may cause errors. This demo is for illustrative purposes only. It is currently tested on PGRX using the default PostgreSQL 13 instance.

We want to make building a data warehouse easy. If the source tables are well-structured and appropriately named, constructing a data warehouse can be achieved with a single call to the extension.

1. **Install Extension**
```SQL
/* Installing Extension - Installs and creates sample source tables. */
CREATE EXTENSION pg_auto_dw CASCADE;
```
> **Note:** Installing this extension also creates a couple of sample source tables in the PUBLIC schema and installs the pgcrypto extension.
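
If you would like to confirm what was installed, a generic catalog query (standard information_schema, not part of the extension) lists the new tables:

```SQL
/* List the sample source tables created in the public schema. */
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
```
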
2. **Build Data Warehouse**
```SQL
/* Build me a Data Warehouse for tables that are Ready to Deploy */
SELECT auto_dw.go();
```
> **Note:** This will provide a build ID and some helpful function tips. Do not implement these tips at this time; they are illustrations of future functionality.

3. **Data Warehouse Built**
```SQL
/* Data Warehouse Built - No More Code Required */
```
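
To peek at the result, a catalog query along these lines should surface the new warehouse objects. The target schema name used here, auto_dw, is an assumption for illustration, not documented behavior:

```SQL
/* Sketch: inspect the warehouse objects the build created.
   The auto_dw schema name is assumed, not documented. */
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'auto_dw'
ORDER BY table_name;
```
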
```mermaid
flowchart LR
    Start(("Start")) --> ext["Install Extension\nCREATE EXTENSION pg_auto_dw"]
    ext -- #10711; --> build["Build Data Warehouse\nauto_dw.go()"]
    build -- #10711; --> DW[("DW Created")]
    DW --> Done(("Done"))
```

## Demo: Act 2 - “Auto Data Governance”
Sometimes it’s best to get a little push-back when creating a data warehouse; that push-back supports appropriate data governance. In this instance, a table was not ready to deploy to the data warehouse because one of its columns may need to be considered sensitive and handled appropriately. Auto DW’s engine understands that the attribute is useful for analysis, but also that it may need to be treated as sensitive. In this script the user will:
1) **Identify a Skipped Table**
```SQL
/* Identify source tables that were skipped and not integrated into the data warehouse. */
SELECT schema, "table", status, status_response
FROM auto_dw.source_table()
WHERE status_code = 'SKIP';
```
> **Note:** Running this code will show which table was skipped, along with a high-level reason. You should see the following output from the status_response: “Source Table was skipped as column(s) need additional context. Please run the following SQL query for more information: SELECT schema, table, column, status, status_response FROM auto_dw.source_status_detail() WHERE schema = 'public' AND table = 'customers'.”
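
For convenience, here is that suggested follow-up as runnable SQL, with the reserved identifiers quoted so the query parses (the column list is taken from the message above):

```SQL
/* The follow-up suggested in status_response, with "table" and "column" quoted. */
SELECT schema, "table", "column", status, status_response
FROM auto_dw.source_status_detail()
WHERE schema = 'public' AND "table" = 'customers';
```
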
2) **Identify the Root Cause**
```SQL
/* Identify the source table column that caused the problem, understand the issue, and see a potential solution. */
SELECT schema, "table", "column", status, confidence_level, status_response
FROM auto_dw.source_column()
WHERE schema = 'PUBLIC' AND "table" = 'CUSTOMER';
```
> **Note:** Running this code will provide an understanding of which table column was skipped along with a reason in the status_response. You should see the following output: “Requires Attention: Column cannot be appropriately categorized as it may contain sensitive data. Specifically, if the zip is an extended zip it may be considered PII.”
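
Before changing anything, you can confirm the column's current definition with a standard catalog query (lowercase identifiers assumed, matching the ALTER statement in the next step):

```SQL
/* Check the current type and length of the flagged column. */
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'customer'
  AND column_name = 'zip';
```
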
3) **Decide to Institute Some Data Governance Best Practices**
```SQL
/* Altering column length restricts the acceptance of extended ZIP codes. */
ALTER TABLE customer ALTER COLUMN zip TYPE VARCHAR(5);
```
> **Note:** Here the choice was up to the user to make a change that facilitated LLM understanding of data sensitivity. In this case, limiting the type to VARCHAR(5) will allow the LLM to understand that this column will not contain sensitive information in the future.
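
From here, the natural next step, assuming the restart loop shown in the process flow below, is to re-run the build so the engine re-evaluates the previously skipped table:

```SQL
/* Sketch: re-run the build after the governance fix. */
SELECT auto_dw.go();
```
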
```mermaid
flowchart LR
    Start(("Start")) --> tbl["Identify a Skipped Table\nauto_dw.source_table()"]
    tbl --> col["Identify the Root Cause\nauto_dw.source_column()"]
    col --> DW[("Institute Data Governance\nBest Practices")]
    DW --> Done(("Done"))
```
**Auto DW Process Flow:** The script highlighted in Act 2 demonstrates that there are several approaches to successfully implementing a data warehouse when using this extension. Below is a BPMN diagram that illustrates these various paths.
```mermaid
flowchart LR
    subgraph functions_informative["Informative Functions"]
    review --> data_gov --> more_auto{"More\nAutomations?"}
    more_auto --> |no| done(("Done"))
    more_auto --> |yes| start_again(("Restart"))
```
