---
title: Querying S3 Tables with Snowflake
description: In this tutorial, you will learn how to integrate AWS S3 Tables with Snowflake to query Iceberg tables stored in S3 Tables buckets through LocalStack.
template: doc
nav:
  label:
---

## Introduction

In this tutorial, you will explore how to connect Snowflake to AWS S3 Tables locally using LocalStack. S3 Tables is a managed Apache Iceberg table catalog backed by S3 storage, with built-in maintenance features such as automatic compaction and snapshot management.

With LocalStack's Snowflake emulator, you can create catalog integrations that connect to S3 Tables and query Iceberg tables without needing any cloud resources. This integration allows you to:

- Create catalog integrations to connect Snowflake to S3 Tables.
- Query existing Iceberg tables stored in S3 Tables buckets.
- Leverage automatic schema inference from external Iceberg tables.

## Prerequisites

- [`localstack` CLI](/snowflake/getting-started/) with a [`LOCALSTACK_AUTH_TOKEN`](/aws/getting-started/auth-token/)
- [LocalStack for Snowflake](/snowflake/getting-started/)
- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) & [`awslocal` wrapper](/aws/integrations/aws-native-tools/aws-cli/#localstack-aws-cli-awslocal)
- Python 3.10+ with `pyiceberg` and `pyarrow` installed

## Start LocalStack

Start your LocalStack container with the Snowflake emulator enabled:

```bash
export LOCALSTACK_AUTH_TOKEN=<your_auth_token>
localstack start --stack snowflake
```

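If you want to confirm the container is up before moving on, you can query LocalStack's standard health endpoint (an optional check; `/_localstack/health` is the usual LocalStack health route):

```bash
# Returns a JSON summary of the running LocalStack services
curl -s http://localhost:4566/_localstack/health
```
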
## Create S3 Tables resources

Before configuring Snowflake, you need to create S3 Tables resources using the AWS CLI. This includes a table bucket and a namespace.

### Create a table bucket

Create a table bucket to store your Iceberg tables:

```bash
awslocal s3tables create-table-bucket --name my-table-bucket
```

```bash title="Output"
{
    "arn": "arn:aws:s3tables:us-east-1:000000000000:bucket/my-table-bucket"
}
```

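Optionally, you can confirm the bucket exists by listing table buckets with the standard `list-table-buckets` operation:

```bash
# Lists all S3 Tables buckets in the local account
awslocal s3tables list-table-buckets
```
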
### Create a namespace

Create a namespace within the table bucket to organize your tables:

```bash
awslocal s3tables create-namespace \
    --table-bucket-arn arn:aws:s3tables:us-east-1:000000000000:bucket/my-table-bucket \
    --namespace my_namespace
```

```bash title="Output"
{
    "tableBucketARN": "arn:aws:s3tables:us-east-1:000000000000:bucket/my-table-bucket",
    "namespace": [
        "my_namespace"
    ]
}
```

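Similarly, you can list the namespaces in the bucket to verify this step succeeded:

```bash
# Lists the namespaces registered in the table bucket
awslocal s3tables list-namespaces \
    --table-bucket-arn arn:aws:s3tables:us-east-1:000000000000:bucket/my-table-bucket
```
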
## Create and populate a table in S3 Tables

To query data from Snowflake using `CATALOG_TABLE_NAME`, the S3 Tables table must have a defined schema and contain data. Use PyIceberg to create a table with a schema and populate it with sample data.

First, install the required Python packages:

```bash
pip install "pyiceberg[s3fs,pyarrow]" boto3
```

Create a Python script named `setup_s3_tables.py` with the following content:

```python
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, LongType

# Configuration
LOCALSTACK_URL = "http://localhost.localstack.cloud:4566"
S3TABLES_URL = "http://s3tables.localhost.localstack.cloud:4566"
TABLE_BUCKET_NAME = "my-table-bucket"
NAMESPACE = "my_namespace"
TABLE_NAME = "customer_orders"
REGION = "us-east-1"

# Create a PyIceberg REST catalog pointing at the S3 Tables Iceberg REST endpoint
catalog = RestCatalog(
    name="s3tables_catalog",
    uri=f"{S3TABLES_URL}/iceberg",
    warehouse=TABLE_BUCKET_NAME,
    **{
        "s3.region": REGION,
        "s3.endpoint": LOCALSTACK_URL,
        "client.access-key-id": "000000000000",
        "client.secret-access-key": "test",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": REGION,
    },
)

# Define the table schema
schema = Schema(
    NestedField(field_id=1, name="order_id", field_type=StringType(), required=False),
    NestedField(field_id=2, name="customer_name", field_type=StringType(), required=False),
    NestedField(field_id=3, name="amount", field_type=LongType(), required=False),
)

# Create the table in S3 Tables
catalog.create_table(
    identifier=(NAMESPACE, TABLE_NAME),
    schema=schema,
)

print(f"Created table: {NAMESPACE}.{TABLE_NAME}")

# Reload the table to get the latest metadata
table = catalog.load_table((NAMESPACE, TABLE_NAME))

# Populate the table with sample data
data = pa.table({
    "order_id": ["ORD001", "ORD002", "ORD003"],
    "customer_name": ["Alice", "Bob", "Charlie"],
    "amount": [100, 250, 175],
})

table.append(data)
print("Inserted sample data into table")

# Verify the table exists
tables = catalog.list_tables(NAMESPACE)
print(f"Tables in namespace: {tables}")
```

Run the script to create the table and populate it with data:

```bash
python setup_s3_tables.py
```

```bash title="Output"
Created table: my_namespace.customer_orders
Inserted sample data into table
Tables in namespace: [('my_namespace', 'customer_orders')]
```

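If you want to verify the data from the Python side before touching Snowflake, a short read-back sketch using PyIceberg's scan API, appended to the end of `setup_s3_tables.py`, could look like this:

```python
# Read the table back through the same REST catalog to confirm the rows landed
table = catalog.load_table((NAMESPACE, TABLE_NAME))
print(table.scan().to_arrow())
```
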
## Connect to the Snowflake emulator

Connect to the locally running Snowflake emulator using an SQL client of your choice (such as DBeaver). The Snowflake emulator runs on `snowflake.localhost.localstack.cloud`.

You can use the following connection parameters:

| Parameter | Value |
|-----------|-------|
| Host      | `snowflake.localhost.localstack.cloud` |
| User      | `test` |
| Password  | `test` |
| Account   | `test` |
| Warehouse | `test` |

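If you prefer connecting programmatically instead of through a GUI client, a minimal sketch with the `snowflake-connector-python` package might look like the following; the `host` override is what points the connector at the local emulator rather than the Snowflake cloud:

```python
import snowflake.connector

# Point the connector at the LocalStack Snowflake emulator
conn = snowflake.connector.connect(
    user="test",
    password="test",
    account="test",
    warehouse="test",
    host="snowflake.localhost.localstack.cloud",
)

cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
```
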
## Create a catalog integration

Create a catalog integration to connect Snowflake to your S3 Tables bucket. The catalog integration defines how Snowflake connects to the external Iceberg REST catalog provided by S3 Tables.

```sql
CREATE OR REPLACE CATALOG INTEGRATION s3tables_catalog_integration
  CATALOG_SOURCE=ICEBERG_REST
  TABLE_FORMAT=ICEBERG
  CATALOG_NAMESPACE='my_namespace'
  REST_CONFIG=(
    CATALOG_URI='http://s3tables.localhost.localstack.cloud:4566/iceberg'
    CATALOG_NAME='my-table-bucket'
  )
  REST_AUTHENTICATION=(
    TYPE=AWS_SIGV4
    AWS_ACCESS_KEY_ID='000000000000'
    AWS_SECRET_ACCESS_KEY='test'
    AWS_REGION='us-east-1'
    AWS_SERVICE='s3tables'
  )
  ENABLED=TRUE
  REFRESH_INTERVAL_SECONDS=60;
```

In the above query:

- `CATALOG_SOURCE=ICEBERG_REST` specifies that the catalog uses the Iceberg REST protocol.
- `TABLE_FORMAT=ICEBERG` indicates the table format.
- `CATALOG_NAMESPACE='my_namespace'` sets the default namespace to query tables from.
- `REST_CONFIG` configures the connection to the LocalStack S3 Tables REST API endpoint.
- `REST_AUTHENTICATION` configures AWS SigV4 authentication for the S3 Tables service.
- `REFRESH_INTERVAL_SECONDS=60` sets how often Snowflake refreshes metadata from the catalog.

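To confirm the integration was registered, you can list catalog integrations with standard Snowflake syntax (assuming the emulator supports `SHOW CATALOG INTEGRATIONS`; the output columns may differ from the real service):

```sql
SHOW CATALOG INTEGRATIONS LIKE 's3tables_catalog_integration';
```
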
## Create an Iceberg table referencing S3 Tables

Create an Iceberg table in Snowflake that references the existing S3 Tables table using `CATALOG_TABLE_NAME`. The schema is automatically inferred from the external table.

```sql
CREATE OR REPLACE ICEBERG TABLE iceberg_customer_orders
  CATALOG='s3tables_catalog_integration'
  CATALOG_TABLE_NAME='my_namespace.customer_orders'
  AUTO_REFRESH=TRUE;
```

In the above query:

- `CATALOG` references the catalog integration created in the previous step.
- `CATALOG_TABLE_NAME` specifies the fully qualified table name in the format `namespace.table_name`.
- `AUTO_REFRESH=TRUE` enables automatic refresh of table metadata.
- No column definitions are needed, as the schema is inferred from the existing S3 Tables table.

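To inspect the columns that were inferred from the external table, you can describe the new table (again standard Snowflake syntax, assuming emulator support):

```sql
DESCRIBE TABLE iceberg_customer_orders;
```
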
## Query the Iceberg table

You can now query the Iceberg table like any other Snowflake table. The columns are automatically available from the external table.

```sql
SELECT * FROM iceberg_customer_orders;
```

```sql title="Output"
+----------+---------------+--------+
| order_id | customer_name | amount |
+----------+---------------+--------+
| ORD001   | Alice         | 100    |
| ORD002   | Bob           | 250    |
| ORD003   | Charlie       | 175    |
+----------+---------------+--------+
```

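Because the table behaves like any other Snowflake table, ordinary SQL such as filters and aggregations also works against the S3 Tables data; for example:

```sql
-- Aggregate the sample orders by customer
SELECT customer_name, SUM(amount) AS total_amount
FROM iceberg_customer_orders
GROUP BY customer_name
ORDER BY total_amount DESC;
```
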
## Conclusion

In this tutorial, you learned how to integrate AWS S3 Tables with Snowflake using LocalStack. You created S3 Tables resources, populated a table with data using PyIceberg, configured a catalog integration in Snowflake, and queried Iceberg tables stored in S3 Tables buckets using `CATALOG_TABLE_NAME`.

The S3 Tables integration enables you to:

- Query data stored in S3 Tables using familiar Snowflake SQL syntax.
- Leverage automatic schema inference from external Iceberg catalogs.
- Develop and test your data lakehouse integrations locally, without cloud resources.

LocalStack's Snowflake emulator, combined with S3 Tables support, provides a complete local environment for developing and testing multi-platform data analytics workflows.
