
Commit 8f88615

Added more docs (#44)
1 parent 4e8da3a commit 8f88615

3 files changed: +177 −66 lines


README.md

Lines changed: 126 additions & 61 deletions
@@ -1,92 +1,157 @@
-# databricks-labs-lsql
+Databricks Labs LSQL
+===

-[![PyPI - Version](https://img.shields.io/pypi/v/databricks-labs-lightsql.svg)](https://pypi.org/project/databricks-labs-lightsql)
-[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/databricks-labs-lightsql.svg)](https://pypi.org/project/databricks-labs-lightsql)
-
------
+[![PyPI - Version](https://img.shields.io/pypi/v/databricks-labs-lsql.svg)](https://pypi.org/project/databricks-labs-lsql)
+[![build](https://github.com/databrickslabs/lsql/actions/workflows/push.yml/badge.svg)](https://github.com/databrickslabs/lsql/actions/workflows/push.yml) [![codecov](https://codecov.io/github/databrickslabs/lsql/graph/badge.svg?token=p0WKAfW5HQ)](https://codecov.io/github/databrickslabs/lsql) [![lines of code](https://tokei.rs/b1/github/databrickslabs/lsql)](https://github.com/databrickslabs/lsql)

Execute SQL statements in a stateless manner.

-## Installation
+<!-- TOC -->
+* [Databricks Labs LSQL](#databricks-labs-lsql)
+* [Installation](#installation)
+* [Executing SQL](#executing-sql)
+* [Iterating over results](#iterating-over-results)
+* [Executing without iterating](#executing-without-iterating)
+* [Fetching one record](#fetching-one-record)
+* [Fetching one value](#fetching-one-value)
+* [Parameters](#parameters)
+* [SQL backend abstraction](#sql-backend-abstraction)
+* [Project Support](#project-support)
+<!-- TOC -->
+
+# Installation

```console
pip install databricks-labs-lsql
```

-## Executing SQL
+[[back to top](#databricks-labs-lsql)]
+
+# Executing SQL

-Primary use-case of :py:meth:`iterate_rows` and :py:meth:`execute` methods is oriented at executing SQL queries in
+The primary use case of the `fetch_all` and `execute` methods is executing SQL queries in
a stateless manner straight away from Databricks SDK for Python, without requiring any external dependencies.
Results are fetched in JSON format through presigned external links. This is perfect for serverless applications
like AWS Lambda, Azure Functions, or any other containerised short-lived applications, where container startup
time is faster with the smaller dependency set.

-for (pickup_zip, dropoff_zip) in w.statement_execution.iterate_rows(warehouse_id,
-    'SELECT pickup_zip, dropoff_zip FROM nyctaxi.trips LIMIT 10', catalog='samples'):
-  print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
+Applications that need a more traditional Python SQL API with cursors, efficient data transfer of hundreds of
+megabytes or gigabytes of data serialized in Apache Arrow format, and low result-fetching latency should use
+the stateful Databricks SQL Connector for Python.

-Method :py:meth:`iterate_rows` returns an iterator of objects, that resemble :class:`pyspark.sql.Row` APIs, but full
-compatibility is not the goal of this implementation.
+The constructor and most of the methods accept [common parameters](#parameters).

-iterate_rows = functools.partial(w.statement_execution.iterate_rows, warehouse_id, catalog='samples')
-for row in iterate_rows('SELECT * FROM nyctaxi.trips LIMIT 10'):
-    pickup_time, dropoff_time = row[0], row[1]
-    pickup_zip = row.pickup_zip
-    dropoff_zip = row['dropoff_zip']
-    all_fields = row.as_dict()
-    print(f'{pickup_zip}@{pickup_time} -> {dropoff_zip}@{dropoff_time}: {all_fields}')
+```python
+from databricks.sdk import WorkspaceClient
+from databricks.labs.lsql.core import StatementExecutionExt
+w = WorkspaceClient()
+see = StatementExecutionExt(w)
+for (pickup_zip, dropoff_zip) in see('SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips LIMIT 10'):
+    print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
+```

-When you only need to execute the query and have no need to iterate over results, use the :py:meth:`execute`.
+[[back to top](#databricks-labs-lsql)]

-w.statement_execution.execute(warehouse_id, 'CREATE TABLE foo AS SELECT * FROM range(10)')
+## Iterating over results

-## Working with dataclasses
+The `fetch_all` method returns an iterator of objects that resemble the `pyspark.sql.Row` API, but full
+compatibility is not a goal of this implementation. The method accepts [common parameters](#parameters).

-This framework allows for mapping with strongly-typed dataclasses between SQL and Python runtime.
+```python
+import os
+from databricks.sdk import WorkspaceClient
+from databricks.labs.lsql.core import StatementExecutionExt

-It handles the schema creation logic purely from Python datastructure.
+results = []
+w = WorkspaceClient()
+see = StatementExecutionExt(w, warehouse_id=os.environ.get("TEST_DEFAULT_WAREHOUSE_ID"))
+for pickup_zip, dropoff_zip in see.fetch_all("SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips LIMIT 10"):
+    results.append((pickup_zip, dropoff_zip))
+```

-## Mocking for unit tests
+[[back to top](#databricks-labs-lsql)]

-This includes a lightweight framework to map between dataclass instances and different SQL execution backends:
-- `MockBackend` used for unit testing
-- `RuntimeBackend` used for execution within Databricks Runtime
-- `StatementExecutionBackend` used for reading/writing records purely through REST API
+## Executing without iterating

-## Pick the library that you need
+When you only need to execute the query and have no need to iterate over results, use the `execute` method,
+which accepts [common parameters](#parameters).

-_Simple applications_, like AWS Lambdas or Azure Functions, and scripts, that are **constrained by the size of external
-dependencies** or **cannot depend on compiled libraries**, like `pyarrow` (88M), `pandas` (71M), `numpy` (60M),
-`libarrow` (41M), `cygrpc` (30M), `libopenblas64` (22M), **need less than 5M of dependencies** (see [detailed report](docs/comparison.md)),
-experience the [Unified Authentication](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication),
-and **work only with Databricks SQL Warehouses**, should use this library.
+```python
+from databricks.sdk import WorkspaceClient
+from databricks.labs.lsql.core import StatementExecutionExt

-Applications, that need the full power of Databricks Runtime locally with the full velocity of PySpark SDL, experience
-the [Unified Authentication](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication)
-across all Databricks tools, efficient data transfer serialized in Apache Arrow format, and low result fetching latency,
-should use the stateful [Databricks Connect 2.x](https://docs.databricks.com/en/dev-tools/databricks-connect/index.html).
+w = WorkspaceClient()
+see = StatementExecutionExt(w)
+see.execute("CREATE TABLE foo AS SELECT * FROM range(10)")
+```

-Applications, that need to a more traditional SQL Python APIs with cursors, efficient data transfer of hundreds of
-megabytes or gigabytes of data serialized in Apache Arrow format, and low result fetching latency, should use
-the stateful [Databricks SQL Connector for Python](https://docs.databricks.com/en/dev-tools/python-sql-connector.html).
-
-| ... | Databricks Connect 2.x | Databricks SQL Connector | PyODBC + ODBC Driver | Databricks Labs LightSQL |
-|-----|-----|-----|-----|-----|
-| Light-weight mocking | no | no | no | **yes** |
-| Extended support for dataclasses | limited | no | no | **yes** |
-| Strengths | almost Databricks Runtime, but locally | works with Python ecosystem | works with ODBC ecosystem | **tiny** |
-| Compressed size | 60M | 51M (85%) | 44M (73.3%) | **0.8M (1.3%)** |
-| Uncompressed size | 312M | 280M (89.7%) | ? | **30M (9.6%)** |
-| Direct dependencies | 23 | 14 | 2 | **1** (Python SDK) |
-| Unified Authentication | yes (via Python SDK) | no | no | **yes** (via Python SDK) |
-| Works with | Databricks Clusters only | Databricks Clusters and Databricks SQL Warehouses | Databricks Clusters and Databricks SQL Warehouses | **Databricks SQL Warehouses only** |
-| Full equivalent of Databricks Runtime | yes | no | no | **no** |
-| Efficient memory usage via Apache Arrow | yes | yes | yes | **no** |
-| Connection handling | stateful | stateful | stateful | **stateless** |
-| Official | yes | yes | yes | **no** |
-| Version checked | 14.0.1 | 2.9.3 | driver v2.7.5 | 0.1.0 |
-
-## Project Support
+[[back to top](#databricks-labs-lsql)]
+
+## Fetching one record
+
+The `fetch_one` method returns a single record from the result set. If the result set is empty, it returns `None`.
+If the result set contains more than one record, it raises `ValueError`.
+
+```python
+from databricks.sdk import WorkspaceClient
+from databricks.labs.lsql.core import StatementExecutionExt
+
+w = WorkspaceClient()
+see = StatementExecutionExt(w)
+pickup_zip, dropoff_zip = see.fetch_one("SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips LIMIT 1")
+print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
+```
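Because `fetch_one` enforces the single-record contract described above, a query that can return more rows needs either a `LIMIT 1` or error handling. A minimal sketch of the failure path, based only on the behaviour stated in the paragraph above:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.lsql.core import StatementExecutionExt

w = WorkspaceClient()
see = StatementExecutionExt(w)
try:
    # Documented above to raise ValueError when more than one record comes back.
    see.fetch_one("SELECT pickup_zip FROM samples.nyctaxi.trips LIMIT 2")
except ValueError:
    print("query returned more than one record; add LIMIT 1 or aggregate")
```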
+
+[[back to top](#databricks-labs-lsql)]
+
+## Fetching one value
+
+The `fetch_value` method returns a single value from the result set. If the result set is empty, it returns `None`.
+
+```python
+from databricks.sdk import WorkspaceClient
+from databricks.labs.lsql.core import StatementExecutionExt
+
+w = WorkspaceClient()
+see = StatementExecutionExt(w)
+count = see.fetch_value("SELECT COUNT(*) FROM samples.nyctaxi.trips")
+print(f'count={count}')
+```
+
+[[back to top](#databricks-labs-lsql)]
+
+## Parameters
+
+* `warehouse_id` (str, optional) - Warehouse upon which to execute a statement. If not given, it will use the warehouse specified in the constructor or the first available warehouse that is not in the `DELETED` or `DELETING` state.
+* `byte_limit` (int, optional) - Applies the given byte limit to the statement's result size. Byte counts are based on internal representations and may not match measurable sizes in the JSON format.
+* `catalog` (str, optional) - Sets default catalog for statement execution, similar to `USE CATALOG` in SQL. If not given, it will use the default catalog or the catalog specified in the constructor.
+* `schema` (str, optional) - Sets default schema for statement execution, similar to `USE SCHEMA` in SQL. If not given, it will use the default schema or the schema specified in the constructor.
+* `timeout` (timedelta, optional) - Timeout after which the query is cancelled. If timeout is less than 50 seconds, it is handled on the server side. If the timeout is greater than 50 seconds, Databricks SDK for Python cancels the statement execution and throws `TimeoutError`. If not given, it will use the timeout specified in the constructor.
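These parameters can be set once on the constructor or per call. A minimal sketch of both styles, assuming the per-call keyword names mirror the bullet list above; the warehouse id and table names are placeholders:

```python
from datetime import timedelta

from databricks.sdk import WorkspaceClient
from databricks.labs.lsql.core import StatementExecutionExt

w = WorkspaceClient()

# Constructor-level defaults: every statement runs on this warehouse and
# resolves unqualified table names against samples.nyctaxi.
see = StatementExecutionExt(w, warehouse_id="abc123", catalog="samples", schema="nyctaxi")

# Per-call overrides, assumed to accept the same keywords as the constructor.
rows = list(see.fetch_all(
    "SELECT pickup_zip FROM trips LIMIT 10",
    catalog="samples",
    schema="nyctaxi",
    timeout=timedelta(minutes=2),  # over 50 seconds, so the SDK enforces it client-side
))
```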
+
+[[back to top](#databricks-labs-lsql)]
+
+# SQL backend abstraction
+
+This framework maps strongly-typed dataclasses between SQL and the Python runtime. It handles the schema
+creation logic purely from Python data structures.
+
+`SqlBackend` defines the methods that any SQL backend used by the library must implement. These methods
+execute SQL statements, fetch their results, and save data to tables. Available backends are:
+
+- `StatementExecutionBackend` used for reading/writing records purely through the REST API
+- `DatabricksConnectBackend` used for reading/writing records through Databricks Connect
+- `RuntimeBackend` used for execution within Databricks Runtime
+- `MockBackend` used for unit testing
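For unit tests, `MockBackend` stands in for a real warehouse. A rough sketch of how a test might use it, assuming it lives in `databricks.labs.lsql.backends` alongside the other backends and exposes a `rows_written_for` helper for assertions; both of those details are assumptions, not confirmed by this document:

```python
from dataclasses import dataclass

from databricks.labs.lsql.backends import MockBackend  # assumed module path


@dataclass
class Trip:
    pickup_zip: int
    dropoff_zip: int


def test_writes_trips():
    backend = MockBackend()
    # Code under test would receive `backend` instead of a real SqlBackend.
    backend.save_table("main.nyctaxi.trips_copy", [Trip(10001, 10002)], Trip)
    # Hypothetical assertion helper; adjust to the actual MockBackend API.
    assert backend.rows_written_for("main.nyctaxi.trips_copy", "append")
```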
+
+Common methods are:
+- `execute(str)` - Execute a SQL statement and wait until it finishes
+- `fetch(str)` - Execute a SQL statement and iterate over all results
+- `save_table(full_name: str, rows: Sequence[DataclassInstance], klass: Dataclass)` - Save a sequence of dataclass instances to a table
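Putting the three methods together, a minimal sketch with `StatementExecutionBackend`, assuming it is importable from `databricks.labs.lsql.backends` and that its constructor takes a `WorkspaceClient` and a warehouse id; the catalog, schema, and table names are placeholders:

```python
import os
from dataclasses import dataclass

from databricks.sdk import WorkspaceClient
from databricks.labs.lsql.backends import StatementExecutionBackend  # assumed module path


@dataclass
class Trip:
    pickup_zip: int
    dropoff_zip: int


w = WorkspaceClient()
backend = StatementExecutionBackend(w, os.environ["TEST_DEFAULT_WAREHOUSE_ID"])

# execute: run DDL and wait for it to finish
backend.execute("CREATE SCHEMA IF NOT EXISTS main.lsql_demo")

# save_table: the table schema is derived from the Trip dataclass
backend.save_table("main.lsql_demo.trips", [Trip(10001, 10002), Trip(10003, 10004)], Trip)

# fetch: iterate over Row-like results
for row in backend.fetch("SELECT * FROM main.lsql_demo.trips"):
    print(row)
```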
+
+[[back to top](#databricks-labs-lsql)]
+
+# Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

docs/comparison.md

Lines changed: 43 additions & 5 deletions
@@ -1,5 +1,45 @@
# Library size comparison

+<!-- TOC -->
+* [Library size comparison](#library-size-comparison)
+* [Pick the library that you need](#pick-the-library-that-you-need)
+* [Databricks Connect](#databricks-connect)
+* [Databricks SQL Connector](#databricks-sql-connector)
+* [Databricks Labs LightSQL](#databricks-labs-lightsql)
+<!-- TOC -->
+
+## Pick the library that you need
+
+_Simple applications_, like AWS Lambdas or Azure Functions, and scripts that are **constrained by the size of external
+dependencies** or **cannot depend on compiled libraries** like `pyarrow` (88M), `pandas` (71M), `numpy` (60M),
+`libarrow` (41M), `cygrpc` (30M), or `libopenblas64` (22M), **need less than 5M of dependencies** (see the detailed report below),
+experience the [Unified Authentication](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication), and **work only with Databricks SQL Warehouses**, should use this library.
+
+Applications that need the full power of Databricks Runtime locally with the full velocity of PySpark, experience
+the [Unified Authentication](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication)
+across all Databricks tools, efficient data transfer serialized in Apache Arrow format, and low result-fetching latency,
+should use the stateful [Databricks Connect 2.x](https://docs.databricks.com/en/dev-tools/databricks-connect/index.html).
+
+Applications that need a more traditional Python SQL API with cursors, efficient data transfer of hundreds of
+megabytes or gigabytes of data serialized in Apache Arrow format, and low result-fetching latency, should use
+the stateful [Databricks SQL Connector for Python](https://docs.databricks.com/en/dev-tools/python-sql-connector.html).
+
+| ... | Databricks Connect 2.x | Databricks SQL Connector | PyODBC + ODBC Driver | Databricks Labs LightSQL |
+|-----|-----|-----|-----|-----|
+| Light-weight mocking | no | no | no | **yes** |
+| Extended support for dataclasses | limited | no | no | **yes** |
+| Strengths | almost Databricks Runtime, but locally | works with Python ecosystem | works with ODBC ecosystem | **tiny** |
+| Compressed size | 60M | 51M (85%) | 44M (73.3%) | **0.8M (1.3%)** |
+| Uncompressed size | 312M | 280M (89.7%) | ? | **30M (9.6%)** |
+| Direct dependencies | 23 | 14 | 2 | **1** (Python SDK) |
+| Unified Authentication | yes (via Python SDK) | no | no | **yes** (via Python SDK) |
+| Works with | Databricks Clusters only | Databricks Clusters and Databricks SQL Warehouses | Databricks Clusters and Databricks SQL Warehouses | **Databricks SQL Warehouses only** |
+| Full equivalent of Databricks Runtime | yes | no | no | **no** |
+| Efficient memory usage via Apache Arrow | yes | yes | yes | **no** |
+| Connection handling | stateful | stateful | stateful | **stateless** |
+| Official | yes | yes | yes | **no** |
+| Version checked | 14.0.1 | 2.9.3 | driver v2.7.5 | 0.1.0 |
+

## Databricks Connect

@@ -67,7 +107,7 @@ Direct dependencies 23
```


-### Databricks SQL Connector
+## Databricks SQL Connector

Compressed:

@@ -138,11 +178,9 @@ Compressed:

```shell
$ cd $(mktemp -d) && pip3 wheel databricks-sdk && echo "All wheels $(du -hs)" && echo "1Mb+ wheels: $(find . -type f -size +1M | xargs du -h | sort -h -r)" && cd -
-All wheels:
-816K
-
+All wheels 1.8M .
1Mb+ wheels:
-0
+~
```

Uncompressed:
