
Commit 6e5fc54

Add Databricks and benchmark results for most SQL warehouse options
1 parent 6bf0126 commit 6e5fc54

18 files changed, +1163 −1 lines changed


README.md

Lines changed: 0 additions & 1 deletion
@@ -222,7 +222,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
 - [ ] Azure Synapse
 - [ ] Boilingdata
 - [ ] CockroachDB Serverless
-- [ ] Databricks
 - [ ] DolphinDB
 - [ ] Dremio (without publishing)
 - [ ] DuckDB operating like "Athena" on remote Parquet files

databricks/.env.example

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Databricks Configuration
# Copy this file to .env and fill in your actual values

# Your Databricks workspace hostname (e.g., dbc-xxxxxxxx-xxxx.cloud.databricks.com)
DATABRICKS_SERVER_HOSTNAME=your-workspace-hostname.cloud.databricks.com

# SQL Warehouse HTTP path (found in your SQL Warehouse settings)
# Uncomment the warehouse size you want to use
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id

# Instance type name for results file naming & results machine type label
databricks_instance_type=Large

# Your Databricks personal access token
DATABRICKS_TOKEN=your-databricks-token

# Unity Catalog and Schema names
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet

databricks/NOTES.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
I created each warehouse in the Databricks UI.
Besides the warehouse size, the only other change I made to the default settings was to set the auto-stop time to 5 minutes to save money (the 4X-Large warehouse is very expensive).

Once a warehouse was created, I saved its HTTP path to use in the .env file for each run.

databricks/README.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# Databricks

Databricks is a unified analytics platform built on Apache Spark, offering data warehousing and lakehouse capabilities.

## Setup

1. Create a Databricks workspace and SQL Warehouse
2. Generate a personal access token from your Databricks workspace
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

Required environment variables (a connection sketch follows the list):
- `DATABRICKS_SERVER_HOSTNAME`: Your workspace hostname (e.g., `dbc-xxxxxxxx-xxxx.cloud.databricks.com`)
- `DATABRICKS_HTTP_PATH`: SQL Warehouse path (e.g., `/sql/1.0/warehouses/your-warehouse-id`)
- `DATABRICKS_TOKEN`: Your personal access token
- `databricks_instance_type`: Instance type name for results file naming, e.g., "2X-Large"
- `DATABRICKS_CATALOG`: Unity Catalog name
- `DATABRICKS_SCHEMA`: Schema name
- `DATABRICKS_PARQUET_LOCATION`: S3 path to the parquet file

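For reference, here is a minimal sketch of how these variables might be wired into a Databricks SQL Connector connection. It is an illustration, not the actual code from `benchmark.py`; the `USE CATALOG`/`USE SCHEMA` statements are an assumption about how the session is pointed at the benchmark data.

```python
# Hypothetical connection sketch using the .env values above.
import os

from databricks import sql  # provided by the databricks-sql-connector package

connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)

with connection.cursor() as cursor:
    # Point the session at the benchmark catalog and schema (assumed setup step).
    cursor.execute(f"USE CATALOG {os.environ['DATABRICKS_CATALOG']}")
    cursor.execute(f"USE SCHEMA {os.environ['DATABRICKS_SCHEMA']}")
```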
## Running the Benchmark

```bash
./benchmark.sh
```

## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
   - Creates the catalog and schema
   - Creates the `hits` table with an explicit schema (including TIMESTAMP conversion)
   - Loads data from the parquet file using `INSERT INTO` with type conversions
   - Runs all queries via `run.sh`
   - Collects timing metrics from the Databricks REST API
   - Outputs results to JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from the Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`); see the sketch after this list
5. **queries.sql**: Contains the 43 benchmark queries

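To make the timing lookup concrete, here is a hedged sketch of what the REST call in `query.py` might look like: after a query finishes, its ID is used to fetch server-side metrics from the Query History endpoint named above. How the query ID is obtained and the exact response field (`duration` below) are assumptions, not verified against the source.

```python
# Hypothetical sketch of pulling server-side timing for one finished query.
import os

import requests


def fetch_query_duration_ms(query_id: str) -> int:
    """Fetch a query's server-side duration from the Databricks Query History API.

    The `duration` field name is an assumption; inspect the JSON payload to
    confirm which metric the benchmark actually records.
    """
    host = os.environ["DATABRICKS_SERVER_HOSTNAME"]
    token = os.environ["DATABRICKS_TOKEN"]
    response = requests.get(
        f"https://{host}/api/2.0/sql/history/queries/{query_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["duration"]
```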
## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE); see the sketch below
- The benchmark uses the Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
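As an illustration of the conversions mentioned above, here is a sketch of the kind of SQL the load step might run through the connector. The column names `EventTime` and `EventDate` and the choice of `timestamp_seconds`/`date_add` are assumptions based on the ClickBench `hits` schema; the real `INSERT INTO` statement in `benchmark.py` lists every column explicitly.

```python
# Hypothetical preview of the type conversions applied during the load.
# The actual load wraps similar expressions in an INSERT INTO hits SELECT
# that lists all columns in table order.
import os

PARQUET_LOCATION = os.environ["DATABRICKS_PARQUET_LOCATION"]

PREVIEW_SQL = f"""
SELECT
  timestamp_seconds(EventTime)           AS EventTime,  -- Unix seconds     -> TIMESTAMP
  date_add(DATE '1970-01-01', EventDate) AS EventDate   -- days since epoch -> DATE
FROM parquet.`{PARQUET_LOCATION}`
LIMIT 5
"""

# Reuse the cursor from the connection sketch above, e.g.:
# cursor.execute(PREVIEW_SQL)
# print(cursor.fetchall())
```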

0 commit comments
