1 change: 0 additions & 1 deletion README.md
@@ -222,7 +222,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] Azure Synapse
- [ ] Boilingdata
- [ ] CockroachDB Serverless
- [ ] Databricks
- [ ] DolphinDB
- [ ] Dremio (without publishing)
- [ ] DuckDB operating like "Athena" on remote Parquet files
22 changes: 22 additions & 0 deletions databricks/.env.example
@@ -0,0 +1,22 @@
# Databricks Configuration
# Copy this file to .env and fill in your actual values

# Your Databricks workspace hostname (e.g., dbc-xxxxxxxx-xxxx.cloud.databricks.com)
DATABRICKS_SERVER_HOSTNAME=your-workspace-hostname.cloud.databricks.com

# SQL Warehouse HTTP path (found in your SQL Warehouse's connection details)
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id

# Warehouse size label, used for the results file name and the machine type field in the results
databricks_instance_type=Large

# Your Databricks personal access token
DATABRICKS_TOKEN=your-databricks-token

# Unity Catalog and Schema names
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
4 changes: 4 additions & 0 deletions databricks/NOTES.md
@@ -0,0 +1,4 @@
I created each warehouse in the Databricks UI.
Besides the warehouse size, the only change I made to the default settings was setting the auto-stop time to 5 minutes to save money (the 4X-Large warehouse is very expensive).

Once a warehouse was created, I saved its HTTP path in the `.env` file for that run.
47 changes: 47 additions & 0 deletions databricks/README.md
@@ -0,0 +1,47 @@
## Setup

1. Create a Databricks workspace and SQL Warehouse
2. Generate a personal access token from your Databricks workspace
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

Required environment variables:
- `DATABRICKS_SERVER_HOSTNAME`: Your workspace hostname (e.g., `dbc-xxxxxxxx-xxxx.cloud.databricks.com`)
- `DATABRICKS_HTTP_PATH`: SQL Warehouse path (e.g., `/sql/1.0/warehouses/your-warehouse-id`)
- `DATABRICKS_TOKEN`: Your personal access token
- `databricks_instance_type`: Instance type name for results file naming, e.g., "2X-Large"
- `DATABRICKS_CATALOG`: Unity Catalog name
- `DATABRICKS_SCHEMA`: Schema name
- `DATABRICKS_PARQUET_LOCATION`: S3 path to the parquet file
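
The variables above can be checked up front so a run fails fast instead of partway through. A minimal sketch (the `load_config` helper is illustrative, not part of the benchmark scripts):

```python
import os

# All variables the benchmark scripts read from .env
REQUIRED_VARS = [
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_TOKEN",
    "databricks_instance_type",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_PARQUET_LOCATION",
]

def load_config(env=os.environ):
    """Collect the required settings, failing fast if any are unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```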

## Running the Benchmark

```bash
./benchmark.sh
```

## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
- Creates the catalog and schema
- Creates the `hits` table with explicit schema (including TIMESTAMP conversion)
- Loads data from the parquet file using `INSERT INTO` with type conversions
- Runs all queries via `run.sh`
- Collects timing metrics from Databricks REST API
- Outputs results to JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries
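
The timing lookup in step 4 boils down to extracting a duration from the query-history JSON. A hedged sketch of that extraction (the `metrics` / `execution_time_ms` field names are assumptions about the response shape, and the HTTP request itself is omitted):

```python
def execution_seconds(history_entry: dict) -> float:
    """Convert a query-history entry's execution time to seconds.

    `history_entry` stands for the JSON object returned for one query by
    /api/2.0/sql/history/queries/{query_id}; the field names used here
    are assumptions, not a documented contract.
    """
    execution_ms = history_entry["metrics"]["execution_time_ms"]
    return execution_ms / 1000.0
```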

## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
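
The timestamp and date conversions mentioned above are plain epoch arithmetic; a sketch of their Python equivalents (the actual load does this in SQL on the warehouse, and the column names in the comments are assumed from the ClickBench `hits` schema):

```python
from datetime import date, datetime, timedelta, timezone

def unix_seconds_to_timestamp(seconds: int) -> datetime:
    """Unix epoch seconds -> UTC timestamp (e.g. an EventTime-style column)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def unix_days_to_date(days: int) -> date:
    """Days since 1970-01-01 -> calendar date (e.g. an EventDate-style column)."""
    return date(1970, 1, 1) + timedelta(days=days)
```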