1 change: 0 additions & 1 deletion README.md
@@ -222,7 +222,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] Azure Synapse
- [ ] Boilingdata
- [ ] CockroachDB Serverless
- [ ] Databricks
- [ ] DolphinDB
- [ ] Dremio (without publishing)
- [ ] DuckDB operating like "Athena" on remote Parquet files
22 changes: 22 additions & 0 deletions databricks/.env.example
@@ -0,0 +1,22 @@
# Databricks Configuration
# Copy this file to .env and fill in your actual values

# Your Databricks workspace hostname (e.g., dbc-xxxxxxxx-xxxx.cloud.databricks.com)
DATABRICKS_SERVER_HOSTNAME=your-workspace-hostname.cloud.databricks.com

# SQL Warehouse HTTP path (found in your SQL Warehouse's connection details)
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id

# Warehouse size label, used for the results file name and the machine type field in the results
databricks_instance_type=Large

# Your Databricks personal access token
DATABRICKS_TOKEN=your-databricks-token

# Unity Catalog and Schema names
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
4 changes: 4 additions & 0 deletions databricks/NOTES.md
@@ -0,0 +1,4 @@
I created each warehouse in the Databricks UI.
Besides the warehouse size, the only change I made to the default settings was setting the auto-stop time to 5 minutes to save money (the 4X-Large warehouse is very expensive).

Once a warehouse was created, I saved its HTTP path in the `.env` file for that run.
47 changes: 47 additions & 0 deletions databricks/README.md
@@ -0,0 +1,47 @@
## Setup

1. Create a Databricks workspace and SQL Warehouse
2. Generate a personal access token from your Databricks workspace
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

Required environment variables:
- `DATABRICKS_SERVER_HOSTNAME`: Your workspace hostname (e.g., `dbc-xxxxxxxx-xxxx.cloud.databricks.com`)
- `DATABRICKS_HTTP_PATH`: SQL Warehouse path (e.g., `/sql/1.0/warehouses/your-warehouse-id`)
- `DATABRICKS_TOKEN`: Your personal access token
- `databricks_instance_type`: Instance type name for results file naming, e.g., "2X-Large"
- `DATABRICKS_CATALOG`: Unity Catalog name
- `DATABRICKS_SCHEMA`: Schema name
- `DATABRICKS_PARQUET_LOCATION`: S3 path to the parquet file
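
The variables above can be checked up front so a run fails fast instead of partway through. A minimal sketch (the `load_config` helper is illustrative, not part of the benchmark scripts):

```python
import os

# All variables the benchmark scripts read from .env
REQUIRED_VARS = [
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_TOKEN",
    "databricks_instance_type",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_PARQUET_LOCATION",
]

def load_config(env=os.environ):
    """Collect the required settings, failing fast if any are unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```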

## Running the Benchmark

```bash
./benchmark.sh
```

## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
- Creates the catalog and schema
- Creates the `hits` table with explicit schema (including TIMESTAMP conversion)
- Loads data from the parquet file using `INSERT INTO` with type conversions
- Runs all queries via `run.sh`
- Collects timing metrics from Databricks REST API
- Outputs results to JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries
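
The timing lookup in step 4 boils down to extracting a duration from the query-history JSON. A hedged sketch of that extraction (the `metrics` / `execution_time_ms` field names are assumptions about the response shape, and the HTTP request itself is omitted):

```python
def execution_seconds(history_entry: dict) -> float:
    """Convert a query-history entry's execution time to seconds.

    `history_entry` stands for the JSON object returned for one query by
    /api/2.0/sql/history/queries/{query_id}; the field names used here
    are assumptions, not a documented contract.
    """
    execution_ms = history_entry["metrics"]["execution_time_ms"]
    return execution_ms / 1000.0
```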

## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
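
The timestamp and date conversions mentioned above are plain epoch arithmetic; a sketch of their Python equivalents (the actual load does this in SQL on the warehouse, and the column names in the comments are assumed from the ClickBench `hits` schema):

```python
from datetime import date, datetime, timedelta, timezone

def unix_seconds_to_timestamp(seconds: int) -> datetime:
    """Unix epoch seconds -> UTC timestamp (e.g. an EventTime-style column)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def unix_days_to_date(days: int) -> date:
    """Days since 1970-01-01 -> calendar date (e.g. an EventDate-style column)."""
    return date(1970, 1, 1) + timedelta(days=days)
```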