# Databricks

Databricks is a unified analytics platform built on Apache Spark, offering data warehousing and lakehouse capabilities.

## Setup

1. Create a Databricks workspace and SQL Warehouse
2. Generate a personal access token from your Databricks workspace
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

Required environment variables:
- `DATABRICKS_SERVER_HOSTNAME`: Your workspace hostname (e.g., `dbc-xxxxxxxx-xxxx.cloud.databricks.com`)
- `DATABRICKS_HTTP_PATH`: SQL Warehouse path (e.g., `/sql/1.0/warehouses/your-warehouse-id`)
- `DATABRICKS_TOKEN`: Your personal access token
- `databricks_instance_type`: Instance type name used in the results file name (e.g., `2X-Large`)
- `DATABRICKS_CATALOG`: Unity Catalog name
- `DATABRICKS_SCHEMA`: Schema name
- `DATABRICKS_PARQUET_LOCATION`: S3 path to the parquet file

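The three connection variables are consumed by the Databricks SQL Connector for Python (see Notes). Here is a minimal connectivity sketch, assuming the variables above are already set in the environment:

```python
import os

from databricks import sql

# Open a connection to the SQL Warehouse using the variables from .env;
# nothing is hard-coded here.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        # Sanity check: confirm the warehouse is reachable.
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
```
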
## Running the Benchmark

```bash
./benchmark.sh
```

## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
   - Creates the catalog and schema
   - Creates the `hits` table with an explicit schema (including TIMESTAMP conversion)
   - Loads data from the parquet file using `INSERT INTO` with type conversions (see the sketch after this list)
   - Runs all queries via `run.sh`
   - Collects timing metrics from the Databricks REST API
   - Outputs results as JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from the Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries

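The load step's type conversions might look like the following sketch. This is not the actual code in `benchmark.py`: the column names (`WatchID`, `EventTime`, `EventDate`) and the Spark SQL functions chosen (`timestamp_seconds`, `date_add`) are illustrative assumptions for a source that stores Unix timestamps and day offsets as integers.

```python
import os

from databricks import sql

parquet_path = os.environ["DATABRICKS_PARQUET_LOCATION"]

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        # Hypothetical conversion; the real table has many more columns.
        # timestamp_seconds() turns Unix seconds into TIMESTAMP;
        # date_add() turns a days-since-epoch integer into DATE.
        cursor.execute(f"""
            INSERT INTO hits
            SELECT
                WatchID,
                timestamp_seconds(EventTime) AS EventTime,
                date_add('1970-01-01', EventDate) AS EventDate
            FROM parquet.`{parquet_path}`
        """)
```
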
## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics (see the sketch below)
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses the Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
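
Fetching a server-side timing might look like the sketch below. It assumes a standard bearer-token call to the Query History endpoint named above; the exact response fields used here (`duration`, `metrics.execution_time_ms`) are assumptions and should be checked against the actual API response:

```python
import os

import requests


def fetch_query_duration_ms(query_id: str) -> int | None:
    """Fetch the server-side duration of a finished query, in milliseconds."""
    host = os.environ["DATABRICKS_SERVER_HOSTNAME"]
    token = os.environ["DATABRICKS_TOKEN"]

    response = requests.get(
        f"https://{host}/api/2.0/sql/history/queries/{query_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    info = response.json()

    # Assumed response shape: prefer the fine-grained execution metric
    # when present, fall back to the overall wall-clock duration.
    metrics = info.get("metrics", {})
    return metrics.get("execution_time_ms", info.get("duration"))
```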