A demonstration project for the Manchester Databricks User Group showing how to run local unit tests using the official Databricks Runtime Docker image.
Testing PySpark code locally with the exact same runtime as production ensures:
- No surprises from version mismatches
- Fast feedback loops during development
- CI pipelines that catch issues before deployment
You'll need:
- Docker
- VS Code with the Dev Containers extension
- Clone the repo
- Open in VS Code
- Click "Reopen in Container" when prompted
- Run the tests:

```shell
pytest -v
```
That's it. You're running tests against Databricks Runtime 17.3 LTS (Spark 4.0.0) locally.
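Under the hood, the devcontainer config simply points VS Code at the same Dockerfile. A minimal sketch of what `.devcontainer/devcontainer.json` might contain (the `INSTALL_MODE` value and the `postCreateCommand` shown are assumptions based on the dual-mode Dockerfile described in this README, not the repo's exact file):

```json
{
  "name": "local-pyspark-testing",
  "build": {
    "dockerfile": "../Dockerfile",
    "args": { "INSTALL_MODE": "dev" }
  },
  "postCreateCommand": "uv pip install --system --break-system-packages --editable ."
}
```

Because `INSTALL_MODE` is not `wheel`, the Dockerfile's wheel-install step is skipped and the `postCreateCommand` performs an editable install instead, so source edits take effect without rebuilding the image.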
```
├── src/local_pyspark_testing/
│   ├── environment.py         # Spark session factory (local vs Databricks)
│   ├── transforms.py          # UDFs using pycountry
│   └── jobs/                  # Pipeline entry points
│       ├── bronze_to_silver.py
│       └── silver_to_gold.py
├── tests/
│   └── test_transforms.py     # DataFrame-based unit tests
├── resources/                 # Databricks Asset Bundle configs
│   ├── clusters.yml
│   └── jobs.yml
├── Dockerfile                 # Dual-mode: CI wheel install or dev editable
└── .devcontainer/             # VS Code devcontainer config
```
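The pycountry-backed UDF in `transforms.py` boils down to a code-to-name lookup. A minimal sketch of that core logic (the helper name and the inline mapping are illustrative stand-ins; the real code resolves names via pycountry and wraps the lookup as a PySpark UDF):

```python
def lookup_country_name(code: str):
    """Map an ISO 3166-1 alpha-2 code to a country name.

    A tiny inline mapping keeps this sketch self-contained;
    the actual project looks names up with pycountry.
    """
    countries = {"GB": "United Kingdom", "DE": "Germany", "FR": "France"}
    return countries.get(code)  # None for unknown codes
```

In the repo, logic like this would be registered with `pyspark.sql.functions.udf` so it can be applied to a DataFrame column as `country_code_to_name`.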
The same Dockerfile serves both CI and local development:
```dockerfile
ARG INSTALL_MODE=wheel

# CI: installs pre-built wheel
# Dev: skips; postCreateCommand handles editable install
RUN if [ "$INSTALL_MODE" = "wheel" ]; then \
        uv pip install --system --break-system-packages *.whl pytest; \
    fi
```

`environment.py` creates the right Spark session based on context:
```python
def get_spark() -> SparkSession:
    if os.environ.get("LOCAL_SPARK") == "true":
        return _create_local_spark()  # Optimized for testing
    return _create_databricks_spark()  # Uses existing cluster session
```

The CI pipeline has four stages:
- Build - `uv build --wheel` creates the package
- Test - Docker runs pytest against the wheel
- Push - Image pushed to GitHub Container Registry
- Deploy - Databricks Asset Bundles updates clusters and jobs
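The stages above could be sketched as a GitHub Actions workflow. Everything here is illustrative (job names, action versions, and the `<owner>` image path are assumptions, and runner setup steps are elided), not the repo's actual pipeline:

```yaml
name: ci
on: [push]
jobs:
  build-test-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build the wheel, bake it into the image, run tests inside it
      - run: uv build --wheel
      - run: docker build --build-arg INSTALL_MODE=wheel -t ghcr.io/<owner>/local-pyspark-testing .
      - run: docker run ghcr.io/<owner>/local-pyspark-testing pytest -v
      # Push only after tests pass (assumes a prior docker login to ghcr.io)
      - run: docker push ghcr.io/<owner>/local-pyspark-testing
  deploy:
    needs: build-test-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: databricks bundle deploy
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```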
Tests use `assertDataFrameEqual` for DataFrame comparisons:

```python
def test_converts_valid_codes(self, spark_session):
    df = spark_session.createDataFrame([("GB",), ("DE",)], ["country_code"])
    result = df.withColumn("country_name", country_code_to_name("country_code"))
    expected = spark_session.createDataFrame(
        [("GB", "United Kingdom"), ("DE", "Germany")],
        ["country_code", "country_name"],
    )
    assertDataFrameEqual(result, expected)
```

Deployment requires two environment variables:
- `DATABRICKS_HOST` - Workspace URL
- `DATABRICKS_TOKEN` - Personal access token
Enable Container Services in Admin Settings (or via the CLI):

```shell
databricks workspace-conf set-status --json '{"enableDcs": "true"}'
```

MIT