A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, CI/CD automation, Databricks Asset Bundles, and DQX data quality framework.
This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It aims to bring software engineering best practices—such as modular architecture, automated testing, and CI/CD—into the world of data engineering. By combining a clean project structure with robust development and deployment workflows, this template helps teams move faster with confidence.
You’re encouraged to adapt the structure and tooling to suit your project’s specific needs and environment.
Interested in bringing these principles into your own project? Let’s connect on LinkedIn.
- Databricks Free Edition (Serverless)
- PySpark 3.4+
- Databricks Asset Bundles
- Databricks DQX
- Databricks Jobs
- Databricks Unity Catalog
- Python 3.10+
- GitHub Actions
- Pytest
This project template demonstrates how to:
- structure PySpark code inside classes/packages (see the sketch after this list).
- structure unit tests for the data transformations and set up VS Code to run them on your local machine.
- structure integration tests to be executed in different environments/catalogs.
- utilize Databricks Asset Bundles to package/deploy/run a Python wheel package on Databricks.
- utilize Databricks DQX to define and enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation.
- utilize a medallion architecture pattern.
- package and deploy code to different environments (dev, staging, prod) using a CI/CD pipeline with GitHub Actions.
- isolate "dev" environments/catalogs to avoid concurrency issues between developers testing jobs.
- configure the workflow to run in different environments with different parameters using the Jinja templating package.
- configure the workflow to run tasks selectively.
- lint and format code with ruff and pre-commit.
- use a Makefile to automate repetitive tasks.
- utilize pipenv/Pipfile to prepare local and remote environments.
- utilize pytest package to run unit tests on transformations and generate test coverage reports.
- utilize the argparse package to build a flexible command-line interface to start the jobs.
- utilize the funcy package to log the execution time of each transformation.
- utilize the Databricks SDK for Python to manage workspaces and accounts. The sample script enables metastore system tables with relevant data about billing, usage, lineage, prices, and access.
- utilize Databricks Unity Catalog to get data lineage for your tables and columns and a simplified permission model for your data.
- utilize Databricks Lakeflow Jobs to execute a DAG and task parameters to share context information between tasks (see the Task Parameters section). Yes, you don't need Airflow to manage your DAGs here!
- utilize serverless clusters on Databricks Free Edition to deploy your pipelines.
- utilize Databricks job clusters to reduce costs.
- define Databricks clusters on AWS and Azure.
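For example, a transformation can live in its own class and have its execution time logged with funcy's `log_durations` decorator. The sketch below is only illustrative: the class, column, and logger names are hypothetical and not part of the template.

```python
# A minimal sketch (hypothetical names) of a class-based transformation
# whose execution time is logged with funcy's log_durations decorator.
import logging

from funcy import log_durations
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

logger = logging.getLogger(__name__)


class SilverOrders:
    """Example silver-layer transformation: cleans raw bronze orders."""

    @staticmethod
    @log_durations(logger.info)
    def transform(bronze_orders: DataFrame) -> DataFrame:
        # Deduplicate and keep only rows with a positive amount.
        return (
            bronze_orders
            .dropDuplicates(["order_id"])
            .where(F.col("amount") > 0)
            .withColumn("ingested_at", F.current_timestamp())
        )
```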
For a debate on the use of notebooks vs. Python packaging, please refer to:
- The Rise of The Notebook Engineer
- Please don’t make me use Databricks notebooks
- this LinkedIn thread
- this LinkedIn thread
Sessions on Databricks Asset Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:
- CI/CD for Databricks: Advanced Asset Bundles and GitHub Actions
- Deploying Databricks Asset Bundles (DABs) at Scale
- A Prescription for Success: Leveraging DABs for Faster Deployment and Better Patient Outcomes
option 1) utilize a Databricks Free Edition workspace.
option 2) create a Premium workspace. Follow the instructions here
Follow the instructions here
make install && make test
You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
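Unit tests run against a plain local SparkSession, so no workspace is needed for fast feedback. A minimal pytest sketch (the fixture and assertions are illustrative, not the template's actual tests):

```python
# A minimal, self-contained pytest sketch using a local SparkSession.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # A local single-threaded session is enough for unit tests.
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_filters_out_non_positive_amounts(spark):
    df = spark.createDataFrame([("o1", 10.0), ("o2", -5.0)], ["order_id", "amount"])
    result = df.where(F.col("amount") > 0)
    assert result.count() == 1
    assert result.first()["order_id"] == "o1"
```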
option 1) for Databricks Free Edition use:
make deploy-serverless env=dev
make deploy-serverless env=staging
make deploy-serverless env=prod
option 2) for a Premium workspace:
Update the "job_clusters" properties in the wf_template.yml file. There are different properties for AWS and Azure.
make deploy env=dev
make deploy env=staging
make deploy env=prod
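The environment-specific values are injected into wf_template.yml with Jinja before deployment. A rough sketch of that idea (the file paths and variable names below are assumptions, not the template's exact interface):

```python
# A rough sketch (hypothetical paths and variables) of rendering
# wf_template.yml with environment-specific values using Jinja2.
from jinja2 import Template

with open("wf_template.yml") as f:  # path is illustrative
    template = Template(f.read())

rendered = template.render(env="dev", node_type="Standard_DS3_v2")

with open("wf_dev.yml", "w") as f:
    f.write(rendered)
```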
Configure the GitHub Actions repository secrets DATABRICKS_HOST and DATABRICKS_TOKEN.
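Both the Databricks CLI used for deployment and the Databricks SDK for Python read these values from the environment. A quick sketch of verifying authentication with the SDK:

```python
# A quick sketch: the Databricks SDK authenticates from the
# DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # no arguments needed when the env vars are set
me = w.current_user.me()
print(f"Authenticated to {w.config.host} as {me.user_name}")
```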
- task (required) - determines the current task to be executed.
- env (required) - determines the AWS account where the job is running. This parameter also defines the default catalog for the task.
- user (required) - determines the name of the catalog when env is "dev".
- schema (optional) - determines the default schema to read/store tables.
- skip (optional) - determines if the current task should be skipped.
- debug (optional) - determines if the current task should go through the debug conditional.
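These parameters map naturally onto an argparse-based CLI. The sketch below is an assumption of how such an interface could look; the template's actual flag names, types, and defaults may differ.

```python
# A sketch (assumed flag names/types) of an argparse CLI for the
# task parameters described above.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run an ETL task.")
    parser.add_argument("--task", required=True, help="Task to be executed.")
    parser.add_argument("--env", required=True, choices=["dev", "staging", "prod"],
                        help="Target environment; also sets the default catalog.")
    parser.add_argument("--user", required=True,
                        help="Catalog name to use when env is 'dev'.")
    parser.add_argument("--schema", default=None,
                        help="Default schema to read/store tables.")
    parser.add_argument("--skip", action="store_true",
                        help="Skip the current task.")
    parser.add_argument("--debug", action="store_true",
                        help="Go through the debug conditional.")
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```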