Databricks PySpark CI/CD

A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, CI/CD automation, Databricks Asset Bundles, and DQX data quality framework.

🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It aims to bring software engineering best practices—such as modular architecture, automated testing, and CI/CD—into the world of data engineering. By combining a clean project structure with robust development and deployment workflows, this template helps teams move faster with confidence.

You’re encouraged to adapt the structure and tooling to suit your project’s specific needs and environment.

Interested in bringing these principles into your own project? Let’s connect on LinkedIn.

🧪 Technologies Used

  • Databricks Free Edition (Serverless)
  • PySpark 3.4+
  • Databricks Asset Bundles
  • Databricks DQX
  • Databricks Jobs
  • Databricks Unity Catalog
  • Python 3.10+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • structure PySpark code inside classes/packages (see the sketch after this list).
  • structure unit tests for the data transformations and set up VS Code to run them on your local machine.
  • structure integration tests to be executed in different environments/catalogs.
  • utilize Databricks Asset Bundles to package/deploy/run a Python wheel package on Databricks.
  • utilize Databricks DQX to define and enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation.
  • utilize a medallion architecture pattern.

  • package and deploy code to different environments (dev, staging, prod) using a CI/CD pipeline with GitHub Actions.
  • isolate "dev" environments/catalogs to avoid concurrency issues between developers testing jobs.
  • configure the workflow to run in different environments with different parameters using the Jinja package.
  • configure the workflow to run tasks selectively.

  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • utilize pipenv/Pipfile to prepare local and remote environments.
  • utilize the pytest package to run unit tests on transformations and generate test coverage reports.
  • utilize the argparse package to build a flexible command-line interface to start the jobs.
  • utilize the funcy package to log the execution time of each transformation.
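
For the first two bullets in this list, here is a minimal sketch of how a transformation might be packaged as a class. All names (SilverOrders, the bronze_orders/silver_orders tables, the columns) are hypothetical rather than taken from this template, and funcy's log_durations decorator is shown only as one possible way to log execution time:

    import logging

    from funcy import log_durations
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    logger = logging.getLogger(__name__)


    class SilverOrders:
        """Hypothetical silver-layer step: cleans bronze orders into a silver table."""

        def __init__(self, catalog: str, schema: str):
            self.source = f"{catalog}.{schema}.bronze_orders"
            self.target = f"{catalog}.{schema}.silver_orders"

        @log_durations(logger.info)  # funcy logs how long transform() takes
        def transform(self, df: DataFrame) -> DataFrame:
            # Pure DataFrame-in / DataFrame-out logic keeps the class unit-testable
            return (
                df.dropDuplicates(["order_id"])
                .withColumn("order_ts", F.to_timestamp("order_ts"))
                .filter(F.col("amount") > 0)
            )

        def run(self, spark) -> None:
            df = spark.read.table(self.source)
            self.transform(df).write.mode("overwrite").saveAsTable(self.target)

Keeping transform() free of any I/O is what makes it straightforward to unit test locally (see step 3 of the instructions below).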

🧠 Resources

For a debate on the use of notebooks vs. Python packaging, please refer to:

Sessions on Databricks Asset Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Jobs (formerly Workflows)



Task Output



Data Lineage (Catalog Explorer)



Data Quality (generated by Databricks DQX)



CI/CD pipeline



Instructions

1) Create a Databricks Workspace

option 1) utilize a Databricks Free Edition workspace.

option 2) create a Premium workspace. Follow instructions here

2) Install and configure Databricks CLI on your local machine

Follow the instructions here

3) Build Python env and execute unit tests on your local machine

    make install && make test

You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
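
As a rough illustration (not the template's actual test code), a unit test for a transformation like the hypothetical SilverOrders class sketched in the Features section could rely on a local SparkSession fixture:

    import pytest
    from pyspark.sql import SparkSession

    # Hypothetical import path; adjust to wherever the transformation class lives
    from my_package.silver_orders import SilverOrders


    @pytest.fixture(scope="session")
    def spark():
        # Local SparkSession so the test runs on your machine, no workspace needed
        return (
            SparkSession.builder.master("local[1]")
            .appName("unit-tests")
            .getOrCreate()
        )


    def test_silver_orders_drops_duplicates_and_bad_rows(spark):
        df = spark.createDataFrame(
            [
                ("o1", "2024-01-01 00:00:00", 10.0),
                ("o1", "2024-01-01 00:00:00", 10.0),  # duplicate order_id
                ("o2", "2024-01-02 00:00:00", -5.0),  # non-positive amount
            ],
            ["order_id", "order_ts", "amount"],
        )

        result = SilverOrders("dev_catalog", "sales").transform(df)

        assert result.count() == 1
        assert result.first()["order_id"] == "o1"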

4) Deploy and execute on the dev workspace

option 1) for Databricks Free Edition use:

    make deploy-serverless env=dev
    make deploy-serverless env=staging
    make deploy-serverless env=prod

option 2) for Premium workspace:

    Update "job_clusters" properties on wf_template.yml file. There are different properties for AWS and Azure.

    make deploy env=dev
    make deploy env=staging
    make deploy env=prod

5) Configure CI/CD automation

Configure the GitHub Actions repository secrets DATABRICKS_HOST and DATABRICKS_TOKEN.

Task parameters


  • task (required) - determines the current task to be executed.
  • env (required) - determines the AWS account where the job is running. This parameter also defines the default catalog for the task.
  • user (required) - determines the name of the catalog when env is "dev".
  • schema (optional) - determines the default schema to read/store tables.
  • skip (optional) - determines if the current task should be skipped.
  • debug (optional) - determines if the current task should go through debug conditional.
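
A minimal argparse sketch that mirrors the parameters above might look like this; the actual entry point in the template may use different flag names, defaults, and types (for example, Databricks job parameters are passed as strings), so treat it as an illustration only:

    import argparse


    def parse_args(argv=None):
        # CLI mirroring the task parameters documented above (names are illustrative)
        parser = argparse.ArgumentParser(description="Run a single pipeline task")
        parser.add_argument("--task", required=True, help="task to execute")
        parser.add_argument("--env", required=True, choices=["dev", "staging", "prod"],
                            help="environment and default catalog for the task")
        parser.add_argument("--user", required=True, help="catalog name when env is 'dev'")
        parser.add_argument("--schema", default=None, help="default schema to read/store tables")
        parser.add_argument("--skip", action="store_true", help="skip the current task")
        parser.add_argument("--debug", action="store_true", help="go through the debug conditional")
        return parser.parse_args(argv)


    if __name__ == "__main__":
        args = parse_args()
        print(f"Running task={args.task} in env={args.env} for user={args.user}")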
