Databricks PySpark CI/CD

A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, CI/CD automation, Databricks Asset Bundles, and DQX data quality framework.

🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It aims to bring software engineering best practices—such as modular architecture, automated testing, and CI/CD—into the world of data engineering. By combining a clean project structure with robust development and deployment workflows, this template helps teams move faster with confidence.

You’re encouraged to adapt the structure and tooling to suit your project’s specific needs and environment.

Interested in bringing these principles into your own project? Let’s connect on LinkedIn.

🧪 Technologies Used

  • Databricks Free Edition (Serverless)
  • PySpark 3.4+
  • Databricks Asset Bundles
  • Databricks DQX
  • Databricks Jobs
  • Databricks Unity Catalog
  • Python 3.10+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • structure PySpark code inside classes/packages (see the sketch after this list).
  • structure unit tests for the data transformations and set up VS Code to run them on your local machine.
  • structure integration tests to be executed in different environments/catalogs.
  • utilize Databricks Asset Bundles to package/deploy/run a Python wheel package on Databricks.
  • utilize Databricks DQX to define and enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation.
  • utilize a medallion architecture pattern.

  • package and deploy code to different environments (dev, staging, prod) using a CI/CD pipeline with GitHub Actions.
  • isolate "dev" environments/catalogs to avoid concurrency issues between developers testing jobs.
  • configure the workflow to run in different environments with different parameters using the Jinja package.
  • configure the workflow to run tasks selectively.

  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • utilize pipenv/Pipfile to prepare local and remote environments.
  • utilize the pytest package to run unit tests on transformations and generate test coverage reports.
  • utilize the argparse package to build a flexible command-line interface to start the jobs.
  • utilize the funcy package to log the execution time of each transformation.
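
For the first two bullets in this list, here is a minimal sketch of how a transformation might be packaged as a class. All names (SilverOrders, the bronze_orders/silver_orders tables, the columns) are hypothetical rather than taken from this template, and funcy's log_durations decorator is shown only as one possible way to log execution time:

    import logging

    from funcy import log_durations
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    logger = logging.getLogger(__name__)


    class SilverOrders:
        """Hypothetical silver-layer step: cleans bronze orders into a silver table."""

        def __init__(self, catalog: str, schema: str):
            self.source = f"{catalog}.{schema}.bronze_orders"
            self.target = f"{catalog}.{schema}.silver_orders"

        @log_durations(logger.info)  # funcy logs how long transform() takes
        def transform(self, df: DataFrame) -> DataFrame:
            # Pure DataFrame-in / DataFrame-out logic keeps the class unit-testable
            return (
                df.dropDuplicates(["order_id"])
                .withColumn("order_ts", F.to_timestamp("order_ts"))
                .filter(F.col("amount") > 0)
            )

        def run(self, spark) -> None:
            df = spark.read.table(self.source)
            self.transform(df).write.mode("overwrite").saveAsTable(self.target)

Keeping transform() free of any I/O is what makes it straightforward to unit test locally (see step 3 of the instructions below).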

🧠 Resources

For a debate on the use of notebooks vs. Python packaging, please refer to:

Sessions on Databricks Asset Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Jobs (formerly Workflows)



Task Output



Data Lineage (Catalog Explorer)



Data Quality (generated by Databricks DQX)



CI/CD pipeline



Instructions

1) Create a Databricks Workspace

option 1) utilize a Databricks Free Edition workspace.

option 2) create a Premium workspace. Follow instructions here

2) Install and configure Databricks CLI on your local machine

Follow the instructions here

3) Build Python env and execute unit tests on your local machine

    make install && make test

You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
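
As a rough illustration (not the template's actual test code), a unit test for a transformation like the hypothetical SilverOrders class sketched in the Features section could rely on a local SparkSession fixture:

    import pytest
    from pyspark.sql import SparkSession

    # Hypothetical import path; adjust to wherever the transformation class lives
    from my_package.silver_orders import SilverOrders


    @pytest.fixture(scope="session")
    def spark():
        # Local SparkSession so the test runs on your machine, no workspace needed
        return (
            SparkSession.builder.master("local[1]")
            .appName("unit-tests")
            .getOrCreate()
        )


    def test_silver_orders_drops_duplicates_and_bad_rows(spark):
        df = spark.createDataFrame(
            [
                ("o1", "2024-01-01 00:00:00", 10.0),
                ("o1", "2024-01-01 00:00:00", 10.0),  # duplicate order_id
                ("o2", "2024-01-02 00:00:00", -5.0),  # non-positive amount
            ],
            ["order_id", "order_ts", "amount"],
        )

        result = SilverOrders("dev_catalog", "sales").transform(df)

        assert result.count() == 1
        assert result.first()["order_id"] == "o1"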

4) Deploy and execute on the dev workspace

option 1) for Databricks Free Edition use:

    make deploy-serverless env=dev
    make deploy-serverless env=staging
    make deploy-serverless env=prod

option 2) for Premium workspace:

    Update "job_clusters" properties on wf_template.yml file. There are different properties for AWS and Azure.

    make deploy env=dev
    make deploy env=staging
    make deploy env=prod

5) Configure CI/CD automation

Configure the GitHub Actions repository secrets DATABRICKS_HOST and DATABRICKS_TOKEN.

Task parameters


  • task (required) - determines the current task to be executed.
  • env (required) - determines the AWS account where the job is running. This parameter also defines the default catalog for the task.
  • user (required) - determines the name of the catalog when env is "dev".
  • schema (optional) - determines the default schema to read/store tables.
  • skip (optional) - determines if the current task should be skipped.
  • debug (optional) - determines if the current task should go through debug conditional.
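
A minimal argparse sketch that mirrors the parameters above might look like this; the actual entry point in the template may use different flag names, defaults, and types (for example, Databricks job parameters are passed as strings), so treat it as an illustration only:

    import argparse


    def parse_args(argv=None):
        # CLI mirroring the task parameters documented above (names are illustrative)
        parser = argparse.ArgumentParser(description="Run a single pipeline task")
        parser.add_argument("--task", required=True, help="task to execute")
        parser.add_argument("--env", required=True, choices=["dev", "staging", "prod"],
                            help="environment and default catalog for the task")
        parser.add_argument("--user", required=True, help="catalog name when env is 'dev'")
        parser.add_argument("--schema", default=None, help="default schema to read/store tables")
        parser.add_argument("--skip", action="store_true", help="skip the current task")
        parser.add_argument("--debug", action="store_true", help="go through the debug conditional")
        return parser.parse_args(argv)


    if __name__ == "__main__":
        args = parse_args()
        print(f"Running task={args.task} in env={args.env} for user={args.user}")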
