Databricks CI/CD


This is a tool for building CI/CD pipelines for Databricks. It is a Python package that works in conjunction with a Git repository (or a plain file structure) to validate and deploy content to Databricks. Currently, it can handle the following content:

  • Workspace - a collection of notebooks written in Scala, Python, R or SQL
  • Jobs - list of Databricks jobs
  • Clusters
  • Instance Pools
  • DBFS - an arbitrary collection of files that may be deployed on a Databricks workspace

Installation

pip install databricks-cicd

Requirements

To use this tool, you need a source directory (preferably a private Git repository) with the following structure:

any_local_folder_or_git_repo/
├── workspace/
│   ├── some_notebooks_subdir/
│   │   └── Notebook 1.py
│   ├── Notebook 2.sql
│   ├── Notebook 3.r
│   └── Notebook 4.scala
├── jobs/
│   ├── My first job.json
│   └── Side gig.json
├── clusters/
│   ├── orion.json
│   └── Another cluster.json
├── instance_pools/
│   ├── Pool 1.json
│   └── Pool 2.json
└── dbfs/
    ├── strawberry_jam.jar
    ├── subdir/
    │   └── some_other.jar
    ├── some_python.egg
    └── Ice cream.jpeg

Note: All folder names shown here are defaults and can be configured; this is just a sample layout.
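
If you are starting a repo from scratch, one way to scaffold the default skeleton is a single bash one-liner (the top-level folder name is just a placeholder):

mkdir -p any_local_folder_or_git_repo/{workspace,jobs,clusters,instance_pools,dbfs}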

Usage

For the latest options and commands run:

cicd -h

A sample command could be:

cicd deploy \
   -w sample_12432.7.azuredatabricks.net \
   -u john.smith@domain.com \
   -t dapi_sample_token_0d5-2 \
   -lp '~/git/my-private-repo' \
   -tp /blabla \
   -c DEV.ini \
   --verbose

Note: On Windows, paths need to be wrapped in double quotes.

The default configuration is defined in default.ini and can be overridden with a custom .ini file using the -c option, usually one config file per target environment (see the sample).
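
As an illustration of the override pattern only, a DEV.ini would pin environment-specific values while everything else falls through to default.ini. The section and key names below are hypothetical, so consult default.ini for the real schema:

[workspace]
; hypothetical keys -- check default.ini for the actual option names
source_folder = workspace
target_path = /DEV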

Create content

Notebooks:

  1. Add a notebook to source
    1. In the Databricks UI, open your notebook.
    2. Click File -> Export -> Source file.
    3. Add that file to the workspace folder of this repo without changing the file name (a CLI alternative is sketched below).
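
If you prefer the command line to the UI, the legacy Databricks CLI can also export a notebook in source format (the workspace path below is illustrative):

databricks workspace export --format SOURCE '/Users/john.smith@domain.com/Notebook 1' 'workspace/Notebook 1.py'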

Jobs:

  1. Add a job to source
    1. Get the source of the job and write it to a file. You need the Databricks CLI and jq installed. On Windows, it is easiest to rename jq-win64.exe to jq.exe and place it in the C:\Windows\System32 folder. Then, on Windows/Linux/macOS:

      databricks jobs get --job-id 74 | jq .settings > Job_Name.json
      

      This downloads the source JSON of the job from the Databricks server, extracts only the settings, and writes them to a file.

      Note: The file name should match the job name within the JSON file (see the sketch after these steps). Please avoid spaces in names.

    2. Add that file to the jobs folder
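
To illustrate the naming rule, a job file named Nightly_ETL.json should carry the same name in its settings. All values below are illustrative Jobs API 2.0 settings, not output from a real workspace:

{
  "name": "Nightly_ETL",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/some_notebooks_subdir/Notebook 1"
  }
}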

Clusters:

  1. Add a cluster to source
    1. Get the source of the cluster and write it to a file:

      databricks clusters get --cluster-name orion > orion.json

      Note: The file name should match the cluster name within the JSON file. Please avoid spaces in names.
    2. Add that file to the clusters folder

Instance pools:

  1. Add an instance pool to source
    1. Similar to clusters; just use instance-pools instead of clusters (an illustrative command follows).
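
For example, assuming the legacy Databricks CLI and a made-up pool ID:

databricks instance-pools get --instance-pool-id 0101-120000-pool-sample > Pool_1.json

Note: As with clusters, the file name should match the pool name within the JSON file.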

DBFS:

  1. Add a file to dbfs
    1. Just add the file to the dbfs folder.