This repo contains examples of how to configure and deploy the Azure Databricks platform-as-a-service offering. It also contains examples of Python-based Databricks notebooks that read and write files in an Azure Data Lake Storage Gen2 instance and in a Snowflake data warehouse.
Databricks is a lakehouse platform built for the cloud. It combines the best of data warehouses and data lakes to offer an open, unified platform for data and AI.
Spark is the engine that powers data access, mutation, and persistence in Databricks. In fact, Databricks was founded in 2013 by the creators of Spark (who also created Delta Lake and MLflow). Databricks ships a version of Spark that tracks the latest release as closely as possible and includes features available only in Databricks.
Every Spark SQL table consists of two components:
- The metadata information that stores the schema
- The data itself
There are two types of Spark SQL Tables:
- Managed
- Unmanaged
A managed table is a Spark SQL table for which Spark manages both the data and the metadata.
The metadata (schema information) is stored in the metastore.
The data can be stored in DBFS or in an external Hadoop-compatible storage platform.
Since Spark SQL manages the table, running `DROP TABLE example_data` deletes both the metadata and the data.
To retrieve the details of the Azure Databricks first-party service principal (for example, to review its permissions):

$> az ad sp show --id 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d > azure_dbx_permission_list.json
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/secrets
- https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#create-an-azure-key-vault-backed-secret-scope-using-the-databricks-cli
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/#install-the-cli
- https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secrets
- https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion
- https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/read-avro
- https://docs.databricks.com/data/data-sources/read-avro.html
- https://www.geeksforgeeks.org/python-pandas-dataframe/
- https://mkaz.blog/code/python-string-format-cookbook/
- https://www.geeksforgeeks.org/get-yesterdays-date-using-python/
- https://dev.to/sridharanprasanna/using-wildcards-for-folder-path-with-spark-dataframe-load-4jo7
- https://stackoverflow.com/questions/32233575/read-all-files-in-a-nested-folder-in-spark
- https://sparkbyexamples.com/spark/spark-how-to-sort-dataframe-column-explained/
- https://sparkbyexamples.com/spark/spark-filter-rows-with-null-values/
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/workspace-management
- https://www.terraform.io/cli/commands/output
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/databricks_workspace
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/azure-workspace
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/databricks_workspace#vnet_address_prefix
- https://techcommunity.microsoft.com/t5/azure-data-factory-blog/azure-databricks-activities-now-support-managed-identity/ba-p/1922818
- https://www.azenix.com.au/blog/databricks-on-azure-with-terraform