michael-griehm/azure-databricks

azure-data-bricks

This repo contains examples of how to configure and deploy the Azure Databricks platform-as-a-service offering. It also contains examples of Python-based Databricks notebooks that read and write files in Azure Data Lake Storage Gen2 and the Snowflake data warehouse.

What is Databricks?

Databricks is a lakehouse platform built for the cloud.

Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.

The engine that powers Databricks' interaction with big data is called Spark.

Spark

Spark is the data access, mutation, and persistence engine that powers Databricks. In fact, Databricks was founded in 2013 by the creators of Spark (they also created Delta Lake and MLflow). Databricks ships a version of Spark that tracks the latest release as closely as possible and adds features that are only available in Databricks.

Spark SQL Tables

Every Spark SQL table consists of two components:

  • The metadata information that stores the schema
  • The data itself

There are two types of Spark SQL Tables:

  • Managed
  • Unmanaged

Managed Table

A managed table is a Spark SQL table for which Spark manages both the data and the metadata.

The metadata is stored within the DBFS.

The data can be stored in DBFS or on an external Hadoop-compatible storage platform.

Because Spark SQL manages the table, running DROP TABLE example_data deletes both the metadata and the data.
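
The drop behavior described above can be sketched with a toy model in plain Python. This is not Spark's API: ToyCatalog, create_table, drop_table, and demo are hypothetical names made up for illustration, and the file layout is an assumption. The external_data table is added only for contrast; it reflects the general Spark behavior that dropping an unmanaged table removes the metadata and leaves the data files in place, which this README does not spell out.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCatalog:
    """Toy stand-in for a metastore; illustration only, not Spark's API."""

    def __init__(self, warehouse_dir):
        self.warehouse = Path(warehouse_dir)
        self.tables = {}  # table name -> metadata entry

    def create_table(self, name, schema, rows, location=None):
        # No explicit location means the catalog owns the data (managed).
        managed = location is None
        data_dir = self.warehouse / name if managed else Path(location)
        data_dir.mkdir(parents=True, exist_ok=True)
        (data_dir / "part-0000.csv").write_text("\n".join(rows))
        self.tables[name] = {
            "schema": schema,
            "location": data_dir,
            "managed": managed,
        }

    def drop_table(self, name):
        # The metadata entry is always removed.
        entry = self.tables.pop(name)
        # The data files are removed only when the catalog manages them.
        if entry["managed"]:
            shutil.rmtree(entry["location"])
        return entry["location"]


def demo():
    """Return (managed data still exists, external data still exists)."""
    with tempfile.TemporaryDirectory() as root:
        catalog = ToyCatalog(Path(root) / "warehouse")
        catalog.create_table("example_data", ["id", "name"], ["1,a", "2,b"])
        catalog.create_table(
            "external_data", ["id"], ["1"], location=Path(root) / "external"
        )
        managed_dir = catalog.drop_table("example_data")
        external_dir = catalog.drop_table("external_data")
        return managed_dir.exists(), external_dir.exists()
```

Calling demo() reports whether each table's data directory survived the drop. In the toy model, the managed table's directory is gone and the external one remains.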

Useful Commands

Get Databricks API Permission Ids

$> az ad sp show --id 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d >azure_dbx_permission_list.json
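
Once the command above has written azure_dbx_permission_list.json, the permission IDs can be pulled out of it with a few lines of Python. This is a minimal sketch, not a definitive parser: the function names are hypothetical, and the key names appRoles, oauth2PermissionScopes, and oauth2Permissions are assumptions about the JSON layout, which varies between Azure CLI versions.

```python
import json


def extract_permission_ids(sp_doc):
    """Return {permission name: permission id} from a parsed service principal document."""
    ids = {}
    # Assumed keys; which of them appear depends on the Azure CLI version.
    for key in ("appRoles", "oauth2PermissionScopes", "oauth2Permissions"):
        # A key may be missing or null, so fall back to an empty list.
        for entry in sp_doc.get(key) or []:
            ids[entry.get("value")] = entry.get("id")
    return ids


def load_permission_ids(path="azure_dbx_permission_list.json"):
    """Read the file written by the az command and extract the permission IDs."""
    with open(path, encoding="utf-8") as handle:
        return extract_permission_ids(json.load(handle))
```

Calling load_permission_ids() in the directory where the command was run returns a dictionary that maps each permission name to its ID. If the file was written by a shell that redirects output in a different encoding, the encoding argument would need to change.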

Useful Links

Get DBx Service Principal Token on behalf of User

Azure Databricks Pricing

Databricks CLI

Setup Key Vault Secret Scope

Set Secret

Access Azure Data Lake Gen 2 from Azure Databricks

Convert Binary to String with DataFrame and Python

Avro Guide

Pandas and Pythons

Sort and Filter a Data Frame

PySpark Data Types

Mount Databricks to ADLS Gen 2

VS Code Tooling

Terraform and Azure Databricks

Quartz Cron Job

GitHub Actions Set Environment Variable in Step

Deltalake

References
