This repo contains examples of how to configure and deploy the Azure Databricks platform-as-a-service offering. It also contains examples of Python-based Databricks notebooks that read and write files in an Azure Data Lake Storage Gen2 instance and in a Snowflake data warehouse.
Databricks is a lakehouse platform built for the cloud. It combines the best of data warehouses and data lakes to offer an open, unified platform for data and AI.
Spark is the engine that powers data access, mutation, and persistence in Databricks. In fact, Databricks was founded in 2013 by the creators of Spark (who also created Delta Lake and MLflow). Databricks ships a version of Spark that tracks the latest release as closely as possible and includes features available only in Databricks.
Every Spark SQL table consists of two components:
- The metadata information that stores the schema
- The data itself
There are two types of Spark SQL Tables:
- Managed
- Unmanaged
A managed table is a Spark SQL table for which Spark manages both the data and the metadata.
The metadata (schema information) is stored in the metastore.
The data can be stored in DBFS or in an external Hadoop-compatible storage platform.
Since Spark SQL manages the table, running `DROP TABLE example_data` deletes both the metadata and the data.
To retrieve the details of the Azure Databricks first-party service principal (for example, to review its permissions):

$> az ad sp show --id 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d > azure_dbx_permission_list.json
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/secrets
- https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#create-an-azure-key-vault-backed-secret-scope-using-the-databricks-cli
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/#install-the-cli
- https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secrets
- https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion
- https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/read-avro
- https://docs.databricks.com/data/data-sources/read-avro.html
- https://www.geeksforgeeks.org/python-pandas-dataframe/
- https://mkaz.blog/code/python-string-format-cookbook/
- https://www.geeksforgeeks.org/get-yesterdays-date-using-python/
- https://dev.to/sridharanprasanna/using-wildcards-for-folder-path-with-spark-dataframe-load-4jo7
- https://stackoverflow.com/questions/32233575/read-all-files-in-a-nested-folder-in-spark
- https://sparkbyexamples.com/spark/spark-how-to-sort-dataframe-column-explained/
- https://sparkbyexamples.com/spark/spark-filter-rows-with-null-values/
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/workspace-management
- https://www.terraform.io/cli/commands/output
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/databricks_workspace
- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/terraform/azure-workspace
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/databricks_workspace#vnet_address_prefix
- https://techcommunity.microsoft.com/t5/azure-data-factory-blog/azure-databricks-activities-now-support-managed-identity/ba-p/1922818
- https://www.azenix.com.au/blog/databricks-on-azure-with-terraform