---
layout: "databricks"
page_title: "Provider: Databricks"
sidebar_current: "docs-databricks-index"
description: |-
  Terraform provider databricks.
---

Databricks Provider

Use the Databricks Terraform provider to interact with almost all Databricks resources. If you're new to Databricks, please follow the guide to create a workspace on Azure or AWS, and then this workspace management tutorial. If you're migrating from version 0.2.x, please follow this guide. The changelog is available on GitHub.

Resources

  • Compute resources
  • Storage
  • Security
  • E2 Architecture
  • Databricks SQL

Example Usage

provider "databricks" {
}

data "databricks_current_user" "me" {}
data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/Terraform"
  language = "PYTHON"
  content_base64 = base64encode(<<-EOT
    # created from ${abspath(path.module)}
    display(spark.range(10))
    EOT
  )
}

resource "databricks_job" "this" {
  name = "Terraform Demo (${data.databricks_current_user.me.alphanumeric})"

  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  notebook_task {
    notebook_path = databricks_notebook.this.path
  }

  email_notifications {}
}

output "notebook_url" {
  value = databricks_notebook.this.url
}

output "job_url" {
  value = databricks_job.this.url
}

Authentication

!> Warning Hard-coding credentials in plain text is not recommended. We strongly recommend using a Terraform backend that supports encryption. Please use environment variables, the ~/.databrickscfg file, encrypted .tfvars files, or a secret store of your choice (HashiCorp Vault, AWS Secrets Manager, AWS Parameter Store, Azure Key Vault).

There are currently three supported methods for authenticating with the Databricks platform to create resources:

Authenticating with Databricks CLI credentials

If no configuration options are given to your provider, it will look up configured credentials in the ~/.databrickscfg file, which is created by the databricks configure --token command. Check this page for more details. The provider uses config file credentials only when the host/token or azure_auth options are not specified. This is the recommended way to use the Databricks Terraform provider, in case you're already using the same approach with the AWS Shared Credentials File or Azure CLI authentication.

provider "databricks" {
}

You can specify a non-standard location of the configuration file through the config_file parameter or the DATABRICKS_CONFIG_FILE environment variable:

provider "databricks" {
  config_file = "/opt/databricks/cli-config"
}

You can specify a CLI connection profile through the profile parameter or the DATABRICKS_CONFIG_PROFILE environment variable:

provider "databricks" {
  profile = "ML_WORKSPACE"
}

Authenticating with hostname and token

You can use the host and token parameters to supply credentials to the workspace. If environment variables are preferred, you can specify DATABRICKS_HOST and DATABRICKS_TOKEN instead. Environment variables are the second most recommended way of configuring this provider.

provider "databricks" {
  host  = "https://abc-cdef-ghi.cloud.databricks.com"
  token = "dapitokenhere"
}

Authenticating with hostname, username, and password

!> Warning This approach is currently recommended only for provisioning AWS workspaces and should be avoided for regular use.

You can use the username + password attributes to authenticate the provider for E2 workspace setup. The respective DATABRICKS_USERNAME and DATABRICKS_PASSWORD environment variables are applicable as well.

provider "databricks" {
  host = "https://accounts.cloud.databricks.com"
  username = var.user
  password = var.password
}

Argument Reference

-> Note If you experience technical difficulties with rolling out resources in this example, please make sure that environment variables don't conflict with other provider block attributes. When in doubt, please run TF_LOG=DEBUG terraform apply to enable debug mode through the TF_LOG environment variable. Look specifically for the Explicit and implicit attributes lines, which should indicate the authentication attributes used.

The provider block supports the following arguments:

  • host - (optional) This is the host of the Databricks workspace. It is a URL that you use to login to your workspace. Alternatively, you can provide this value as an environment variable DATABRICKS_HOST.
  • token - (optional) This is the API token to authenticate into the workspace. Alternatively, you can provide this value as an environment variable DATABRICKS_TOKEN.
  • username - (optional) This is the username of the user that can log into the workspace. Alternatively, you can provide this value as an environment variable DATABRICKS_USERNAME. Recommended only for creating workspaces in AWS.
  • password - (optional) This is the user's password that can log into the workspace. Alternatively, you can provide this value as an environment variable DATABRICKS_PASSWORD. Recommended only for creating workspaces in AWS.
  • config_file - (optional) Location of the Databricks CLI credentials file created by the databricks configure --token command (~/.databrickscfg by default). Check the Databricks CLI documentation for more details. The provider uses configuration file credentials when you don't specify the host/token/username/password/azure attributes. Alternatively, you can provide this value as an environment variable DATABRICKS_CONFIG_FILE.
  • profile - (optional) Connection profile specified within ~/.databrickscfg. Please check the connection profiles section for more details. This field defaults to DEFAULT.
  • account_id - (optional) Account ID, which can be found in the bottom left corner of the Accounts Console. Alternatively, you can provide this value as an environment variable DATABRICKS_ACCOUNT_ID. Only has effect when host = "https://accounts.cloud.databricks.com/", and is currently used to provision account admins via databricks_user, as shown in the sketch after this list. In future releases of the provider this property will also be used to specify the account for databricks_mws_* resources.
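
As a minimal sketch of the account_id argument, the following configuration provisions an account admin against the accounts console (the account ID and user name are placeholders, and var.user/var.password are assumed to hold account-console credentials):

provider "databricks" {
  host       = "https://accounts.cloud.databricks.com"
  account_id = "00000000-0000-0000-0000-000000000000" # placeholder account ID
  username   = var.user
  password   = var.password
}

# hypothetical account admin user
resource "databricks_user" "account_admin" {
  user_name = "admin@example.com"
}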

Special configurations for Azure

The provider works with Azure CLI authentication to facilitate local development workflows. For automated scenarios, however, service principal or managed identity authentication is necessary (specified through the azure_use_msi, azure_client_id, azure_client_secret and azure_tenant_id parameters).

Authenticating with Azure MSI

Since v0.3.8, it's possible to leverage Azure Managed Service Identity authentication, which uses the same environment variables as the azurerm provider. Both SystemAssigned and UserAssigned identities work, as long as they have the Contributor role at the subscription level and created the workspace resource, or are added directly to the workspace through databricks_service_principal.

provider "databricks" {
  host = data.azurerm_databricks_workspace.this.workspace_url
  
  # ARM_USE_MSI environment variable is recommended
  azure_use_msi = true 
}

Authenticating with Azure CLI

It's possible to use Azure CLI authentication, where the provider relies on the access token cached by the az login command, so that local development scenarios are possible. Technically, the provider calls az account get-access-token each time before an access token is about to expire.

provider "azurerm" {
  features {}
}

resource "azurerm_databricks_workspace" "this" {
  location                      = "centralus"
  name                          = "my-workspace-name"
  resource_group_name           = var.resource_group
  sku                           = "premium"
}

provider "databricks" {
  host = azurerm_databricks_workspace.this.workspace_url
}

resource "databricks_user" "my-user" {
  user_name     = "test-user@databricks.com"
  display_name  = "Test User"
}

Authenticating with Azure Service Principal

!> Warning Please note that Azure service principal authentication currently (since v0.3.7) uses the AAD token for authentication (the SPN should have the Contributor role on the Databricks workspace). You can restore the previous behaviour (generating a PAT for the service principal) by setting azure_use_pat_for_spn to true (you can regulate the lifetime of the generated PAT with the pat_token_duration_seconds setting). Azure Databricks does not yet support AAD tokens for secret scopes; the Databricks Labs team will refactor this transparently once that support is available. The only impacted field is pat_token_duration_seconds, which will be deprecated once AAD is fully supported.

provider "azurerm" {
  client_id         = var.client_id
  client_secret     = var.client_secret
  tenant_id         = var.tenant_id
  subscription_id   = var.subscription_id
}

resource "azurerm_databricks_workspace" "this" {
  location                      = "centralus"
  name                          = "my-workspace-name"
  resource_group_name           = var.resource_group
  sku                           = "premium"
}

provider "databricks" {
  host                = azurerm_databricks_workspace.this.workspace_url
  azure_client_id     = var.client_id
  azure_client_secret = var.client_secret
  azure_tenant_id     = var.tenant_id
}

resource "databricks_user" "my-user" {
  user_name = "test-user@databricks.com"
}

The provider block additionally supports the following Azure-specific arguments:

  • azure_workspace_resource_id - (optional) id attribute of the azurerm_databricks_workspace resource: a combination of the subscription ID, resource group name, and workspace name. This field is deprecated since v0.3.8 in favor of host = azurerm_databricks_workspace.this.workspace_url.
  • azure_workspace_name - (optional) This is the name of your Azure Databricks Workspace. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_WORKSPACE_NAME. Not needed if azure_workspace_resource_id is set. Deprecated since v0.3.8.
  • azure_resource_group - (optional) This is the resource group in which your Azure Databricks Workspace resides. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_RESOURCE_GROUP. Not needed if azure_workspace_resource_id is set. Deprecated since v0.3.8.
  • azure_subscription_id - (optional) This is the Azure Subscription ID in which your Azure Databricks Workspace resides. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_SUBSCRIPTION_ID or ARM_SUBSCRIPTION_ID. Not needed if azure_workspace_resource_id is set. Deprecated since v0.3.8.
  • azure_client_secret - (optional) This is the Azure Enterprise Application (service principal) client secret. This service principal requires Contributor access to your Azure Databricks deployment. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_CLIENT_SECRET or ARM_CLIENT_SECRET.
  • azure_client_id - (optional) This is the Azure Enterprise Application (service principal) client ID. This service principal requires Contributor access to your Azure Databricks deployment. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_CLIENT_ID or ARM_CLIENT_ID.
  • azure_tenant_id - (optional) This is the Azure Active Directory tenant ID in which the Enterprise Application (service principal) resides. Alternatively, you can provide this value as an environment variable DATABRICKS_AZURE_TENANT_ID or ARM_TENANT_ID.
  • azure_environment - (optional) This is the Azure environment, which defaults to the public cloud. Other options are german, china and usgovernment. Alternatively, you can provide this value as an environment variable ARM_ENVIRONMENT.
  • azure_use_msi - (optional) Use Azure Managed Service Identity authentication. Alternatively, you can provide this value as an environment variable ARM_USE_MSI.
  • pat_token_duration_seconds - The PAT-based implementation of Azure auth via a service principal requires the provider to create a temporary personal access token within Databricks, because the AAD implementation does not cover all the APIs for authentication. This field determines the duration, in seconds, for which that temporary PAT token is alive; it defaults to 3600 seconds. Deprecated since v0.3.8.

There are multiple environment variable options: the DATABRICKS_AZURE_* environment variables take precedence, and the ARM_* environment variables provide a way to share authentication configuration between the databricks provider and the azurerm provider.
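
As an illustrative sketch, assuming a workspace deployed in the Azure US Government cloud (all var.* values are placeholders), the same attributes can be set explicitly in the provider block instead of through environment variables:

provider "databricks" {
  host                = azurerm_databricks_workspace.this.workspace_url
  azure_client_id     = var.client_id     # or ARM_CLIENT_ID
  azure_client_secret = var.client_secret # or ARM_CLIENT_SECRET
  azure_tenant_id     = var.tenant_id     # or ARM_TENANT_ID
  azure_environment   = "usgovernment"    # or ARM_ENVIRONMENT
}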

Miscellaneous configuration parameters

This section covers configuration parameters not related to authentication. They can be used to debug problems or to further tune the provider's behaviour; a combined sketch follows the list:

  • rate_limit - defines maximum number of requests per second made to Databricks REST API by Terraform. Default is 15.
  • debug_truncate_bytes - Applicable only when TF_LOG=DEBUG is set. Truncate JSON fields in HTTP requests and responses above this limit. Default is 96.
  • debug_headers - Applicable only when TF_LOG=DEBUG is set. Debug HTTP headers of requests made by the provider. Default is false. We recommend turning this flag on only in exceptional circumstances, when troubleshooting authentication issues. Turning this flag on will log the first debug_truncate_bytes of any HTTP header value in cleartext.
  • skip_verify - skips SSL certificate verification for HTTP calls. Use at your own risk. Default is false (don't skip verification).
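
A minimal sketch combining these parameters (the values are arbitrary illustrations, not recommendations):

provider "databricks" {
  # authentication is resolved as usual (environment variables or ~/.databrickscfg)
  rate_limit           = 10  # throttle Terraform to 10 requests per second
  debug_truncate_bytes = 250 # log larger JSON payloads when TF_LOG=DEBUG
}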

Environment variables

The following configuration attributes can be passed via environment variables:

| Argument             | Environment variable            |
|----------------------|---------------------------------|
| host                 | DATABRICKS_HOST                 |
| token                | DATABRICKS_TOKEN                |
| username             | DATABRICKS_USERNAME             |
| password             | DATABRICKS_PASSWORD             |
| account_id           | DATABRICKS_ACCOUNT_ID           |
| config_file          | DATABRICKS_CONFIG_FILE          |
| profile              | DATABRICKS_CONFIG_PROFILE       |
| azure_client_secret  | ARM_CLIENT_SECRET               |
| azure_client_id      | ARM_CLIENT_ID                   |
| azure_tenant_id      | ARM_TENANT_ID                   |
| azure_use_msi        | ARM_USE_MSI                     |
| azure_environment    | ARM_ENVIRONMENT                 |
| debug_truncate_bytes | DATABRICKS_DEBUG_TRUNCATE_BYTES |
| debug_headers        | DATABRICKS_DEBUG_HEADERS        |
| rate_limit           | DATABRICKS_RATE_LIMIT           |

Empty provider block

For example, with the following zero-argument configuration:

provider "databricks" {}

  1. Provider will check all the supported environment variables and set the values of the relevant arguments.
  2. In case any conflicting arguments are present, the plan will end with an error.
  3. Will check for the presence of a host + token pair, and continue trying otherwise.
  4. Will check for the presence of host + username + password, and continue trying otherwise.
  5. Will check for the presence of an Azure workspace ID and azure_client_secret + azure_client_id + azure_tenant_id, and continue trying otherwise.
  6. Will check for the availability of Azure MSI, if enabled via azure_use_msi, and continue trying otherwise.
  7. Will check for the presence of an Azure workspace ID and whether the Azure CLI returns an access token, and continue trying otherwise.
  8. Will check for the ~/.databrickscfg file in the home directory, and fail otherwise.
  9. Will check for the presence of the specified profile and try picking credentials from that file, failing otherwise.
  10. Will check for a host and token or username + password combination within that profile, failing if none of these exist.

Data resources and Authentication is not configured errors

In Terraform 0.13 and later, data resources have the same dependency resolution behavior as managed resources. Most data resources make an API call to a workspace. If a workspace doesn't exist yet, an authentication is not configured for provider error is raised. To work around this issue and guarantee proper lazy authentication with data resources, you should add depends_on = [azurerm_databricks_workspace.this] or depends_on = [databricks_mws_workspaces.this] to the body, as in the sketch below. This issue doesn't occur if the workspace is created in one module and the resources within the workspace are created in another. We do not recommend using Terraform 0.12 and earlier if your usage involves data resources.
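
A minimal sketch of the Azure case described above; databricks_current_user is just one example of a data resource that would otherwise fail before the workspace exists:

provider "databricks" {
  host = azurerm_databricks_workspace.this.workspace_url
}

data "databricks_current_user" "me" {
  # wait until the workspace exists before making the API call
  depends_on = [azurerm_databricks_workspace.this]
}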

Multiple Provider Configurations

The most common reason for technical difficulties might be a missing alias attribute in provider "databricks" {} blocks, or a missing provider attribute in resource "databricks_..." {} blocks, when using multiple provider configurations. Please make sure to read the alias: Multiple Provider Configurations documentation article.
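
A minimal sketch of two aliased provider configurations (the workspace hosts are placeholders):

provider "databricks" {
  alias = "dev"
  host  = "https://dev-workspace.cloud.databricks.com" # placeholder
}

provider "databricks" {
  alias = "prod"
  host  = "https://prod-workspace.cloud.databricks.com" # placeholder
}

resource "databricks_group" "admins" {
  # explicitly select the aliased configuration
  provider     = databricks.prod
  display_name = "Workspace Admins"
}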

Error while installing: registry does not have a provider

Error while installing hashicorp/databricks: provider registry
registry.terraform.io does not have a provider named
registry.terraform.io/hashicorp/databricks

If you notice the above error, it might be because the required_providers block is not defined in every module that uses the Databricks Terraform provider. Create a versions.tf file with the following contents:

# versions.tf
terraform {
  required_providers {
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.3.9"
    }
  }
}

... and copy the file to every module in your codebase. Our recommendation is to skip the version field in module-level versions.tf files and keep it only at the environment level.

├── environments
│   ├── sandbox
│   │   ├── README.md
│   │   ├── main.tf
│   │   └── versions.tf
│   └── production
│       ├── README.md
│       ├── main.tf
│       └── versions.tf
└── modules
    ├── first-module
    │   ├── ...
    │   └── versions.tf
    └── second-module
        ├── ...
        └── versions.tf

Project Support

Important: Projects in the databrickslabs GitHub account, including the Databricks Terraform Provider, are not formally supported by Databricks. They are maintained by Databricks Field teams and provided as-is. There is no service level agreement (SLA). Databricks makes no guarantees of any kind. If you discover an issue with the provider, please file a GitHub Issue on the repo, and it will be reviewed by project maintainers as time permits.