diff --git a/docs/using-anovos/setting-up/on_azure_databricks.md b/docs/using-anovos/setting-up/on_azure_databricks.md
index 6c9a7341..e5886884 100644
--- a/docs/using-anovos/setting-up/on_azure_databricks.md
+++ b/docs/using-anovos/setting-up/on_azure_databricks.md
@@ -500,3 +500,91 @@
 in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 The remaining steps are the same as above, so please continue with
 [Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
+
+## 3. Anovos on Azure Databricks using direct access to Azure Blob Storage containers
+
+### Step 3.1: Installing/Downloading Anovos
+
+This step is identical to
+[Step 1.1: Installing _Anovos_ on Azure Databricks](#step-11-installing-anovos-on-azure-databricks).
+
+### Step 3.2: Copy the dataset to an Azure Blob Storage container
+
+This step is identical to
+[Step 2.2: Copy the dataset to an Azure Blob Storage container](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
+
+### Step 3.3: Add the secret to the Spark configuration
+
+To access files in an Azure Blob Storage container when running _Anovos_ on the Azure Databricks platform,
+you need to add either the Storage account key or a SAS token to the Spark cluster configuration.
+
+The following command adds the Storage account key to the Spark cluster configuration:
+
+```
+spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-key>")
+```
+
+Here,
+- `<storage-account-name>` is the name of your Azure Blob Storage account
+- `<storage-account-key>` is the value of the Storage account key (note that hard-coding the key is bad
+  practice; see the secret scope sketch at the end of this page)
+
+You can also access the contents of a storage account using a SAS token.
+
+The following commands add the generated SAS token to the Spark cluster configuration:
+
+```
+spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")
+spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
+spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<sas-token>")
+```
+
+To learn more about accessing Azure Blob Storage containers using the `abfss` protocol, please refer to
+[the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
+
+### Step 3.4: Update the input and output paths in the _Anovos_ workflow configuration
+
+The input and output paths need to be prefixed with the following value:
+
+```
+abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/
+```
+
+Here,
+- `<container-name>` is the name of the Azure Blob Storage container
+- `<storage-account-name>` is the name of your Azure Blob Storage account
+
+The example configuration file we use in this tutorial can be found at `config/configs_income_azure_blob_mount.yaml`
+in the [_Anovos_ GitHub repository](https://github.com/anovos/anovos).
+It will need to be updated to reflect the `abfss://` URL of the Azure Blob Storage container set up above.
+
+In order for _Anovos_ to be able to find the input data and write the output to the correct location,
+update all paths to contain the `abfss://` prefix:
+
+```yaml
+file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/..."
+```
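+
+Before editing the configuration file, you can verify from a notebook cell that the cluster can actually
+reach the container with the credentials set in [Step 3.3](#step-33-add-the-secret-to-the-spark-configuration).
+The following is a minimal sketch; `<container-name>` and `<storage-account-name>` are placeholders you
+need to substitute with your own values:
+
+```python
+# Root of the Azure Blob Storage container; replace the placeholders
+# with your container and storage account names.
+path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/"
+
+# Listing the container fails fast if the account key or SAS token
+# configured in Step 3.3 is missing or invalid.
+display(dbutils.fs.ls(path))
+```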
+
+🤓 _Example:_
+
+```yaml
+  read_dataset:
+    file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/income_dataset/csv/"
+    file_type: csv
+```
+
+Here, the URL `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/` points to the
+Storage container and account, and the input dataset is stored in a folder called `income_dataset/csv`
+within the Azure Blob Storage container.
+
+To learn more about the _Anovos_ workflow configuration file and specifying paths for input and output data,
+have a look at the [Configuring Workloads](../config_file.md) page.
+
+### Step 3.5: Copy the updated configuration file to Databricks DBFS
+
+Once you have updated the configuration file, copy it to Azure Databricks using the same command that was used
+in [Step 1.2](#step-12-prepare-and-copy-the-workflow-configuration-and-data-to-dbfs).
+
+You can now configure the `file_path` to point to that location.
+
+### Remaining Steps
+
+The remaining steps are the same as above, so please continue with
+[Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
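+
+### A note on secrets
+
+The snippet in [Step 3.3](#step-33-add-the-secret-to-the-spark-configuration) places the Storage account
+key directly in the Spark configuration, which that step itself flags as bad practice. On Databricks, a
+common alternative is to read the key from a secret scope. The following is a minimal sketch; the scope
+name `anovos` and the secret name `storage-account-key` are hypothetical and need to be created beforehand:
+
+```python
+# Retrieve the Storage account key from a Databricks secret scope instead of
+# hard-coding it. The scope "anovos" and key "storage-account-key" are
+# hypothetical names; create them beforehand via the Databricks CLI or API.
+storage_account_key = dbutils.secrets.get(scope="anovos", key="storage-account-key")
+
+# Replace <storage-account-name> with the name of your storage account.
+spark.conf.set(
+    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
+    storage_account_key,
+)
+```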