From 6ada07d54a16f7d95ff0993b366328cd8520b1af Mon Sep 17 00:00:00 2001
From: mathiaspet
Date: Tue, 22 Nov 2022 08:34:13 +0100
Subject: [PATCH 1/3] feat: add documentation for direct access to Azure Storage accounts

---
 .../setting-up/on_azure_databricks.md         | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/docs/using-anovos/setting-up/on_azure_databricks.md b/docs/using-anovos/setting-up/on_azure_databricks.md
index 6c9a7341..2332a2f8 100644
--- a/docs/using-anovos/setting-up/on_azure_databricks.md
+++ b/docs/using-anovos/setting-up/on_azure_databricks.md
@@ -500,3 +500,80 @@ in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 
 The remaining steps are the same as above, so please continue with
 [Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
+
+## 3. Anovos on Azure Databricks Using an Azure Blob Storage Container using
+
+### Step 3.1: Installing/Downloading Anovos
+
+This step is identical to
+[Step 1.1: Installing _Anovos_ on Azure Databricks](#step-11-installing-anovos-on-azure-databricks).
+
+### Step 3.2: Copy the dataset to an Azure Blob Storage container
+
+This step is identical to
+[Step 2.2: Copy the dataset to an Azure Blob Storage container](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
+
+### Step 2.3: Mount an Azure Blob Storage Container as a DBFS path in Azure Databricks
+
+To access files in an Azure Blob Storage container for running _Anovos_ in Azure Databricks platform,
+you need to mount that container in the DBFS path.
+
+```spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>")```
+
+TODO: CHECKOUT IF SAS-TOKEN DOES WORK TOO
+Here,
+- `<storage-account-name>` is the name of your Azure Blob Storage account
+- `<container-name>` is the name of a container in your Azure Blob Storage account
+- `<storage-account-access-key>` is the value of the storage account key (TODO: this is bad practise and should be solved with a secret)
+- `<sas-token>` is the SAS token for that storage account
+
+
+To learn more about accessing Azure Blob Storage containers using the abfss protocoll, please refer to
+[the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
+
+💡 _Note that you only need to mount the container once._
+ _The container will remain mounted at the given mount point._
+ _To unmount a container, you can run `dbutils.fs.unmount("/mnt/<mount-point>")` in an Azure Databricks notebook._
+
+### Step 2.4: Update the workflow configuration for all input and output paths according to the DBFS mount point
+
+Once mounting is completed, the data is present in DBFS at the path specified as the mount point.
+All operations performed by _Anovos_ when running a workflow will result in changes in the data stored in the
+Azure Blob Storage container.
+
+The example configuration file we use in this tutorial can be found at `config/configs_income_azure_blob_mount.yaml`
+in the _Anovos_ repository.
+It will need to be updated to reflect the path of the mount point set above.
+
+In order for _Anovos_ to be able to find the input data and write the output to the correct location,
+update all paths to contain the path of the mount point:
+
+```yaml
+file_path: "dbfs:/mnt/<mount-point>/..."
+```
+
+🤓 _Example:_
+
+```yaml
+  read_dataset:
+    file_path: "dbfs:/mnt/anovos1/income_dataset/csv/"
+    file_type: csv
+```
+
+Here, the mount points is `dbfs:/mnt/anovos1` and the input dataset is stored in a folder called `income_dataset/csv`
+within the Azure Blob Storage container.
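+
+💡 _The mount point referenced above (e.g. `/mnt/anovos1`) has to be created once before the workflow can use it._
+ _One common way to create it from a Databricks notebook is sketched below; the `wasbs` driver, the secret scope `anovos`,_
+ _the secret key `storage-account-key`, and the bracketed names are placeholder assumptions, not fixed requirements:_
+
+```python
+# Minimal sketch: mount an Azure Blob Storage container so that it becomes
+# reachable under dbfs:/mnt/anovos1. Run once in a Databricks notebook, where
+# `dbutils` is predefined. Replace all placeholder names with your own values.
+dbutils.fs.mount(
+    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
+    mount_point="/mnt/anovos1",
+    extra_configs={
+        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
+            dbutils.secrets.get(scope="anovos", key="storage-account-key")
+    },
+)
+```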
+
+To learn more about the _Anovos_ workflow configuration file and specifying paths for input and output data,
+have a look at the [Configuring Workloads](../config_file.md) page.
+
+### Step 2.5: Copy the updated configuration file from the local machine to the Azure Blob Storage container
+
+Once you have updated the configuration file, copy it to Azure Databricks using the same command that was used
+in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
+
+### Remaining Steps
+
+The remaining steps are the same as above, so please continue with
+[Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
+

From 17381f5ad70883b15a961505f1b5ec3e74cebcf9 Mon Sep 17 00:00:00 2001
From: mathiaspet
Date: Tue, 22 Nov 2022 08:43:21 +0100
Subject: [PATCH 2/3] feat: removed secret related text

---
 .../setting-up/on_azure_databricks.md         | 42 ++++++++++---------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/docs/using-anovos/setting-up/on_azure_databricks.md b/docs/using-anovos/setting-up/on_azure_databricks.md
index 2332a2f8..49e93224 100644
--- a/docs/using-anovos/setting-up/on_azure_databricks.md
+++ b/docs/using-anovos/setting-up/on_azure_databricks.md
@@ -501,7 +501,7 @@ in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 The remaining steps are the same as above, so please continue with
 [Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
 
-## 3. Anovos on Azure Databricks Using an Azure Blob Storage Container using
+## 3. Anovos on Azure Databricks using direct access to Azure Blob Storage Container
 
 ### Step 3.1: Installing/Downloading Anovos
 
@@ -513,33 +513,36 @@ This step is identical to
 [Step 2.2: Copy the dataset to an Azure Blob Storage container](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 
-### Step 2.3: Mount an Azure Blob Storage Container as a DBFS path in Azure Databricks
+### Step 3.3: Add the secret to the spark configuration
 
 To access files in an Azure Blob Storage container for running _Anovos_ in Azure Databricks platform,
-you need to mount that container in the DBFS path.
+you need to either add the storage account key or an SAS token to the spark cluster config.
+The following command adds the storage account key to the spark config:
 
 ```spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>")```
 
-TODO: CHECKOUT IF SAS-TOKEN DOES WORK TOO
 Here,
 - `<storage-account-name>` is the name of your Azure Blob Storage account
-- `<container-name>` is the name of a container in your Azure Blob Storage account
 - `<storage-account-access-key>` is the value of the storage account key (TODO: this is bad practise and should be solved with a secret)
-- `<sas-token>` is the SAS token for that storage account
 
+
+You can access the contents of a storage account using an SAS token as well. The following commands add the generated SAS token to the spark cluster config:
+```spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")```
+```spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")```
+```spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<sas-token>")```
 
 To learn more about accessing Azure Blob Storage containers using the abfss protocoll, please refer to
 [the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
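+
+💡 _Instead of pasting the storage account key into the cluster configuration (see the TODO above), it can be kept in a_
+ _Databricks secret and read at runtime. The snippet below is only a minimal sketch of that approach; the secret scope_
+ _`anovos` and the key name `storage-account-key` are hypothetical placeholders:_
+
+```python
+# Minimal sketch: read the storage account key from a Databricks secret scope
+# and hand it to the Spark configuration instead of hard-coding it.
+# Run in a Databricks notebook, where `spark` and `dbutils` are predefined.
+storage_account = "<storage-account-name>"  # placeholder
+account_key = dbutils.secrets.get(scope="anovos", key="storage-account-key")  # hypothetical scope/key
+spark.conf.set(f"fs.azure.account.key.{storage_account}.dfs.core.windows.net", account_key)
+```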
-💡 _Note that you only need to mount the container once._
- _The container will remain mounted at the given mount point._
- _To unmount a container, you can run `dbutils.fs.unmount("/mnt/<mount-point>")` in an Azure Databricks notebook._
 
-### Step 2.4: Update the workflow configuration for all input and output paths according to the DBFS mount point
+### Step 3.4: Update the workflow configuration for all input and output paths according to the DBFS mount point
 
-Once mounting is completed, the data is present in DBFS at the path specified as the mount point.
-All operations performed by _Anovos_ when running a workflow will result in changes in the data stored in the
-Azure Blob Storage container.
+The input and output paths need to be prefixed with the following value:
+
+```abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/```
+
+Here,
+- `<storage-account-name>` is the name of your Azure Blob Storage account
+- `<container-name>` is the value of the storage account key (TODO: this is bad practise and should be solved with a secret)
 
 The example configuration file we use in this tutorial can be found at `config/configs_income_azure_blob_mount.yaml`
 in the _Anovos_ repository.
@@ -549,27 +552,28 @@ In order for _Anovos_ to be able to find the input data and write the output to
 update all paths to contain the path of the mount point:
 
 ```yaml
-file_path: "dbfs:/mnt/<mount-point>/..."
+file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/..."
 ```
 
 🤓 _Example:_
 
 ```yaml
   read_dataset:
-    file_path: "dbfs:/mnt/anovos1/income_dataset/csv/"
+    file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/income_dataset/csv/"
     file_type: csv
 ```
 
-Here, the mount points is `dbfs:/mnt/anovos1` and the input dataset is stored in a folder called `income_dataset/csv`
-within the Azure Blob Storage container.
+Here, the URL points to the storage container and account `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/` and the input dataset is stored in a folder called `income_dataset/csv` within the Azure Blob Storage container.
 
 To learn more about the _Anovos_ workflow configuration file and specifying paths for input and output data,
 have a look at the [Configuring Workloads](../config_file.md) page.
 
-### Step 2.5: Copy the updated configuration file from the local machine to the Azure Blob Storage container
+### Step 3.5: Copy the updated configuration file to Databricks DBFS
 
 Once you have updated the configuration file, copy it to Azure Databricks using the same command that was used
-in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
+in [Step 1.2](#step-12-prepare-and-copy-the-workflow-configuration-and-data-to-dbfs).
+
+You can configure the file_path now to that location.
 
 ### Remaining Steps
 

From 4f53977b860b6adfca158114a7424303d408c44b Mon Sep 17 00:00:00 2001
From: Kilian Kluge <32523967+ionicsolutions@users.noreply.github.com>
Date: Tue, 22 Nov 2022 14:09:14 +0100
Subject: [PATCH 3/3] Update on_azure_databricks.md

---
 .../setting-up/on_azure_databricks.md         | 43 +++++++++++--------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/docs/using-anovos/setting-up/on_azure_databricks.md b/docs/using-anovos/setting-up/on_azure_databricks.md
index 49e93224..e5886884 100644
--- a/docs/using-anovos/setting-up/on_azure_databricks.md
+++ b/docs/using-anovos/setting-up/on_azure_databricks.md
@@ -501,7 +501,7 @@ in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 The remaining steps are the same as above, so please continue with
 [Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)
 
-## 3. Anovos on Azure Databricks using direct access to Azure Blob Storage Container
+## 3. Anovos on Azure Databricks using direct access to Azure Blob Storage containers
 
 ### Step 3.1: Installing/Downloading Anovos
 
@@ -513,28 +513,33 @@ This step is identical to
 [Step 2.2: Copy the dataset to an Azure Blob Storage container](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).
 
-### Step 3.3: Add the secret to the spark configuration
+### Step 3.3: Add the secret to the Spark configuration
 
-To access files in an Azure Blob Storage container for running _Anovos_ in Azure Databricks platform,
-you need to either add the storage account key or an SAS token to the spark cluster config.
-The following command adds the storage account key to the spark config:
+To access files in an Azure Blob Storage container for running _Anovos_ on the Azure Databricks platform,
+you need to either add the Storage account key or an SAS token to the Spark cluster configuration.
+
+The following command adds the Storage account key to the Spark cluster configuration:
 
 ```spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>")```
 
 Here,
 - `<storage-account-name>` is the name of your Azure Blob Storage account
-- `<storage-account-access-key>` is the value of the storage account key (TODO: this is bad practise and should be solved with a secret)
+- `<storage-account-access-key>` is the value of the Storage account key (note that putting the key here in plain text is bad practice; prefer reading it from a Databricks secret, e.g. via `dbutils.secrets.get`)
 
-You can access the contents of a storage account using an SAS token as well. The following commands add the generated SAS token to the spark cluster config:
-```spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")```
-```spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")```
-```spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<sas-token>")```
+You can access the contents of a storage account using an SAS token as well.
+
+The following commands add the generated SAS token to the Spark cluster configuration:
+```
+spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")
+spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
+spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<sas-token>")
+```
 
-To learn more about accessing Azure Blob Storage containers using the abfss protocoll, please refer to
-[the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
+To learn more about accessing Azure Blob Storage containers using the `abfss` protocol, please refer to
+[the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
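+
+💡 _Before launching an Anovos job, it can be worth checking that the configuration above actually grants access to the_
+ _container. A minimal sketch of such a check (the bracketed names are placeholders) is:_
+
+```python
+# Minimal sketch: list the container root to confirm that the account key or
+# SAS token configured above works. Run in a Databricks notebook, where
+# `dbutils` is predefined; replace the bracketed names with your own values.
+for entry in dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/"):
+    print(entry.path)
+```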
 
-### Step 3.4: Update the workflow configuration for all input and output paths according to the DBFS mount point
+### Step 3.4: Update the input and output paths in the _Anovos_ workflow configuration
 
 The input and output paths need to be prefixed with the following value:
 
 ```abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/```
@@ -542,11 +547,11 @@ The input and output paths need to be prefixed with the following value:
 
 Here,
 - `<storage-account-name>` is the name of your Azure Blob Storage account
-- `<container-name>` is the value of the storage account key (TODO: this is bad practise and should be solved with a secret)
+- `<container-name>` is the name of a container in that Storage account
 
 The example configuration file we use in this tutorial can be found at `config/configs_income_azure_blob_mount.yaml`
-in the _Anovos_ repository.
-It will need to be updated to reflect the path of the mount point set above.
+in the [_Anovos_ GitHub repository](https://github.com/anovos/anovos).
+It will need to be updated to reflect the path of the Azure Blob Storage container's mount point set above.
 
 In order for _Anovos_ to be able to find the input data and write the output to the correct location,
 update all paths to contain the path of the mount point:
 
 ```yaml
 file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/..."
 ```
 
 🤓 _Example:_
 
 ```yaml
   read_dataset:
@@ -563,7 +568,9 @@ file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net
     file_type: csv
 ```
 
-Here, the URL points to the storage container and account `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/` and the input dataset is stored in a folder called `income_dataset/csv` within the Azure Blob Storage container.
+Here, the URL points to the Storage container and account
+`abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/`
+and the input dataset is stored in a folder called `income_dataset/csv` within the Azure Blob Storage container.
 
 To learn more about the _Anovos_ workflow configuration file and specifying paths for input and output data,
 have a look at the [Configuring Workloads](../config_file.md) page.
 
@@ -573,7 +580,7 @@ have a look at the [Configuring Workloads](../config_file.md) page.
 
 ### Step 3.5: Copy the updated configuration file to Databricks DBFS
 
 Once you have updated the configuration file, copy it to Azure Databricks using the same command that was used
 in [Step 1.2](#step-12-prepare-and-copy-the-workflow-configuration-and-data-to-dbfs).
 
-You can configure the file_path now to that location.
+You can now configure the `file_path` to point to that location.
 
 ### Remaining Steps