88 changes: 88 additions & 0 deletions docs/using-anovos/setting-up/on_azure_databricks.md
@@ -500,3 +500,91 @@
in [Step 2.2](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).

The remaining steps are the same as above, so please continue with
[Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)

## 3. Anovos on Azure Databricks using direct access to Azure Blob Storage containers

> **Collaborator:** Can we add some information why one would want to do that?


### Step 3.1: Installing/Downloading Anovos

This step is identical to
[Step 1.1: Installing _Anovos_ on Azure Databricks](#step-11-installing-anovos-on-azure-databricks).

### Step 3.2: Copy the dataset to an Azure Blob Storage container

This step is identical to
[Step 2.2: Copy the dataset to an Azure Blob Storage container](#step-22-copy-the-dataset-to-an-azure-blob-storage-container).

### Step 3.3: Add the secret to the Spark configuration

To access files in an Azure Blob Storage container for running _Anovos_ on the Azure Databricks platform,
you need to add either the storage account key or a SAS token to the Spark configuration.

Running the following command in a notebook cell attached to your cluster adds the storage account key to the Spark configuration:

```
spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-key>")
```

> **Collaborator:** Where do I use/add this line?

Here,
- `<storage-account-name>` is the name of your Azure Blob Storage account
- `<storage-account-key>` is the value of the storage account key. Avoid pasting the key directly into notebooks or configuration files; store it in a Databricks secret scope and read it at runtime instead, as shown in the sketch below.
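
A minimal sketch of reading the key from a secret scope, assuming you have created a scope named `anovos-secrets` containing the key under the name `storage-account-key` (both names are placeholders, e.g. created via the Databricks CLI):

```
# Read the storage account key from a Databricks secret scope
# (the scope and key names below are placeholders)
storage_account_key = dbutils.secrets.get(scope="anovos-secrets", key="storage-account-key")

# Make the key available to Spark so that abfss:// paths can be resolved
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    storage_account_key,
)
```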

You can also access the contents of a storage account using a SAS token.

The following commands add a previously generated SAS token to the Spark configuration:

```
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<sas-token>")
```
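
As with the account key, the SAS token is best kept out of notebooks and configuration files. A sketch that reads it from a hypothetical secret scope `anovos-secrets` under the key `sas-token` instead:

```
# Read the SAS token from a Databricks secret scope
# (the scope and key names below are placeholders)
sas_token = dbutils.secrets.get(scope="anovos-secrets", key="sas-token")

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", sas_token)
```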

To learn more about accessing Azure Blob Storage containers using the `abfss` protocol, please refer to
[the Azure Blob Storage documentation](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage).
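
To check that the credentials are picked up before launching a full _Anovos_ workflow, you can list the container contents from a notebook; a quick sanity check, with the placeholders filled in as above:

```
# List the root of the container to verify that the configured credentials work
display(dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/"))
```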

### Step 3.4: Update the input and output paths in the _Anovos_ workflow configuration

The input and output paths need to be prefixed with the following value:

```
abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/
```

Here,
- `<container-name>` is the name of the Blob Storage container that holds your data
- `<storage-account-name>` is the name of your Azure Blob Storage account (see the filled-in example below)
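
For illustration, with a hypothetical container `anovos-data` in a hypothetical storage account `anovosstorage`, the prefix would be:

```
abfss://anovos-data@anovosstorage.dfs.core.windows.net/
```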

> **Collaborator:** @mathiaspet Please provide information on how to do this properly. Don't mention a bad practice in the docs :)
>
> **Collaborator:** Also, I would assume that this is fine in certain situations and not others. Would be great to tell people when/why this is (not) recommended.


The example configuration file we use in this tutorial can be found at `config/configs_income_azure_blob_mount.yaml`
in the [_Anovos_ GitHub repository](https://github.com/anovos/anovos).
It will need to be updated to point to the Azure Blob Storage container configured above.

In order for _Anovos_ to be able to find the input data and write the output to the correct location,
update all paths to start with the `abfss://` URI of the container:

```yaml
file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/..."
```

🤓 _Example:_

```yaml
read_dataset:
file_path: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/income_dataset/csv/"
file_type: csv
```

Here, the URI `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/` points to the storage container and account,
and the input dataset is stored in a folder called `income_dataset/csv` within the Azure Blob Storage container.

To learn more about the _Anovos_ workflow configuration file and specifying paths for input and output data,
have a look at the [Configuring Workloads](../config_file.md) page.

### Step 3.5: Copy the updated configuration file to Databricks DBFS

Once you have updated the configuration file, copy it to Azure Databricks using the same command that was used
in [Step 1.2](#step-12-prepare-and-copy-the-workflow-configuration-and-data-to-dbfs).
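
After copying, you can confirm from a notebook that the configuration file landed where you expect; a small check, assuming an example DBFS path (use the location you chose in Step 1.2):

```
# Print the beginning of the uploaded configuration file to confirm it is in place
# (the DBFS path below is an example value)
print(dbutils.fs.head("dbfs:/FileStore/anovos/configs_income_azure_blob_mount.yaml"))
```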

You can now configure the `file_path` to point to that location.

> **Collaborator:** Where/how?


### Remaining Steps

The remaining steps are the same as above, so please continue with
[Step 1.4](#step-14-configure-and-launch-an-anovos-workflow-as-a-databricks-job)