Please note: This repo should only be used by Pachyderm-approved private preview users. Please contact joey@pachyderm.io to become a private preview user. This is a preview; do not use it in production!
How to set up AML with Pachyderm:
Email moira.chambers@pachyderm.io with your Azure Subscription ID and we'll arrange for custom datastores to be enabled on your account. This allows the rest of the instructions to work.
Clone this repo.
git clone https://github.com/pachyderm/aml
cd aml
Log into Azure:
az login
Choose the Azure region where you want to deploy all resources:
export TF_VAR_location="East US"
Note: if you're deploying with an existing AzureML workspace, the location above should match where your workspace is.
Optionally, specify the type of data you want to store:
- FileDataset (matches all files, for unstructured data), OR
- JSON Lines format (jsonl, matches *.jsonl and aggregates into a single table, in which case all JSON Lines files in a given Pachyderm repo must have compatible schemas), OR
- CSV format (delimited, matches *.csv and aggregates into a single table, in which case all CSV files in a given Pachyderm repo must have compatible schemas)
For example:
export TF_VAR_pachyderm_syncer_mode="files" # or "jsonl" or "delimited"
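As a rough illustration of the three modes, the sketch below maps each mode name to the file pattern described above and filters a list of paths with it. The `MODE_PATTERNS` table and the `select_files` helper are hypothetical and only mirror the description in this README (including the assumption that the "files" mode corresponds to FileDataset); the Syncer's actual matching logic may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from TF_VAR_pachyderm_syncer_mode values to the
# patterns described in this README (an illustration, not the Syncer's code).
MODE_PATTERNS = {
    "files": "*",          # all files, unstructured data
    "jsonl": "*.jsonl",    # JSON Lines files
    "delimited": "*.csv",  # CSV files
}


def select_files(mode, paths):
    """Return the paths that the given mode's pattern would match."""
    if mode not in MODE_PATTERNS:
        raise ValueError(f"unknown syncer mode: {mode!r}")
    pattern = MODE_PATTERNS[mode]
    return [p for p in paths if fnmatchcase(p, pattern)]


paths = ["/poker.jsonl", "/scores.csv", "/notes.txt"]
print(select_files("jsonl", paths))
```

With these assumptions, a repo that mixes file types would only contribute its `*.jsonl` files in "jsonl" mode, which is why the mode should match the data you plan to store.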
Now we'll deploy a Kubernetes cluster, install Pachyderm on it, then start the Syncer VM.
bash scripts/setup.sh
Note: if you get errors about exceeding quota, try a different region by configuring TF_VAR_location.
If you're attaching AzureML-Pachyderm to an existing AzureML workspace, specify both the resource group that contains the target workspace and the workspace name before running the setup script:
export TF_VAR_existing_resource_group_name="existing-resource-group"
export TF_VAR_existing_workspace_name="existing-workspace"
bash scripts/setup.sh
(You can also create a new AzureML workspace in an existing resource group by specifying only TF_VAR_existing_resource_group_name and not TF_VAR_existing_workspace_name.)
Option 3: Only create a new Syncer VM, and integrate an existing Azure ML workspace with an existing Pachyderm cluster
We will only create a new VM for the Syncer, and adopt existing Pachyderm and AML infrastructure. You will need to copy the Terraform code to a fresh new directory.
mkdir syncer1 # recommend naming this as syncer-$workspace_name
cp terraform/*.tf syncer1
cp -R terraform/out/ syncer1/out # copy kubeconfig, env.sh and helmvalues, which setup.sh depends on
Set up the appropriate environment variables.
export TERRAFORM_WD=syncer1
export TF_VAR_skip_pachyderm_deploy=1
export TF_VAR_existing_resource_group_name="existing-resource-group"
export TF_VAR_existing_workspace_name="existing-workspace"
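Before running the setup script, it can help to confirm that the variables above are actually exported in the current shell. The following is a minimal sketch of such a check; the `missing_vars` helper is hypothetical, and this README does not say whether setup.sh performs any validation of its own.

```python
import os

# Variables exported in the steps above for the Syncer-only setup.
REQUIRED_VARS = [
    "TERRAFORM_WD",
    "TF_VAR_skip_pachyderm_deploy",
    "TF_VAR_existing_resource_group_name",
    "TF_VAR_existing_workspace_name",
]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_vars(os.environ)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All Syncer-only variables are set.")
```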
Run the setup script and wait for the Syncer VM to come online.
bash scripts/setup.sh
Note: the Syncer VM is based on a Marketplace VM image built using Packer. For more information, see the VM image docs.
The default username on the Syncer VM is pachyderm. You can SSH in as this user to debug and quickly fix issues that might appear in the Syncer. You can also use the Syncer's built-in pachctl to operate Pachyderm.
# From the root aml repo directory
ssh pachyderm@$(cd terraform; terraform output -raw instance_ip)
Install a custom-built version of the azureml-dataprep-rslex library.
Note: this step will no longer be necessary after Microsoft releases an official library with the Pachyderm integration built-in.
From an AML notebook (create a new file in the "Notebooks" tab), connect to the compute instance you want to use with Pachyderm (creating one through the UI if necessary), and run:
!curl -sSL https://raw.githubusercontent.com/pachyderm/aml/main/scripts/install-rslex-custom.sh | bash
Restart the Python Kernel for your notebook after the installation completes, for the changes to take effect.
Note: you might see some errors related to incompatible package versions. You can safely ignore them.
- Install pachctl
We need to get the kubeconfig from Terraform so that we can authenticate against the remote K8s cluster.
From your aml repo, run:
(cd terraform; terraform output -raw kube_config) > kubeconfig
export KUBECONFIG=$(pwd)/kubeconfig
pachctl config import-kube aml -k $(cd terraform; terraform output -raw kube_context) --overwrite && pachctl config set active-context aml
pachctl version
You should see that your local pachctl is able to connect to your Pachyderm cluster.
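If you want to script this connectivity check, for example from a notebook, a minimal sketch follows. It assumes that a non-zero exit status from `pachctl version` signals a failed connection; the `cli_ok` helper is hypothetical and not part of this repo.

```python
import shutil
import subprocess


def cli_ok(binary="pachctl", args=("version",), timeout=30):
    """Return True if `binary` is on PATH and `binary *args` exits with 0."""
    if shutil.which(binary) is None:
        return False  # the CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            [binary, *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start the process, or it hung
    return result.returncode == 0
```

Calling `cli_ok()` with the default arguments runs `pachctl version` using whatever KUBECONFIG and active context are set in the environment.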
You can now insert data as described in the tutorial.
This tutorial uses structured JSON data, which requires configuring TF_VAR_pachyderm_syncer_mode="jsonl".
First, create a Pachyderm repo:
pachctl create repo poker
Next, add some data:
cat <<EOF | pachctl put file poker@master:/poker.jsonl
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
EOF
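If the records come from a program rather than a hand-written heredoc, the same JSON Lines content can be produced in Python. This is a small sketch: the `to_jsonl` helper is hypothetical, and the records are a subset of the sample data above.

```python
import json

# Two of the sample records from the tutorial data.
players = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "May", "wins": []},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps characters such as the card suits readable.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


with open("poker.jsonl", "w", encoding="utf-8") as f:
    f.write(to_jsonl(players))
```

The resulting poker.jsonl can then be uploaded with the same `pachctl put file` command, reading from the file instead of the heredoc.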
Then, go to the Datasets page in AML and observe that Pachyderm commits are automatically populated in AML as Dataset versions! For a specific dataset version, click Consume and copy and paste the code into an AML notebook. Run it, and note that the data is visible.
The Consume code should look something like:
from azureml.core import Workspace, Dataset
subscription_id = ''
resource_group = ''
workspace_name = ''
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo poker - jsonl')
dataset.to_pandas_dataframe()
Note: if you get errors, double check 1) the version of your azureml-dataprep libraries and make sure you followed Step 2 and 2) the data you stored is valid.
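To check the second point before uploading, a minimal validation sketch is shown below. It treats "compatible schemas" as "every record has the same top-level keys", which is a simplifying assumption made for this example and not a documented rule of the Syncer; the `check_jsonl` helper is hypothetical.

```python
import json


def check_jsonl(text):
    """Parse JSON Lines text and return (records, problems).

    An empty `problems` list means every non-blank line parsed as a JSON
    object and all objects share the same top-level keys.
    """
    records, problems = [], []
    expected_keys = None
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, including the trailing one
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        keys = set(record)
        if expected_keys is None:
            expected_keys = keys  # the first record defines the key set
        elif keys != expected_keys:
            problems.append(
                f"line {lineno}: keys {sorted(keys)} differ from {sorted(expected_keys)}"
            )
        records.append(record)
    return records, problems


sample = (
    '{"name": "May", "wins": []}\n'
    '{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}\n'
)
records, problems = check_jsonl(sample)
print(len(records), "records;", len(problems), "problems")
```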
Let's create a new version of the data:
cat <<EOF | pachctl put file poker@master:/poker.jsonl
{"name": "Albert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Joey", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "Luke", "wins": []}
{"name": "Alysha", "wins": [["three of a kind", "5♣"]]}
EOF
Now re-run the Consume code and confirm that the data is updated. For the a-ha moment, add version="1" to Dataset.get_by_name() to go back to the previous version, and confirm that you see the old version of the data. A-ha! Data versioning and reproducibility!
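The version pinning described above can be wrapped in a small helper so that the latest and a pinned version are loaded the same way. This is a sketch assuming the azureml-core SDK used by the Consume code; the `load_dataset` name is hypothetical, and `workspace` is the Workspace object built as in the Consume code.

```python
def load_dataset(workspace, name, version="latest"):
    """Fetch a registered dataset by name and return a pandas DataFrame.

    `version` is passed straight to Dataset.get_by_name; "latest" is the
    SDK default, and a specific version pins an older Pachyderm commit.
    """
    # Imported lazily so this helper can be defined where azureml-core
    # is not installed; the import only happens when it is called.
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(workspace, name=name, version=version)
    return dataset.to_pandas_dataframe()
```

For example, `load_dataset(workspace, "Pachyderm repo poker - jsonl")` returns the newest data, while passing `version="1"` as in the step above returns the first version.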