In the Azure Distributed Data Engineering Toolkit, a cluster is primarily designed to run Spark jobs. This document describes how to create a cluster to use for Spark jobs. Alternatively for getting started and debugging you can also use the cluster in interactive mode which will allow you to log into the master node and interact with the cluster from there.
Creating a Spark cluster only takes a few simple steps after which you will be able to SSH into the master node of the cluster and interact with Spark. You will be able to view the Spark Web UI, Spark Jobs UI, submit Spark jobs (with spark-submit), and even interact with Spark in a Jupyter notebook.
For the advanced user, please note that the default cluster settings are preconfigured in the .aztk/cluster.yaml file that is generated when you run aztk spark init
. More information on cluster config here.
Create a Spark cluster:
aztk spark cluster create --id <your_cluster_id> --vm-size <vm_size_name> --size <number_of_nodes>
For example, to create a cluster of 4 Standard_A2 nodes called 'spark' you can run:
aztk spark cluster create --id spark --vm-size standard_a2 --size 4
You can find more information on VM sizes here. Please note that you must use the official SKU name when setting your VM size - they usually come in the form: "standard_d2_v2".
Note: The cluster id (--id
) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters. Each cluster must have a unique cluster id.
By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
You can create your cluster with low-priority VMs at an 80% discount by using --size-low-pri
instead of --size
. Note that these are great for experimental use, but can be taken away at any time. We recommend against this option when doing long running jobs or for critical workloads.
You can create clusters with a mixed of low-priority and dedicated VMs to reach the optimal balance of price and availability. In Mixed Mode, your cluster will have both dedicated instances and low priority instances. To minimize the potential impact on your Spark workloads, the Spark master node will always be provisioned on one of the dedicated nodes while each of the low priority nodes will be Spark workers.
Please note, to use Mixed Mode clusters, you need to authenticate using Azure Active Directory (AAD) by configuring the Service Principal in .aztk/secrets.yaml
. You also need to create a Virtual Network (VNET), and provide the resource ID to a Subnet within the VNET in your ./aztk/cluster.yaml` configuration file.
By default, the Azure Distributed Data Engineering Toolkit will use Spark v2.2.0 and Python v3.5.4. However, you can set your Spark and/or Python versions by configuring the base Docker image used by this package.
You can list all clusters currently running in your account by running
aztk spark cluster list
To view details about a particular cluster run:
aztk spark cluster get --id <your_cluster_id>
Note that the cluster is not fully usable until a master node has been selected and it's state is idle
.
For example here cluster 'spark' has 2 nodes and node tvm-257509324_2-20170820t200959z
is the master and ready to run a job.
Cluster spark
------------------------------------------
State: active
Node Size: standard_a2
Nodes: 2
| Dedicated: 2
| Low priority: 0
Nodes | State | IP:Port | Master
------------------------------------|-----------------|----------------------|--------
tvm-257509324_1-20170820t200959z | idle | 40.83.254.90:50001 |
tvm-257509324_2-20170820t200959z | idle | 40.83.254.90:50000 | *
To delete a cluster run:
aztk spark cluster delete --id <your_cluster_id>
Deleting a cluster also permanently deletes any data or logs associated with that cluster. If you wish to persist this data, use the --keep-logs
flag.
To delete all clusters:
aztk spark cluster delete --id $(aztk spark cluster list -q)
Skip delete confirmation by using the --force
flag.
You are charged for the cluster as long as the nodes are provisioned in your account. Make sure to delete any clusters you are not using to avoid unwanted costs.
To run a command on all nodes in the cluster, run:
aztk spark cluster run --id <your_cluster_id> "<command>"
The command is executed through an SSH tunnel.
To run a command on all nodes in the cluster, run:
aztk spark cluster run --id <your_cluster_id> --node-id <your_node_id> "<command>"
This command is executed through a SSH tunnel.
To get the id of nodes in your cluster, run aztk spark cluster get --id <your_cluster_id>
.
To securely copy a file to all nodes, run:
aztk spark cluster copy --id <your_cluster_id> --source-path </path/to/local/file> --dest-path </path/on/node>
The file will be securely copied to each node using SFTP.
All other interaction with the cluster is done via SSH and SSH tunneling. If you didn't create a user during cluster create (aztk spark cluster create
), the first step is to add a user to each node in the cluster.
Make sure that the .aztk/secrets.yaml file has your SSH key (or path to the SSH key), and it will automatically use it to make the SSH connection.
aztk spark cluster add-user --id spark --username admin
Alternatively, you can add the SSH key as a parameter when running the add-user
command.
aztk spark cluster add-user --id spark --username admin --ssh-key <your_key_OR_path_to_key>
You can also use a password to create your user:
aztk spark cluster add-user --id spark --username admin --password <my_password>
Using a SSH key is the recommended method.
After a user has been created, SSH into the Spark container on the master node with:
aztk spark cluster ssh --id spark --username admin
If you would like to ssh into the host instead of the Spark container on it, run:
aztk spark cluster ssh --id spark --username admin --host
If you ssh into the host and wish to access the running Docker Spark environment, you can run the following:
sudo docker exec -it spark /bin/bash
Now that you're in, you can change directory to your familiar $SPARK_HOME
cd $SPARK_HOME
To view the SSH command being called, pass the --no-connect
flag:
aztk spark cluster ssh --id spark --no-connect
Note that an SSH tunnel and shell will be opened with the default SSH client if one is present. Otherwise, a pure python SSH tunnel is created to forward the necessary ports. The pure python SSH tunnel will not open a shell.
If your cluster is in an unknown or unusable state, you can debug by running:
aztk spark cluster debug --id <cluster-id> --output </path/to/output/directory/>
The debug utility will pull logs from all nodes in the cluster. The utility will check for:
- free diskspace
- docker image status
- docker container status
- docker container logs
- docker container process status
- aztk code & version
- spark component logs (master, worker, shuffle service, history server, etc) from $SPARK_HOME/logs
- spark application logs from $SPARK_HOME/work
Please be careful sharing the output of the debug
command as secrets and application code are present in the output.
Pass the --brief
flag to only download the most essential logs from each node:
aztk spark cluster debug --id <cluster-id> --output </path/to/output/directory/> --brief
This command will retrieve:
- stdout file from the node's startup
- stderr file from the node's startup
- the docker log for the spark container
By default, the aztk spark cluster ssh
command port forwards the Spark Web UI to localhost:8080, Spark Jobs UI to localhost:4040, and Spark History Server to your localhost:18080. This can be configured in .aztk/ssh.yaml.