This document details instructions to install Delight on Google Cloud Dataproc.
It assumes that you have created an account and generated an access token on the Delight website.
There are two ways to run a Spark application on Dataproc:
- as a job on an existing cluster,
- as a Spark step in a workflow run on an existing or an ephemeral cluster (a so-called "managed" cluster).
We detail instructions for both cases below.
Follow these instructions to create a cluster and run a Spark job.
When configuring the job, add the following properties to the application (a full gcloud command is sketched after the table below):
spark.jars.repositories: https://oss.sonatype.org/content/repositories/snapshots
spark.jars.packages: co.datamechanics:delight_<replace-with-your-scala-version-2.11-or-2.12>:latest-SNAPSHOT
spark.extraListeners: co.datamechanics.delight.DelightListener
spark.delight.accessToken.secret: <replace-with-your-access-token>
Don't forget to replace the placeholders! The Scala version depends on the Dataproc distribution you're using:
| Dataproc version | Spark version | Scala version |
|---|---|---|
| preview | 3.0.1 | 2.12 |
| 1.5 | 2.4.7 | 2.12 |
| 1.4 | 2.4.7 | 2.11 |
| 1.3 | 2.3.4 | 2.11 |
Please refer to the official Dataproc documentation to learn more about releases.
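For reference, here is a minimal sketch of submitting the SparkPi example to an existing cluster with these properties set. The cluster name and region are placeholders of our own, and the example assumes a Dataproc preview or 1.5 cluster (Scala 2.12); adjust them to your setup and replace the access token.
# Sketch: submit the SparkPi example with the Delight properties enabled
# (cluster name and region below are assumptions, not values from this guide)
gcloud dataproc jobs submit spark \
  --cluster=<replace-with-your-cluster-name> \
  --region=us-west1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  --properties="spark.jars.repositories=https://oss.sonatype.org/content/repositories/snapshots,spark.jars.packages=co.datamechanics:delight_2.12:latest-SNAPSHOT,spark.extraListeners=co.datamechanics.delight.DelightListener,spark.delight.accessToken.secret=<replace-with-your-access-token>" \
  -- 1000
The same properties can equally be set through the Dataproc console when configuring the job.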
Spark applications can also be run as Spark steps in a workflow. In this section, we create an example workflow template with a Spark step and execute it.
More details are available in the Dataproc documentation on workflow templates. To enable Delight, we simply add a few Spark properties to the Spark step.
The script below shows how to do this for an example Spark Pi application running on an ephemeral cluster.
Don't forget to replace the access token placeholder!
TEMPLATE_NAME=delight-test-template
REGION=us-west1
CLUSTER_NAME=delight-test-cluster
PROPERTIES=(
"spark.jars.repositories=https://oss.sonatype.org/content/repositories/snapshots"
"spark.jars.packages=co.datamechanics:delight_2.12:latest-SNAPSHOT"
"spark.extraListeners=co.datamechanics.delight.DelightListener"
"spark.delight.accessToken.secret=<replace-with-your-access-token>"
)
function join { local IFS="$1"; shift; echo "$*"; }
gcloud dataproc workflow-templates create $TEMPLATE_NAME \
--region=$REGION
gcloud dataproc workflow-templates set-managed-cluster $TEMPLATE_NAME \
--region=$REGION \
--master-machine-type=n1-standard-2 \
--worker-machine-type=n1-standard-2 \
--num-workers=2 \
--cluster-name=$CLUSTER_NAME \
--image-version preview
gcloud dataproc workflow-templates add-job spark \
--workflow-template=$TEMPLATE_NAME \
--region=$REGION \
--step-id=spark-step \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
--properties=$(join "," ${PROPERTIES[@]}) \
-- 1000
gcloud dataproc workflow-templates instantiate $TEMPLATE_NAME \
--region=$REGION
Then open the Dataproc console to watch the cluster being created and the job running.
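If you prefer the command line, the same progress can be followed from there. This is only a sketch using standard gcloud commands: the instantiated workflow should show up as an operation, and the job should appear once the managed cluster is up.
# List Dataproc operations (the instantiated workflow should appear here)
gcloud dataproc operations list --region=$REGION
# List jobs once the managed cluster has been created
gcloud dataproc jobs list --region=$REGION --cluster=$CLUSTER_NAME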
Note that the example script uses the Dataproc version preview (the --image-version parameter of set-managed-cluster) and sets the Scala version accordingly to 2.12 in co.datamechanics:delight_2.12:latest-SNAPSHOT.
If you use another Dataproc version, you will have to adjust the Scala version:
| Dataproc version | Spark version | Scala version |
|---|---|---|
| preview | 3.0.1 | 2.12 |
| 1.5 | 2.4.7 | 2.12 |
| 1.4 | 2.4.7 | 2.11 |
| 1.3 | 2.3.4 | 2.11 |
Please refer to the official Dataproc documentation to learn more about releases.
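As an illustration, only two lines of the script above would change for Dataproc 1.4 (Spark 2.4.7, Scala 2.11); this is just a sketch, and everything else stays the same:
# In set-managed-cluster: pick the 1.4 image instead of preview
--image-version 1.4
# In the PROPERTIES array: use the Scala 2.11 artifact
"spark.jars.packages=co.datamechanics:delight_2.11:latest-SNAPSHOT"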