This is a step-by-step guide to creating an ephemeral Spark cluster on Kubernetes via the Spark Operator for R&D purposes.
This demo runs a simple Spark app once; the app calculates pi and prints the result to STDOUT.
- Install the latest versions of kubectl and helm:
- Install the kubectl plugin called "gke-gcloud-auth-plugin":
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
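Before moving on, it can help to confirm that the tools installed above are actually on PATH. The following is a minimal sketch; the helper name require_tools is hypothetical and not part of any of the tools involved.

```shell
# Hypothetical helper: report every required CLI that is missing from PATH.
# Returns 0 when all tools are found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not executed here):
#   require_tools gcloud kubectl helm gke-gcloud-auth-plugin
```

Checking all tools in one pass, rather than stopping at the first miss, gives a complete list of what still needs installing.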
gcloud config set project dev-k8s-playground
gcloud services enable container
gcloud container clusters create crowdstrike-k8s-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-2
NB: using a machine type that is too small for the master node may leave pods stuck in the Pending state indefinitely
Connect kubectl to the cluster
gcloud container clusters get-credentials crowdstrike-k8s-cluster --zone us-central1-a --project dev-k8s-playground
kubectl version
helm version
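Once kubectl is connected, one way to confirm that the nodes requested at cluster creation are usable is to count the nodes whose status is exactly Ready. This is only a sketch; the helper name count_ready_nodes is hypothetical, and it relies on the default column layout of kubectl get nodes (name first, status second).

```shell
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Nodes reported as e.g. "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example (not executed here): expect 3 for the cluster created above
#   [ "$(count_ready_nodes)" -eq 3 ] && echo "all nodes ready"
```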
helm repo add spark-operator
helm install spo-release spark-operator/spark-operator --namespace spark-ns --create-namespace --set webhook.enable=true
NB: the service account spo-release-spark will be created automatically; the exact name can be found via kubectl get serviceaccount -n spark-ns
kubectl apply -f spark-pi.yaml
kubectl get pods -n spark-ns
kubectl logs spark-pi-driver -n spark-ns
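The spark-pi.yaml applied above is not reproduced in this guide. A minimal sketch of what such a manifest might look like is given below. The application name, namespace, service account and the argument 30 come from this guide; the image tag, the examples jar path, the Spark version, the restart policy, the executor count and the resource sizes are assumptions and should be adjusted to the image actually used.

```shell
# Writes a hypothetical spark-pi.yaml into the current directory.
# Values marked "assumed" are illustrative, not taken from the original manifest.
cat > spark-pi.yaml <<'EOF'
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-ns
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.5.0          # assumed image tag
  mainClass: org.apache.spark.examples.SparkPi
  # assumed location of the examples jar inside the image
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  arguments:
    - "30"                           # number of slices
  sparkVersion: "3.5.0"              # assumed
  restartPolicy:
    type: Never                      # run once
  driver:
    cores: 1                         # assumed
    memory: 512m                     # assumed
    serviceAccount: spo-release-spark
  executor:
    instances: 2                     # assumed
    cores: 1                         # assumed
    memory: 512m                     # assumed
EOF
```

With restartPolicy set to Never the application runs a single time, which matches the "runs once" behaviour described at the top of this guide.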
Other details:
kubectl get sparkapplications spark-pi -o=yaml -n spark-ns
kubectl describe sparkapplication spark-pi -n spark-ns
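Instead of re-running the commands above by hand, the application state can be polled until it finishes. The sketch below assumes the operator reports the state under .status.applicationState.state with the values COMPLETED and FAILED; the helper name wait_for_spark_app and its defaults (60 tries, 5 seconds apart) are hypothetical.

```shell
# Hypothetical helper: poll a SparkApplication until it completes or fails.
# Usage: wait_for_spark_app NAME NAMESPACE [MAX_TRIES] [SLEEP_SECONDS]
# Returns 0 on COMPLETED, 1 on FAILED, 2 on timeout.
wait_for_spark_app() {
  app=$1; ns=$2; tries=${3:-60}; pause=${4:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$(kubectl get sparkapplication "$app" -n "$ns" \
      -o jsonpath='{.status.applicationState.state}' 2>/dev/null)
    case "$state" in
      COMPLETED) echo "$app completed"; return 0 ;;
      FAILED)    echo "$app failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timed out waiting for $app (last state: ${state:-unknown})" >&2
  return 2
}

# Example (not executed here):
#   wait_for_spark_app spark-pi spark-ns && kubectl logs spark-pi-driver -n spark-ns
```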
To see the Spark UI, you will probably need to increase the number of slices so that the app runs longer (spec.arguments 30 -> 10000)
kubectl port-forward spark-pi-driver -n spark-ns 4040:4040
Go to localhost:4040 to see Spark UI
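The change of spec.arguments from 30 to 10000 can be scripted rather than edited by hand. The sketch below assumes the argument appears in the manifest as a quoted YAML list item (a line such as - "30"); the helper name set_slices and the output file name are hypothetical.

```shell
# Hypothetical helper: copy a manifest, replacing one quoted list-item value.
# Usage: set_slices SRC DST OLD NEW
# Only lines of the form   - "OLD"   are rewritten; everything else is copied as-is.
set_slices() {
  src=$1; dst=$2; old=$3; new=$4
  sed "s/^\([[:space:]]*-[[:space:]]*\)\"$old\"/\1\"$new\"/" "$src" > "$dst"
}

# Example (not executed here):
#   set_slices spark-pi.yaml spark-pi-long.yaml 30 10000
#   kubectl apply -f spark-pi-long.yaml
```

Writing to a separate file keeps the original short-running manifest available for quick smoke tests.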
Stop and delete all resources
gcloud container clusters delete crowdstrike-k8s-cluster --zone us-central1-a
The spark-operator in the following image does not work:
Using OSS apache image:
Preliminary steps to prepare the GCS bucket and SA:
gsutil mb -c nearline gs://spark-history-server-logs-test
export ACCOUNT_NAME=sparkonk8s
export GCP_PROJECT_ID=dev-k8s-playground
gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://spark-history-server-logs-test
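User-managed service account emails have the form NAME@PROJECT_ID.iam.gserviceaccount.com, and the IAM commands in this step each need that full address. A small sketch for building it once and reusing it follows; the helper name sa_email and the SA_EMAIL variable are hypothetical, and the fallback values are the ones exported earlier in this guide.

```shell
# Hypothetical helper: build a service account email from its name and project ID.
sa_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Fall back to the values used in this guide when the variables are not exported.
SA_EMAIL=$(sa_email "${ACCOUNT_NAME:-sparkonk8s}" "${GCP_PROJECT_ID:-dev-k8s-playground}")

# Example (not executed here):
#   gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "$SA_EMAIL"
```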
Then create a secret using the JSON key file:
kubectl -n spark-ns create secret generic history-secrets --from-file=sparkonk8s.json
Creating a test job:
kubectl apply -f spark-pi-history-final.yaml
kubectl get pods -n spark-ns
kubectl logs spark-pi-driver -n spark-ns
All logs will be stored in GCS and available via the Spark History Server UI
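For event logs to reach the bucket, Spark needs spark.eventLog.enabled set to true and spark.eventLog.dir pointing at the bucket. The manifest spark-pi-history-final.yaml is not reproduced in this guide, so the sketch below only checks that a given manifest mentions both settings; the helper name check_event_log_conf is hypothetical, and the check is a plain text search rather than a YAML parse.

```shell
# Hypothetical pre-flight check: confirm a manifest mentions the Spark
# event-log settings and that spark.eventLog.dir references the given bucket.
# Usage: check_event_log_conf FILE BUCKET_URL
check_event_log_conf() {
  file=$1; bucket=$2
  if ! grep -q 'spark\.eventLog\.enabled' "$file"; then
    echo "spark.eventLog.enabled not found in $file" >&2
    return 1
  fi
  if ! grep 'spark\.eventLog\.dir' "$file" | grep -q -- "$bucket"; then
    echo "spark.eventLog.dir does not reference $bucket" >&2
    return 1
  fi
  return 0
}

# Example (not executed here):
#   check_event_log_conf spark-pi-history-final.yaml gs://spark-history-server-logs-test
```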
kubectl port-forward <spark-history-server-pod> -n spark-ns 18080:18080
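The pod name in the command above is a placeholder. One way to look it up is to filter the pod list by a substring of the name; the sketch below assumes the history server pod name contains "history", which depends on how it was deployed. The helper name find_pod is hypothetical.

```shell
# Hypothetical helper: print the first pod in a namespace whose name matches a pattern.
# Output has the form pod/<name>, which kubectl port-forward accepts directly.
# Usage: find_pod NAMESPACE PATTERN
find_pod() {
  ns=$1; pattern=$2
  kubectl get pods -n "$ns" -o name | grep -- "$pattern" | head -n 1
}

# Example (not executed here):
#   kubectl port-forward "$(find_pod spark-ns history)" -n spark-ns 18080:18080
```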