Users submitting Apache Spark applications on Kubernetes often struggle to get access to the Spark UI. If they are using the Spark Operator, they can set up the ingress through the Spark Application CRD. However, when using `spark-submit` or other methods, they need access to the Kubernetes cluster, which is not always trivial. This is the paper cut this project aims to solve.
Auto Register Spark UI for Apache Spark applications running on Kubernetes is a lightweight Kubernetes controller that automatically registers Spark UIs running on Kubernetes with a central proxy server. It simplifies the management of and access to Spark UIs by dynamically updating the proxy server with the Spark UI endpoints as users submit Spark jobs or run sessions from Jupyter notebooks.
- Automatic Registration: Automatically registers Spark UIs with a central proxy server.
- Dynamic Updates: Updates the proxy server with new Spark UI endpoints as they are created and removes them when the Spark application finishes running.
- Customizable: Easily configurable to fit different deployment environments.
The Auto Register Spark UI controller uses a Kubernetes informer to listen for newly created services within the cluster. It filters these services to identify those associated with Spark applications by checking for a predefined Spark application label selector, `spark-app-selector`. Once a matching service is detected, the controller dynamically creates an Ingress resource and adds a path to it that exposes the Spark UI endpoint. This Ingress is configured to route traffic to the Spark UI service, making it accessible through the central proxy server. The controller also removes the Ingress and/or the path when the Spark application completes, which keeps the environment clean and up to date.
The controller creates the path from either the name of the Spark application alone, or the namespace where the Spark application is submitted followed by the application name. The resulting URL has the format `http://ngnix-ip-or-dns-name/spark-application-name/` or `http://ngnix-ip-or-dns-name/spark-namespace/spark-application-name/`.
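The two URL formats above can be illustrated with a small shell sketch. The helper name `spark_ui_url` and the `spark-jobs` namespace are hypothetical and exist only for illustration; this is not the controller's own code, just a way to see which URL to expect for a given application.

```shell
# Hypothetical helper (not part of the controller): prints the URL at which
# the Spark UI is expected, following the two formats described above.
# Arguments: proxy host, namespace, application name, namespaced flag.
spark_ui_url() {
  host="$1"
  namespace="$2"
  app="$3"
  namespaced="${4:-true}"
  if [ "$namespaced" = "true" ]; then
    # Namespaced format: /spark-namespace/spark-application-name/
    printf 'http://%s/%s/%s/\n' "$host" "$namespace" "$app"
  else
    # Flat format: /spark-application-name/
    printf 'http://%s/%s/\n' "$host" "$app"
  fi
}

spark_ui_url ngnix-ip-or-dns-name spark-jobs pi true
# prints http://ngnix-ip-or-dns-name/spark-jobs/pi/
spark_ui_url ngnix-ip-or-dns-name spark-jobs pi false
# prints http://ngnix-ip-or-dns-name/pi/
```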
- `kubectl` command-line tool
- Docker (optional)
- Nginx Controller deployed in the Kubernetes cluster.
- Clone the repository:
git clone https://github.com/lmouhib/auto-register-spark-ui-k8s.git
cd auto-register-spark-ui-k8s
- Deploy the Spark UI Auto Register Service:
The deployment below uses an image built and hosted in the repository `lmouhib/auto-register-spark-ui-k8s`. You can also use the Dockerfile provided in this repository to build and deploy your own image.
kubectl apply -f auto_register_spark_ui_deployment.yaml
- Customizing the controller behavior:
The controller exposes several parameters, set as environment variables, that configure how the Spark UI path is constructed and how Spark services are detected.
  - `SPARK_LABEL_SERVICE_SELECTOR`: The label selector used to identify Spark services. Default is `"spark-app-selector"`.
  - `SPARK_NAMESPACE`: The Kubernetes namespace where the Spark jobs are running. By default, the controller listens to all namespaces.
  - `NAMESPACED_INGRESS_PATH`: Whether to use namespaced ingress paths. Default is `"true"`.
  - `INGRESS_NAME`: The name of the ingress resource to be created. Default is `"spark-ui-ingress"`.
  - `INGRESS_TYPE`: The ingress controller available in the cluster. Only two are supported: `nginx` or `traefik`.
  - `AUTHENTICATION_SETUP`: The name of the secret that holds the authentication file for `nginx` or `traefik`.
  - NGINX
    - For the controller to create and manage Ingress resources based on NGINX, set the environment variable `INGRESS_TYPE` to `nginx`.
  - Traefik
    - For the controller to create and manage Ingress resources based on Traefik, set the environment variable `INGRESS_TYPE` to `traefik`.
    - When using `traefik`, the controller creates a Traefik Middleware to strip the URL prefix and forward the request to the Spark driver. To let the controller manage the lifecycle of the middleware, uncomment the policy in the cluster role that allows managing the Middleware. Without it, the controller will fail.
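As a rough illustration of the parameters described above, the shell sketch below applies the documented defaults and rejects any `INGRESS_TYPE` other than the two supported values. The function name `validate_ingress_type` is hypothetical, and the controller's real implementation may resolve its configuration differently; only the variable names and defaults come from this document.

```shell
# Apply the documented defaults when a variable is unset or empty.
: "${SPARK_LABEL_SERVICE_SELECTOR:=spark-app-selector}"
: "${NAMESPACED_INGRESS_PATH:=true}"
: "${INGRESS_NAME:=spark-ui-ingress}"
# SPARK_NAMESPACE is left empty on purpose: empty means "all namespaces".
: "${SPARK_NAMESPACE:=}"

# Hypothetical check: only the two documented ingress types are accepted.
validate_ingress_type() {
  case "$1" in
    nginx|traefik) return 0 ;;
    *) echo "unsupported INGRESS_TYPE: $1" >&2; return 1 ;;
  esac
}

echo "selector=$SPARK_LABEL_SERVICE_SELECTOR ingress=$INGRESS_NAME namespaced=$NAMESPACED_INGRESS_PATH"
```

Running `validate_ingress_type "$INGRESS_TYPE"` before applying the deployment is one way to catch a mistyped value early.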
For the Spark UI to be exposed correctly, you need to configure the Spark application to use the nginx proxy. Below is an example with `spark-submit`. Note that `spark.ui.proxyRedirectUri` can be set as a default in the `spark-defaults.conf` file.
./bin/spark-submit \
--master k8s://https://x.x.x.x \
--deploy-mode cluster \
--name pi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=spark:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.ui.proxyRedirectUri=http://ngnix-ip-or-dns-name \
--conf spark.ui.proxyBase=/pi \
local:///opt/spark/examples/src/main/python/pi.py 1000000
Note: You can get the value for `proxyRedirectUri` by running the command `kubectl get svc ingress-nginx-controller -n ingress-nginx`. The command assumes that you deployed `nginx` in the `ingress-nginx` namespace and that the `nginx` service of type `LoadBalancer` is named `ingress-nginx-controller`. Its `EXTERNAL-IP` is the value to use as `proxyRedirectUri`.
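If you prefer to script the lookup described in the note, the sketch below reads the load balancer address with `kubectl`'s `jsonpath` output. The function name `nginx_proxy_redirect_uri` is hypothetical, and the default service and namespace names are the ones assumed in the note above. Some cloud providers report a hostname rather than an IP, so the sketch falls back to the `hostname` field.

```shell
# Hypothetical helper: prints a value usable as spark.ui.proxyRedirectUri.
# Arguments (optional): service name, namespace of the nginx controller.
nginx_proxy_redirect_uri() {
  svc="${1:-ingress-nginx-controller}"
  ns="${2:-ingress-nginx}"
  # Try the IP reported by the load balancer first.
  addr="$(kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  if [ -z "$addr" ]; then
    # Fall back to the hostname when no IP is reported.
    addr="$(kubectl get svc "$svc" -n "$ns" \
      -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
  fi
  # Fail if the load balancer has no address yet.
  [ -n "$addr" ] || return 1
  printf 'http://%s\n' "$addr"
}

# Example usage (requires access to the cluster):
#   --conf spark.ui.proxyRedirectUri="$(nginx_proxy_redirect_uri)"
```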
The Spark UI for the application submitted above would be available at `http://ngnix-ip-or-dns-name/pi/`.
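To avoid repeating `spark.ui.proxyRedirectUri` on every submission, it can be stored in `spark-defaults.conf`. The sketch below is one possible way to do that; the function name `set_default_proxy_redirect_uri` is hypothetical, and the path `$SPARK_HOME/conf/spark-defaults.conf` in the usage comment assumes a standard Spark layout. `spark.ui.proxyBase` stays on the command line because it differs per application.

```shell
# Hypothetical helper: appends spark.ui.proxyRedirectUri to a Spark
# defaults file unless the property is already present.
# Arguments: path to spark-defaults.conf, proxy URI.
set_default_proxy_redirect_uri() {
  conf_file="$1"
  uri="$2"
  # Avoid adding the property twice.
  if ! grep -q '^spark\.ui\.proxyRedirectUri' "$conf_file" 2>/dev/null; then
    printf 'spark.ui.proxyRedirectUri  %s\n' "$uri" >> "$conf_file"
  fi
}

# Example usage (assumes SPARK_HOME is set):
#   set_default_proxy_redirect_uri "$SPARK_HOME/conf/spark-defaults.conf" \
#     http://ngnix-ip-or-dns-name
```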
To set up authentication, follow these steps:
- Create a password file using `htpasswd`:
htpasswd -c auth <username>
You will be prompted to enter and confirm a password for the specified username. The command creates a file named `auth`.
- Create a Kubernetes secret:
Once the password file is created, you need to create a Kubernetes secret to store the authentication data. Use the following command to create the secret, ensuring the key is named auth:
kubectl create secret generic spark-ui-auth --from-file=auth=./auth
If you change the name of the secret, make sure to also change it in the RBAC as well as in the environment variable `AUTHENTICATION_SETUP` in the file `auto_register_spark_ui_deployment.yaml`.
- Deploy the controller. When the controller creates a new Ingress, it adds the necessary annotations and, for Traefik, creates the middleware so that the Ingress uses the secret containing the usernames and passwords you created in steps 1 and 2.
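Once the steps above are done, one way to check that authentication is active is to compare the HTTP status with and without credentials. The sketch below uses `curl`; the function name `spark_ui_http_status` is hypothetical, and the expectation of a `401` without credentials is an assumption based on how HTTP basic authentication usually behaves, not something this project documents.

```shell
# Hypothetical helper: prints the HTTP status code returned for a URL.
# Arguments: URL, optional credentials in the form user:password.
spark_ui_http_status() {
  url="$1"
  credentials="${2:-}"
  if [ -n "$credentials" ]; then
    curl -s -o /dev/null -w '%{http_code}' -u "$credentials" "$url"
  else
    curl -s -o /dev/null -w '%{http_code}' "$url"
  fi
}

# Example usage (requires a running Spark application behind the proxy):
#   spark_ui_http_status http://ngnix-ip-or-dns-name/pi/
#   spark_ui_http_status http://ngnix-ip-or-dns-name/pi/ '<username>:<password>'
```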
- Currently this controller works only with `NGINX`. One improvement is to make it generic so that it works with any ingress that supports URL rewrite or prefix stripping.
- Add unit tests.
- Provide build instructions for pushing the image to cloud providers' image repositories.
- Refactor the controller to use a ConfigMap instead of environment variables.