Please Note: This solution has not been extensively tested. If you encounter any unexpected behavior, don't hesitate to open an issue.
An implementation of Horizontal Pod Autoscaling based on GPU metrics, using the following components:
- DCGM Exporter, which exports GPU metrics for each workload that uses GPUs. We selected the GPU utilization metric (`dcgm_gpu_utilization`) for this example.
- Prometheus, which collects the metrics coming from the DCGM Exporter and transforms them into a metric that can be used for autoscaling deployments.
- Prometheus Adapter, which redirects the autoscaling metric served by Prometheus to the k8s custom metrics API `custom.metrics.k8s.io` so that it can be used by the `HorizontalPodAutoscaler` controller.
The following steps detail how to configure autoscaling for a GPU workload:
To follow this walkthrough, you need the following:
- A Kubernetes cluster (version 1.6 or higher) with at least one Nvidia GPU (drivers properly installed and `k8s-device-plugin` deployed)
- `kubectl` and `helm` (version 3 or higher)
- `jq` and `curl` [optional]
We'll use this label when installing `dcgm-exporter` to select nodes that have Nvidia GPUs:
kubectl label nodes {node} accelerator=nvidia-gpu
Note: If you're running on GKE, EKS, or AKS and you have cluster autoscaling enabled, make sure the label is automatically attached to the node when it's created.
We opted not to use the `dcgm-exporter` chart and instead apply the `DaemonSet` and `Service` directly:
kubectl apply -f dcgm-exporter.yaml
Note: In `dcgm-exporter.yaml`, make sure the `hostPath` value in the `libnvidia` volume matches the `LD_LIBRARY_PATH` on your node(s). It may not necessarily be `/home/kubernetes/bin/nvidia/lib64/`.
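For reference, a minimal sketch of what `dcgm-exporter.yaml` can look like is shown below; the image tag, mount path, and port are assumptions, and the actual file in this repo may differ:

```yaml
# Minimal sketch of dcgm-exporter.yaml; image tag, mount path and port are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        accelerator: nvidia-gpu              # the label we attached above
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:1.7.2    # hypothetical tag, pin the one you use
        ports:
        - name: metrics
          containerPort: 9400
        volumeMounts:
        - name: libnvidia
          mountPath: /usr/local/nvidia/lib64
      volumes:
      - name: libnvidia
        hostPath:
          path: /home/kubernetes/bin/nvidia/lib64/   # must match LD_LIBRARY_PATH on your nodes
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
```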
Once `dcgm-exporter` is running, we can query its `/metrics` endpoint, for example for the GPU temperatures on one of the nodes:
kubectl port-forward svc/dcgm-exporter 9400:9400 # run this in a separate terminal
curl localhost:9400/metrics | grep dcgm_gpu_temp
We add the `prometheus-community` helm repository, which contains `kube-prometheus-stack` and `prometheus-adapter`:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
When installing Prometheus through the helm chart, we pass an `additionalScrapeConfigs` value which creates a job to scrape the metrics exposed by `dcgm-exporter` at `/metrics`:
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f kube-prometheus-stack-values.yaml
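The values file referenced above could contain a scrape config along these lines; the job name, scrape interval, and target address are assumptions:

```yaml
# Sketch of the relevant part of kube-prometheus-stack-values.yaml;
# job name, interval and target address are assumptions.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      scrape_interval: 10s
      metrics_path: /metrics
      static_configs:
      - targets: ['dcgm-exporter.default.svc:9400']
```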
We can now create a recording rule to periodically compute and expose the autoscaling metric (called `cuda_test_gpu_avg` in this example) for our deployment (called `cuda-test`):
kubectl apply -f cuda-test-prometheusrule.yaml
This metric is computed using a PromQL query where we average the `dcgm_gpu_utilization` values of all GPUs used by the replicas of our deployment:
avg(
max by(node, pod, namespace) (dcgm_gpu_utilization)
* on(pod) group_left(label_app)
max by(pod, label_app) (kube_pod_labels{label_app="cuda-test"})
)
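For illustration, `cuda-test-prometheusrule.yaml` could wrap this query in a `PrometheusRule` roughly as follows; the rule group name and the `release` label (used by the operator's `ruleSelector`) are assumptions:

```yaml
# Sketch of cuda-test-prometheusrule.yaml; group name and labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cuda-test
  labels:
    release: kube-prometheus-stack   # so the Prometheus operator picks the rule up
spec:
  groups:
  - name: cuda-test.rules
    rules:
    - record: cuda_test_gpu_avg
      expr: |
        avg(
          max by(node, pod, namespace) (dcgm_gpu_utilization)
          * on(pod) group_left(label_app)
          max by(pod, label_app) (kube_pod_labels{label_app="cuda-test"})
        )
```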
Once Prometheus is fully running, we can create our deployment (otherwise there wouldn't be any positive `dcgm_gpu_utilization` values to average). Our test workload is a loop of calls to the `vectorAdd` program often used to test that a cluster can successfully run CUDA containers.
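A sketch of what `cuda-test-deployment.yaml` might look like follows; the container image and command are assumptions and the actual manifest in this repo may differ:

```yaml
# Sketch of cuda-test-deployment.yaml; image and command are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-test
  template:
    metadata:
      labels:
        app: cuda-test                 # matched by kube_pod_labels{label_app="cuda-test"}
    spec:
      containers:
      - name: cuda-test
        image: nvidia/samples:vectoradd-cuda10.2   # hypothetical image
        command: ["/bin/sh", "-c", "while true; do ./vectorAdd; done"]
        resources:
          limits:
            nvidia.com/gpu: 1          # request one GPU per replica
```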
The custom metric should be available to query through Prometheus within 30 seconds:
kubectl apply -f cuda-test-deployment.yaml
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 # run this in a separate terminal
curl localhost:9090/api/v1/query?query=cuda_test_gpu_avg
When installing the adapter, we only need to point it to the Prometheus service:
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter --set prometheus.url="http://kube-prometheus-stack-prometheus.default.svc.cluster.local"
The custom metric should now be available in the `custom.metrics.k8s.io` API. We use `jq` to format the response:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep cuda_test_gpu_avg
We can now create our `HorizontalPodAutoscaler` resource:
kubectl apply -f cuda-test-hpa.yaml
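As a rough sketch (the manifest in this repo may differ, e.g. in the metric type, bounds, or target value), an HPA on the custom metric could look like this:

```yaml
# Sketch of cuda-test-hpa.yaml; metric type, bounds and target value are assumptions.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: cuda-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cuda-test
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: cuda_test_gpu_avg
      targetAverageValue: 4            # scale up when average GPU utilization exceeds 4%
```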
The `HorizontalPodAutoscaler` should add one or more extra replicas once the GPU utilization of the original replica reaches the target value of 4%. If the GPU utilization consistently stays below that, you can `kubectl exec` into the pod and manually double the workload to increase the value of `cuda_test_gpu_avg` by running:
for (( c=1; c<=5000; c++ )); do ./vectorAdd; done
Now we list the pods that belong to our deployment and, hopefully, see that a second replica has been added:
kubectl get pod | grep cuda-test
Naturally, if the usage drops low enough, a scale-down will occur.
Note: By default, the autoscaler might create more than one replica at a time (perhaps as many as specified in the `maxReplicas` value). This is because `dcgm-exporter` updates its metrics every 10 seconds and the pod may take a while to pull the image and start the container. This delay in starting the new replicas and capturing their effect on the metric causes the HPA to keep adding them. You can more or less control this behavior if you use `autoscaling/v2beta1`.