We use the Kubernetes API to retrieve all the pods and iterate over them to check their statuses. When a pod has failed, or has been pending for more than X minutes (the "X" minutes threshold is an option passed through an environment variable), we increment our crashed-pods metric for that namespace. The counts are later sent to CloudWatch as a custom metric, where the information can be analysed further and alerts can be created on top of it.
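A minimal sketch of that loop, assuming the official Python Kubernetes client and boto3; the metric name "CrashedPods", the CloudWatch namespace "KubePodMetrics" and the dimensions below are illustrative, not necessarily what the collector actually uses:

from collections import Counter
from datetime import datetime, timezone
import os

import boto3
from kubernetes import client, config

PENDING_MINS_TO_BE_CRASHED = int(os.getenv("PENDING_MINS_TO_BE_CRASHED", "1"))  # the "X" minutes option
IGNORE_NAMESPACES = [ns for ns in os.getenv("IGNORE_NAMESPACES", "").split(",") if ns]

def collect_crashed_pods():
    # list every pod in the cluster and count the "crashed" ones per namespace
    v1 = client.CoreV1Api()
    crashed_by_namespace = Counter()
    now = datetime.now(timezone.utc)
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        if pod.metadata.namespace in IGNORE_NAMESPACES:
            continue
        pending_too_long = (
            pod.status.phase == "Pending"
            and (now - pod.metadata.creation_timestamp).total_seconds() > PENDING_MINS_TO_BE_CRASHED * 60
        )
        if pod.status.phase == "Failed" or pending_too_long:
            crashed_by_namespace[pod.metadata.namespace] += 1
    return crashed_by_namespace

def send_to_cloudwatch(crashed_by_namespace, cluster_name):
    # push one custom metric per namespace, dimensioned by cluster and namespace
    metric_data = [
        {
            "MetricName": "CrashedPods",  # illustrative metric name
            "Dimensions": [
                {"Name": "ClusterName", "Value": cluster_name},
                {"Name": "Namespace", "Value": namespace},
            ],
            "Value": count,
            "Unit": "Count",
        }
        for namespace, count in crashed_by_namespace.items()
    ]
    if metric_data:
        boto3.client("cloudwatch").put_metric_data(Namespace="KubePodMetrics", MetricData=metric_data)

if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() when running as a pod (see below)
    send_to_cloudwatch(collect_crashed_pods(), cluster_name=os.getenv("CLUSTER_NAME", "unknown"))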
- When running as a pod inside a Kubernetes cluster, a service account is used; it provides a bearer token to call the Kubernetes API transparently.
- When running as a standalone container, a kubeconfig file must be passed in order to interact with a cluster (both modes are sketched after this list).
- The Kubernetes API must be reachable from wherever the pod/container is running.
- To send metrics to CloudWatch, a user with the proper credentials is required; instructions on how to create this user are here: Create CloudWatch User
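A minimal sketch of the two access modes from the first two items, assuming the official Python Kubernetes client; RUNNING_IN_KUBERNETES, KUBECONFIG and KUBECONTEXT mirror the environment variables used in the commands further down:

import os
from kubernetes import config

# Inside a cluster, the service account token and CA certificate are mounted
# automatically under /var/run/secrets/kubernetes.io/serviceaccount and picked
# up by load_incluster_config(); outside, a kubeconfig file and context are used.
if os.getenv("RUNNING_IN_KUBERNETES", "0") == "1":
    config.load_incluster_config()
else:
    config.load_kube_config(
        config_file=os.getenv("KUBECONFIG"),
        context=os.getenv("KUBECONTEXT"),
    )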
Running as a container
# build image
docker image build -t docker.io/juliocesarmidia/kube-pod-metrics-collector:v1.0.0 ./src
# or pull from docker hub
docker image pull docker.io/juliocesarmidia/kube-pod-metrics-collector:v1.0.0
# run without sending metrics to CloudWatch - dry run
docker container run --rm -d \
--name pod-metrics \
--restart 'no' \
--network host \
-e RUNNING_IN_KUBERNETES='0' \
-e SCHEDULE_SECONDS_INTERVAL='60' \
-e PENDING_MINS_TO_BE_CRASHED='1' \
-e IGNORE_NAMESPACES='kube-public,kube-node-lease' \
-e SEND_TO_CLOUDWATCH='0' \
-e KUBECONFIG='/root/.kube/config' \
-e KUBECONTEXT=$(kubectl config current-context) \
-v $HOME/.kube/config:/root/.kube/config \
docker.io/juliocesarmidia/kube-pod-metrics-collector:v1.0.0
# logs and stats
docker container logs -f pod-metrics
docker stats pod-metrics
# run sending metrics to CloudWatch (it requires AWS credentials)
docker container run --rm -d \
--name pod-metrics \
--restart 'no' \
--network host \
-e RUNNING_IN_KUBERNETES='0' \
-e SCHEDULE_SECONDS_INTERVAL='60' \
-e PENDING_MINS_TO_BE_CRASHED='1' \
-e IGNORE_NAMESPACES='kube-public,kube-node-lease' \
-e SEND_TO_CLOUDWATCH='1' \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_DEFAULT_REGION \
-e KUBECONFIG='/root/.kube/config' \
-e KUBECONTEXT=$(kubectl config current-context) \
-v $HOME/.kube/config:/root/.kube/config \
docker.io/juliocesarmidia/kube-pod-metrics-collector:v1.0.0
# clean up
docker container rm -f pod-metrics
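The SCHEDULE_SECONDS_INTERVAL and SEND_TO_CLOUDWATCH variables above drive the collection loop and the dry-run mode. A minimal sketch of that loop, assuming Python and reusing the illustrative collect_crashed_pods/send_to_cloudwatch helpers sketched near the top of this README (defaults are taken from the example commands, not necessarily from the actual code):

import os
import time

from kubernetes import config

SCHEDULE_SECONDS_INTERVAL = int(os.getenv("SCHEDULE_SECONDS_INTERVAL", "60"))
SEND_TO_CLOUDWATCH = os.getenv("SEND_TO_CLOUDWATCH", "0") == "1"

# load in-cluster or kubeconfig credentials as sketched in the prerequisites above
if os.getenv("RUNNING_IN_KUBERNETES", "0") == "1":
    config.load_incluster_config()
else:
    config.load_kube_config(config_file=os.getenv("KUBECONFIG"), context=os.getenv("KUBECONTEXT"))

while True:
    # collect_crashed_pods() / send_to_cloudwatch() are the illustrative helpers sketched earlier
    crashed = collect_crashed_pods()
    if SEND_TO_CLOUDWATCH:
        send_to_cloudwatch(crashed, cluster_name=os.getenv("CLUSTER_NAME", "unknown"))
    else:
        print(dict(crashed))  # dry run: only log what would be sent
    time.sleep(SCHEDULE_SECONDS_INTERVAL)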
Running inside Kubernetes as a pod
# create a secret for CloudWatch sdk usage, with the AWS credentials
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: pod-metrics-secrets
  namespace: default
data:
  AWS_ACCESS_KEY_ID: "$(echo -n "$AWS_ACCESS_KEY_ID" | base64 -w0)"
  AWS_SECRET_ACCESS_KEY: "$(echo -n "$AWS_SECRET_ACCESS_KEY" | base64 -w0)"
  AWS_DEFAULT_REGION: "$(echo -n "$AWS_DEFAULT_REGION" | base64 -w0)"
EOF
export CLUSTER_NAME=$(kubectl config current-context)
# create a configmap for CloudWatch sdk usage, with the cluster name
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-metrics-configmap
  namespace: default
data:
  CLUSTER_NAME: "$CLUSTER_NAME"
EOF
# check secret and cm
kubectl get secret/pod-metrics-secrets -n default -o yaml
kubectl get configmap/pod-metrics-configmap -n default -o yaml
# create the deployment
kubectl apply -f deployment.yaml
# check pod execution
kubectl get pod -l app=pod-metrics -n default
kubectl describe pod -l app=pod-metrics -n default
kubectl top pod -l app=pod-metrics -n default
# see logs
kubectl logs -f -l app=pod-metrics -n default --tail 1000 --timestamps
kubectl logs -f deploy/pod-metrics -n default --tail 1000 -c pod-metrics --timestamps
# command to execute "sh" inside the pod
kubectl exec -it deploy/pod-metrics -n default -- sh
# calling the Kubernetes API inside the pod
API_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
API_SERVER="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT_HTTPS}"
curl \
--cacert '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $API_TOKEN" \
--url "${API_SERVER}/api"
# clean up
kubectl delete -f deployment.yaml
kubectl delete secret/pod-metrics-secrets -n default
kubectl delete configmap/pod-metrics-configmap -n default