Common issues and solutions

Here are some of the common issues and resolutions that have been encountered in the environment.

Kubernetes

Connect to a Kubernetes Node

The following DaemonSet is typically already deployed as part of the platform Terraform.

https://gist.github.com/zachomedia/2a3a799a1468915de7414f2bcacda984
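You can check whether it is already present before deploying it yourself (the exact DaemonSet name comes from the gist above; kube-system is assumed, matching the pods below):

kubectl -n kube-system get daemonsets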

To connect to the node, first exec into one of the sh-* pods.
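Each pod of the DaemonSet is pinned to one Node, so listing the pods with -o wide lets you pick the one running on the Node you want to reach (assuming the pods are named sh-* as below):

kubectl -n kube-system get pods -o wide | grep '^sh-'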

kubectl exec -it -n kube-system sh-4zfdd sh   

Now execute the following chroot command, which gives you the node context.

chroot /mnt /bin/bash

You will then be greeted by the following prompt:

root@aks-nodepool1-XXXXXXXX-vmss00006Q:/#  

NodeDiskPressure on individual Kubernetes Node

If a Node has NodeDiskPressure, it can be difficult to use the connect-to-Node workflow above. In that case, connect to another Node and then SSH from there onto the Node that has NodeDiskPressure.

First exec into one of the sh-* pods running on a Node that does not have NodeDiskPressure.

kubectl exec -it -n kube-system sh-4zfdd sh   

Once inside the container, ensure you have an SSH client installed.

apk --update add openssh-client

At this point you should be able to SSH into the node with NodeDiskPressure. You can get the SSH key from our Vault instance.
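If you paste the key from Vault into a file inside the container, keep in mind that ssh refuses keys with open permissions; a small sketch, where the id_rsa filename is just a placeholder:

vi id_rsa         # paste the private key from Vault
chmod 600 id_rsa  # ssh requires the key to be readable only by you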

ssh -i id_rsa azureuser@aks-nodepool1-XXXXX-vmss0000XX

Once connected to the node, you can determine where the majority of the disk usage is occurring with the df command.

df -h

If the df command is not sufficient, the more interactive ncdu tool can be used:

apt-get install ncdu -y
ncdu -x /

Slow API calls or command failures (context deadline, etc.)

If gatekeeper has crashed, it will begin to block commands that require further validation. Unfortunately, due to a bug, gatekeeper will not restart on its own and requires manual intervention.

This may be seen when running commands such as kubectl logs, but it can present in other ways. One common error message is:

Error from server (InternalError): Internal error occurred: Authorization error (user=$USER, verb=get, resource=nodes, subresource=proxy)
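To confirm gatekeeper is the culprit, check that its pod is not ready (assuming it runs in the gatekeeper-system namespace, as in the commands further down):

kubectl -n gatekeeper-system get pods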

Update the failure policy:

kubectl patch validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

Once gatekeeper is running (1/1), run the following to restore the failure policy:

kubectl patch validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'

In the event that gatekeeper is 1/1 and you are still having issues:

  1. kubectl -n gatekeeper-system rollout restart deployment gatekeeper-controller-manager
  2. kubectl -n gatekeeper-system scale rs gatekeeper-controller-manager-$OLDRSID --replicas=0

This should allow the new pod to successfully start.
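If you need to identify $OLDRSID for step 2, list the ReplicaSets; the crashing pod's name starts with the name of the old ReplicaSet:

kubectl -n gatekeeper-system get rs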

Slow API calls or command failures with OPA GateKeeper ruled out

In extremely rare cases there can be intermittent communication issues between the Kubernetes API server and the cluster nodes.

After confirming in the logs that there have been disconnects, simply restart the tunnelfront pod.
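A minimal sketch, assuming tunnelfront runs in kube-system (the AKS default); check the logs for disconnects, then delete the pod so it is recreated:

kubectl -n kube-system get pods | grep tunnelfront
kubectl -n kube-system logs <tunnelfront_pod_name> --tail=100
kubectl -n kube-system delete pod <tunnelfront_pod_name>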

Pods stuck in terminating

Sometimes, a Pod may be stuck in Terminating, preventing it from being rescheduled. This can be caused by Boathouse being unable to unmount its drives.
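To find out which Node the stuck Pod is scheduled on (and therefore which kubelet to inspect), you can use:

kubectl get pod <pod_name> -n <namespace> -o wide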

To verify this, you can check the kubelet logs. On a Node, you may use the following:

journalctl -u kubelet --since "15 minutes ago"

To remove the Pod forcefully:

kubectl delete pod <pod_name> -n <namespace> --grace-period 0 --force

Kubeflow

Connection errors through the browser (Upstream connect)

Try restarting the following components:

kubectl -n kubeflow rollout restart deploy centraldashboard
kubectl -n istio-system rollout restart statefulset authservice
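If the error persists after the restarts, the authservice and centraldashboard logs usually show where the upstream connection is failing (a sketch, using the same namespaces as above):

kubectl -n istio-system logs statefulset/authservice --tail=100
kubectl -n kubeflow logs deploy/centraldashboard --tail=100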

Properly deleting a User

The following deletes the profile, which removes its namespace as well.

kubectl get profile
kubectl delete profile <username>

If the namespace is still not deleted despite the profile being removed, delete it manually:

kubectl delete ns <username>
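If the namespace hangs in Terminating, it is usually waiting on a finalizer; a quick way to see which one (a sketch):

kubectl get ns <username> -o yaml | grep -A 5 finalizers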

Corrupted Indexes with Kibana

Delete a corrupted index:

curl -X DELETE -v --user elastic:$(kubectl -n daaas get secret daaas-es-elastic-user '--template={{ .data.elastic }}' | base64 --decode) https://elastic.covid.cloud.statcan.ca/.kibana_5
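The $PASSWORD and $PASS variables in the next two commands are the same elastic user password pulled from the secret above; you can export it once for convenience:

export PASSWORD=$(kubectl -n daaas get secret daaas-es-elastic-user '--template={{ .data.elastic }}' | base64 --decode)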

Delete a corrupted document:

curl -X DELETE -v --user elastic:$PASSWORD https://elastic.covid.cloud.statcan.ca/.kibana_4/_doc/test-test-test

Delete a corrupted visualization:

curl -X DELETE -v --user elastic:$PASS https://elastic.covid.cloud.statcan.ca/.kibana_4/_doc/visualization:rl_learning_daily_testing_july4

Managed PostgreSQL

Connectivity issues to Managed DB from Kubernetes cluster:

  • Enable VNET peering
  • Add Containers subnet to firewall on database
  • Add Istio ServiceEntry for db
  • Enable Service Endpoints for Microsoft.Sql on Kubernetes virtual network
An example ServiceEntry for the managed database:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: managed-postgresql-db
  namespace: default
spec:
  addresses:
  - XX.XXX.XXX.XXX
  hosts:
  - manageddb.postgres.database.azure.com
  location: MESH_EXTERNAL
  ports:
  - name: psql
    number: 5432
    protocol: TLS
  resolution: DNS
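To confirm connectivity from inside the cluster once these are in place, one option is a throwaway client pod (a sketch; the host, user, and image tag are placeholders):

kubectl run psql-test --rm -it --image=postgres:13 -- psql "host=manageddb.postgres.database.azure.com port=5432 user=<user> dbname=postgres sslmode=require"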