copyright

lastupdated

years
2014, 2018

2018-11-13

{:new_window: target="_blank"} {:shortdesc: .shortdesc} {:screen: .screen} {:pre: .pre} {:table: .aria-labeledby="caption"} {:codeblock: .codeblock} {:tip: .tip} {:note: .note} {:important: .important} {:deprecated: .deprecated} {:download: .download} {:tsSymptoms: .tsSymptoms} {:tsCauses: .tsCauses} {:tsResolve: .tsResolve}

Debugging your cluster

{: #cs_troubleshoot}

As you use {{site.data.keyword.containerlong}}, consider these techniques for general troubleshooting and debugging your clusters. You can also check the status of the {{site.data.keyword.Bluemix_notm}} system External link icon . {: shortdesc}

You can take these general steps to ensure that your clusters are up-to-date:

Check monthly for available security and operating system patches to update your worker nodes.
Update your cluster to the latest default version of Kubernetes for {{site.data.keyword.containerlong_notm}}

Debugging clusters

{: #debug_clusters}

Review the options to debug your clusters and find the root causes for failures.

List your cluster and find the State of the cluster.

ibmcloud ks clusters

{: pre}

Review the State of your cluster. If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, start debugging the worker nodes.

Cluster states

Cluster state	Description
Aborted	The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help).
Critical	The Kubernetes master cannot be reached or all worker nodes in the cluster are down.
Delete failed	The Kubernetes master or at least one worker node cannot be deleted.
Deleted	The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help).
Deleting	The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster.
Deploy failed	The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help).
Deploying	The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster.
Normal	All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as [networking](cs_troubleshoot_network.html) and [storage](cs_troubleshoot_storage.html), might still need attention.
Pending	The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster.
Requested	A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to `Deploying`. If your cluster is stuck in the `Requested` state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help).
Updating	The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster.
Warning	At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload.

The Kubernetes master is the main component that keeps your cluster up and running. The master stores cluster resources and their configurations in the etcd database that serves as the single point of truth for your cluster. The Kubernetes API server is the main entry point for all cluster management requests from the worker nodes to the master, or when you want to interact with your cluster resources.

If a master failure occurs, your workloads continue to run on the worker nodes, but you cannot use kubectl commands to work with your cluster resources or view the cluster health until the Kubernetes API server in the master is back up. If a pod goes down during the master outage, the pod cannot be rescheduled until the worker node can reach the Kubernetes API server again.

During a master outage, you can still run ibmcloud ks commands against the {{site.data.keyword.containerlong_notm}} API to work with your infrastructure resources, such as worker nodes or VLANs. If you change the current cluster configuration by adding or removing worker nodes to the cluster, your changes do not happen until the master is back up.

Do not restart or reboot a worker node during a master outage. This action removes the pods from your worker node. Because the Kubernetes API server is unavailable, the pods cannot be rescheduled onto other worker nodes in the cluster. {: important}

Debugging worker nodes

{: #debug_worker_nodes}

Review the options to debug your worker nodes and find the root causes for failures.

If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, review the state of your worker nodes.

ibmcloud ks workers <cluster_name_or_id>

{: pre}

Review the State and Status field for every worker node in your CLI output.

Worker node states

Worker node state	Description
Critical	A worker node can go into a Critical state for many reasons: You initiated a reboot for your worker node without cordoning and draining your worker node. Rebooting a worker node can cause data corruption in `containerd`, `kubelet`, `kube-proxy`, and `calico`. The pods that are deployed to your worker node do not use resource limits for [memory ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/) and [CPU ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/). Without resource limits, pods can consume all available resources, leaving no resources for other pods to run on this worker node. This overcommitment of workload causes the worker node to fail. `containerd`, `kubelet`, or `calico` went into an unrecoverable state after it ran hundreds or thousands of containers over time. You set up a Virtual Router Appliance for your worker node that went down and cut off the communication between your worker node and the Kubernetes master. Current networking issues in {{site.data.keyword.containerlong_notm}} or IBM Cloud infrastructure (SoftLayer) that causes the communication between your worker node and the Kubernetes master to fail. Your worker node ran out of capacity. Check the Status of the worker node to see whether it shows Out of disk or Out of memory. If your worker node is out of capacity, consider to either reduce the workload on your worker node or add a worker node to your cluster to help load balance the workload. In many cases, [reloading](cs_cli_reference.html#cs_worker_reload) your worker node can solve the problem. When you reload your worker node, the latest [patch version](cs_versions.html#version_types) is applied to your worker node. The major and minor version is not changed. Before you reload your worker node, make sure to cordon and drain your worker node to ensure that the existing pods are terminated gracefully and rescheduled onto remaining worker nodes. If reloading the worker node does not resolve the issue, go to the next step to continue troubleshooting your worker node. Tip: You can [configure health checks for your worker node and enable Autorecovery](cs_health.html#autorecovery). If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like an OS reload on the worker node. For more information about how Autorecovery works, see the [Autorecovery blog ![External link icon](../icons/launch-glyph.svg "External link icon")](https://www.ibm.com/blogs/bluemix/2017/12/autorecovery-utilizes-consistent-hashing-high-availability/).
Deployed	Updates are successfully deployed to your worker node. After updates are deployed, {{site.data.keyword.containerlong_notm}} starts a health check on the worker node. After the health check is successful, the worker node goes into a `Normal` state. Worker nodes in a `Deployed` state usually are ready to receive workloads, which you can check by running `kubectl get nodes` and confirming that the state shows `Normal`.
Deploying	When you update the Kubernetes version of your worker node, your worker node is redeployed to install the updates. If you reload or reboot your worker node, the worker node is redeployed to automatically install the latest patch version. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the deployment.
Normal	Your worker node is fully provisioned and ready to be used in the cluster. This state is considered healthy and does not require an action from the user. Note: Although the worker nodes might be normal, other infrastructure resources, such as [networking](cs_troubleshoot_network.html) and [storage](cs_troubleshoot_storage.html), might still need attention.
Provisioning	Your worker node is being provisioned and is not available in the cluster yet. You can monitor the provisioning process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the provisioning.
Provision_failed	Your worker node could not be provisioned. Continue with the next step to find the details for the failure.
Reloading	Your worker node is being reloaded and is not available in the cluster. You can monitor the reloading process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the reloading.
Reloading_failed	Your worker node could not be reloaded. Continue with the next step to find the details for the failure.
Reload_pending	A request to reload or to update the Kubernetes version of your worker node is sent. When the worker node is being reloaded, the state changes to `Reloading`.
Unknown	The Kubernetes master is not reachable for one of the following reasons: You requested an update of your Kubernetes master. The state of the worker node cannot be retrieved during the update. You might have another firewall that is protecting your worker nodes, or changed firewall settings recently. {{site.data.keyword.containerlong_notm}} requires certain IP addresses and ports to be opened to allow communication from the worker node to the Kubernetes master and vice versa. For more information, see [Firewall prevents worker nodes from connecting](cs_troubleshoot_clusters.html#cs_firewall). The Kubernetes master is down. Contact {{site.data.keyword.Bluemix_notm}} support by opening an [{{site.data.keyword.Bluemix_notm}} support case](#ts_getting_help).
Warning	Your worker node is reaching the limit for memory or disk space. You can either reduce work load on your worker node or add a worker node to your cluster to help load balance the work load.

List the details for the worker node. If the details include an error message, review the list of common error messages for worker nodes to learn how to resolve the problem.

ibmcloud ks worker-get <worker_id>

{: pre}

ibmcloud ks worker-get [<cluster_name_or_id>] <worker_node_id>

{: pre}

Common issues with worker nodes

{: #common_worker_nodes_issues}

Review common error messages and learn how to resolve them.

Common error messages

Error message	Description and resolution
{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Your account is currently prohibited from ordering 'Computing Instances'.	Your IBM Cloud infrastructure (SoftLayer) account might be restricted from ordering compute resources. Contact {{site.data.keyword.Bluemix_notm}} support by opening an [{{site.data.keyword.Bluemix_notm}} support case](#ts_getting_help).
{{site.data.keyword.Bluemix_notm}} infrastructure exception: Could not place order. {{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Could not place order. There are insufficient resources behind router 'router_name' to fulfill the request for the following guests: 'worker_id'.	The zone that you selected might not have enough infrastructure capacity to provision your worker nodes. Or, you might have exceeded a limit in your IBM Cloud infrastructure (SoftLayer) account. To resolve, try one of the following options: Infrastructure resource availability in zones can fluctuate often. Wait a few minutes and try again. For a single zone cluster, create the cluster in a different zone. For a multizone cluster, add a zone to the cluster. Specify a different pair of public and private VLANs for your worker nodes in your IBM Cloud infrastructure (SoftLayer) account. For worker nodes that are in a worker pool, you can use the `ibmcloud ks zone-network-set` [command](cs_cli_reference.html#cs_zone_network_set). Contact your IBM Cloud infrastructure (SoftLayer) account manager to verify that you do not exceed an account limit, such as a global quota. Open an [IBM Cloud infrastructure (SoftLayer) support case](#ts_getting_help)
{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Could not obtain network VLAN with ID: <vlan id>.	Your worker node could not be provisioned because the selected VLAN ID could not be found for one of the following reasons: You might have specified the VLAN number instead of the VLAN ID. The VLAN number is 3 or 4 digits long, whereas the VLAN ID is 7 digits long. Run `ibmcloud ks vlans <zone>` to retrieve the VLAN ID. The VLAN ID might not be associated with the IBM Cloud infrastructure (SoftLayer) account that you use. Run `ibmcloud ks vlans <zone>` to list available VLAN IDs for your account. To change the IBM Cloud infrastructure (SoftLayer) account, see [`ibmcloud ks credential-set`](cs_cli_reference.html#cs_credentials_set).
SoftLayer_Exception_Order_InvalidLocation: The location provided for this order is invalid. (HTTP 500)	Your IBM Cloud infrastructure (SoftLayer) is not set up to order compute resources in the selected data center. Contact [{{site.data.keyword.Bluemix_notm}} support](#ts_getting_help) to verify that you account is set up correctly.
{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: The user does not have the necessary {{site.data.keyword.Bluemix_notm}} Infrastructure permissions to add servers {{site.data.keyword.Bluemix_notm}} Infrastructure Exception: 'Item' must be ordered with permission. The {{site.data.keyword.Bluemix_notm}} infrastructure credentials could not be validated.	You might not have the required permissions to perform the action in your IBM Cloud infrastructure (SoftLayer) portfolio, or you are using the wrong infrastructure credentials. See [Setting up the API key to enable access to the infrastructure portfolio](cs_users.html#api_key).
Worker unable to talk to {{site.data.keyword.containerlong_notm}} servers. Please verify your firewall setup is allowing traffic from this worker.	If you have a firewall, [configure your firewall settings to allow outgoing traffic to the appropriate ports and IP addresses](cs_firewall.html#firewall_outbound). Check whether your cluster does not have a public IP by running `ibmcloud ks workers <mycluster>`. If no public IP is listed, then your cluster has only private VLANs. If you want the cluster to have only private VLANs, set up your [VLAN connection](cs_clusters_planning.html#private_clusters) and your [firewall](cs_firewall.html#firewall_outbound). If you want the cluster to have a public IP, [add new worker nodes](cs_cli_reference.html#cs_worker_add) with both public and private VLANs.
Cannot create IMS portal token, as no IMS account is linked to the selected BSS account Provided user not found or active SoftLayer_Exception_User_Customer_InvalidUserStatus: User account is currently cancel_pending. Waiting for machine to be visible to the user	The owner of the API key that is used to access the IBM Cloud infrastructure (SoftLayer) portfolio does not have the required permissions to perform the action, or might be pending deletion. As the user, follow these steps: If you have access to multiple accounts, make sure that you are logged in to the account where you want to work with {{site.data.keyword.containerlong_notm}}. Run `ibmcloud ks api-key-info` to view the current API key owner that is used to access the IBM Cloud infrastructure (SoftLayer) portfolio. Run `ibmcloud account list` to view the owner of the {{site.data.keyword.Bluemix_notm}} account that you currently use. Contact the owner of the {{site.data.keyword.Bluemix_notm}} account and report that the API key owner has insufficient permissions in IBM Cloud infrastructure (SoftLayer) or might be pending to be deleted. As the account owner, follow these steps: Review the [required permissions in IBM Cloud infrastructure (SoftLayer)](cs_users.html#infra_access) to perform the action that previously failed. Fix the permissions of the API key owner or create a new API key by using the [`ibmcloud ks api-key-reset`](cs_cli_reference.html#cs_api_key_reset) command. If you or another account admin manually set IBM Cloud infrastructure (SoftLayer) credentials in your account, run [`ibmcloud ks credential-unset`](cs_cli_reference.html#cs_credentials_unset) to remove the credentials from your account.

Debugging app deployments

{: #debug_apps}

Review the options that you have to debug your app deployments and find the root causes for failures.

Look for abnormalities in the service or deployment resources by running the describe command.

Example:

kubectl describe service <service_name>

Check whether the containers are stuck in the ContainerCreating state.
Check whether the cluster is in the Critical state. If the cluster is in a Critical state, check the firewall rules and verify that the master can communicate with the worker nodes.
Verify that the service is listening on the correct port.
1. Get the name of a pod.
```
kubectl get pods
```
1. Log in to a container.
```
kubectl exec -it <pod_name> -- /bin/bash
```
1. Curl the app from within the container. If the port is not accessible, the service might not be listening on the correct port or the app might have issues. Update the configuration file for the service with the correct port and redeploy or investigate potential issues with the app.
```
curl localhost: <port>
```
Verify that the service is linked correctly to the pods.
1. Get the name of a pod.
```
kubectl get pods
```
1. Log in to a container.
```
kubectl exec -it <pod_name> -- /bin/bash
```
1. Curl the cluster IP address and port of the service. If the IP address and port are not accessible, look at the endpoints for the service. If no endpoints are listed, then the selector for the service does not match the pods. If endpoints are listed, then look at the target port field on the service and make sure that the target port is the same as what is being used for the pods.
```
curl <cluster_IP>:<port>
```
For Ingress services, verify that the service is accessible from within the cluster.
1. Get the name of a pod.
```
kubectl get pods
```
1. Log in to a container.
```
kubectl exec -it <pod_name> -- /bin/bash
```
1. Curl the URL specified for the Ingress service. If the URL is not accessible, check for a firewall issue between the cluster and the external endpoint.
```
curl <host_name>.<domain>
```

Getting help and support

{: #ts_getting_help}

Still having issues with your cluster? {: shortdesc}

In the terminal, you are notified when updates to the ibmcloud CLI and plug-ins are available. Be sure to keep your CLI up-to-date so that you can use all the available commands and flags.
To see whether {{site.data.keyword.Bluemix_notm}} is available, check the {{site.data.keyword.Bluemix_notm}} status page .
Post a question in the {{site.data.keyword.containerlong_notm}} Slack . If you are not using an IBM ID for your {{site.data.keyword.Bluemix_notm}} account, request an invitation to this Slack. {: tip}
Review the forums to see whether other users ran into the same issue. When you use the forums to ask a question, tag your question so that it is seen by the {{site.data.keyword.Bluemix_notm}} development teams.
- If you have technical questions about developing or deploying clusters or apps with {{site.data.keyword.containerlong_notm}}, post your question on Stack Overflow and tag your question with ibm-cloud, kubernetes, and containers.
- For questions about the service and getting started instructions, use the IBM Developer Answers forum. Include the ibm-cloud and containers tags. See Getting help for more details about using the forums.
Contact IBM Support by opening a case. To learn about opening an IBM support case, or about support levels and case severities, see Contacting support. When you report an issue, include your cluster ID. To get your cluster ID, run ibmcloud ks clusters. {: tip}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cs_troubleshoot.md

cs_troubleshoot.md

Debugging your cluster

Debugging clusters

Debugging worker nodes

Common issues with worker nodes

Debugging app deployments

Getting help and support

Files

cs_troubleshoot.md

Latest commit

History

cs_troubleshoot.md

File metadata and controls

Debugging your cluster

Debugging clusters

Debugging worker nodes

Common issues with worker nodes

Debugging app deployments

Getting help and support