---
copyright:
lastupdated: "2018-11-13"
---
{:new_window: target="_blank"}
{:shortdesc: .shortdesc}
{:screen: .screen}
{:pre: .pre}
{:table: .aria-labeledby="caption"}
{:codeblock: .codeblock}
{:tip: .tip}
{:note: .note}
{:important: .important}
{:deprecated: .deprecated}
{:download: .download}
{:tsSymptoms: .tsSymptoms}
{:tsCauses: .tsCauses}
{:tsResolve: .tsResolve}
{: #cs_troubleshoot}
As you use {{site.data.keyword.containerlong}}, consider these techniques for general troubleshooting and debugging of your clusters. You can also check the status of the {{site.data.keyword.Bluemix_notm}} system.
{: shortdesc}
You can take these general steps to ensure that your clusters are up-to-date:
- Check monthly for available security and operating system patches to update your worker nodes.
- Update your cluster to the latest default version of Kubernetes for {{site.data.keyword.containerlong_notm}}.
{: #debug_clusters}
Review the options to debug your clusters and find the root causes for failures.
- List your clusters and find the `State` of the cluster.

      ibmcloud ks clusters
  {: pre}
- Review the `State` of your cluster. If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, start debugging the worker nodes.

  Cluster states

  | Cluster state | Description |
  |---|---|
  | Aborted | The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help). |
  | Critical | The Kubernetes master cannot be reached or all worker nodes in the cluster are down. |
  | Delete failed | The Kubernetes master or at least one worker node cannot be deleted. |
  | Deleted | The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help). |
  | Deleting | The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster. |
  | Deploy failed | The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help). |
  | Deploying | The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster. |
  | Normal | All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as [networking](cs_troubleshoot_network.html) and [storage](cs_troubleshoot_storage.html), might still need attention. |
  | Pending | The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster. |
  | Requested | A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to `Deploying`. If your cluster is stuck in the `Requested` state for a long time, open an [{{site.data.keyword.Bluemix_notm}} support case](cs_troubleshoot.html#ts_getting_help). |
  | Updating | The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster. |
  | Warning | At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload. |

The Kubernetes master is the main component that keeps your cluster up and running. The master stores cluster resources and their configurations in the etcd database that serves as the single point of truth for your cluster. The Kubernetes API server is the main entry point for all cluster management requests from the worker nodes to the master, or when you want to interact with your cluster resources.

If a master failure occurs, your workloads continue to run on the worker nodes, but you cannot use `kubectl` commands to work with your cluster resources or view the cluster health until the Kubernetes API server in the master is back up. If a pod goes down during the master outage, the pod cannot be rescheduled until the worker node can reach the Kubernetes API server again.

During a master outage, you can still run `ibmcloud ks` commands against the {{site.data.keyword.containerlong_notm}} API to work with your infrastructure resources, such as worker nodes or VLANs. If you change the current cluster configuration by adding worker nodes to the cluster or removing them from it, your changes do not take effect until the master is back up.

Do not restart or reboot a worker node during a master outage. This action removes the pods from your worker node. Because the Kubernetes API server is unavailable, the pods cannot be rescheduled onto other worker nodes in the cluster.
{: important}
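
The state descriptions above can be condensed into a small lookup. The following is a minimal sketch, not part of the `ibmcloud` CLI: `cluster_next_step` is a hypothetical helper name, and the suggested actions are only a paraphrase of the cluster states table.

```shell
#!/bin/sh
# Hypothetical helper: map a cluster state to the next step that the
# cluster states table suggests. The wording of each action is a paraphrase.
cluster_next_step() {
  case "$1" in
    Normal)
      echo "healthy: no action required" ;;
    Critical|"Delete failed"|Warning)
      echo "debug the worker nodes" ;;
    Pending)
      echo "wait; if stuck for a long time, debug the worker nodes" ;;
    Deploying|Updating|Deleting)
      echo "wait for the operation to complete" ;;
    Aborted|Deleted|Requested)
      echo "wait; if stuck for a long time, open a support case" ;;
    "Deploy failed")
      echo "open a support case" ;;
    *)
      echo "unknown state: $1" ;;
  esac
}
```

For example, `cluster_next_step "Delete failed"` prints the suggestion to debug the worker nodes. State names that contain a space must be quoted.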
{: #debug_worker_nodes}
Review the options to debug your worker nodes and find the root causes for failures.
- If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, review the state of your worker nodes.

      ibmcloud ks workers <cluster_name_or_id>
  {: pre}
- Review the `State` and `Status` fields for every worker node in your CLI output.

  Worker node states

  | Worker node state | Description |
  |---|---|
  | Critical | A worker node can go into a Critical state for many reasons: <ul><li>You initiated a reboot for your worker node without cordoning and draining your worker node. Rebooting a worker node can cause data corruption in `containerd`, `kubelet`, `kube-proxy`, and `calico`.</li><li>The pods that are deployed to your worker node do not use resource limits for [memory ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/) and [CPU ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/). Without resource limits, pods can consume all available resources, leaving no resources for other pods to run on this worker node. This overcommitment of workload causes the worker node to fail.</li><li>`containerd`, `kubelet`, or `calico` went into an unrecoverable state after it ran hundreds or thousands of containers over time.</li><li>You set up a Virtual Router Appliance for your worker node that went down and cut off the communication between your worker node and the Kubernetes master.</li><li>Current networking issues in {{site.data.keyword.containerlong_notm}} or IBM Cloud infrastructure (SoftLayer) cause the communication between your worker node and the Kubernetes master to fail.</li><li>Your worker node ran out of capacity. Check the Status of the worker node to see whether it shows Out of disk or Out of memory. If your worker node is out of capacity, consider either reducing the workload on your worker node or adding a worker node to your cluster to help load balance the workload.</li></ul>If reloading the worker node does not resolve the issue, go to the next step to continue troubleshooting your worker node.<br><br>Tip: You can [configure health checks for your worker node and enable Autorecovery](cs_health.html#autorecovery). If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like an OS reload on the worker node. For more information about how Autorecovery works, see the [Autorecovery blog ![External link icon](../icons/launch-glyph.svg "External link icon")](https://www.ibm.com/blogs/bluemix/2017/12/autorecovery-utilizes-consistent-hashing-high-availability/). |
  | Deployed | Updates are successfully deployed to your worker node. After updates are deployed, {{site.data.keyword.containerlong_notm}} starts a health check on the worker node. After the health check is successful, the worker node goes into a `Normal` state. Worker nodes in a `Deployed` state usually are ready to receive workloads, which you can check by running `kubectl get nodes` and confirming that the status shows `Ready`. |
  | Deploying | When you update the Kubernetes version of your worker node, your worker node is redeployed to install the updates. If you reload or reboot your worker node, the worker node is redeployed to automatically install the latest patch version. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the deployment. |
  | Normal | Your worker node is fully provisioned and ready to be used in the cluster. This state is considered healthy and does not require an action from the user. **Note**: Although the worker nodes might be normal, other infrastructure resources, such as [networking](cs_troubleshoot_network.html) and [storage](cs_troubleshoot_storage.html), might still need attention. |
  | Provisioning | Your worker node is being provisioned and is not available in the cluster yet. You can monitor the provisioning process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the provisioning. |
  | Provision_failed | Your worker node could not be provisioned. Continue with the next step to find the details for the failure. |
  | Reloading | Your worker node is being reloaded and is not available in the cluster. You can monitor the reloading process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the reloading. |
  | Reloading_failed | Your worker node could not be reloaded. Continue with the next step to find the details for the failure. |
  | Reload_pending | A request to reload or to update the Kubernetes version of your worker node is sent. When the worker node is being reloaded, the state changes to `Reloading`. |
  | Unknown | The Kubernetes master is not reachable for one of the following reasons: <ul><li>You requested an update of your Kubernetes master. The state of the worker node cannot be retrieved during the update.</li><li>You might have another firewall that is protecting your worker nodes, or changed firewall settings recently. {{site.data.keyword.containerlong_notm}} requires certain IP addresses and ports to be opened to allow communication from the worker node to the Kubernetes master and vice versa. For more information, see [Firewall prevents worker nodes from connecting](cs_troubleshoot_clusters.html#cs_firewall).</li><li>The Kubernetes master is down. Contact {{site.data.keyword.Bluemix_notm}} support by opening an [{{site.data.keyword.Bluemix_notm}} support case](#ts_getting_help).</li></ul> |
  | Warning | Your worker node is reaching the limit for memory or disk space. You can either reduce the workload on your worker node or add a worker node to your cluster to help load balance the workload. |

- List the details for the worker node. If the details include an error message, review the list of common error messages for worker nodes to learn how to resolve the problem.

      ibmcloud ks worker-get <worker_id>
  {: pre}

      ibmcloud ks worker-get [<cluster_name_or_id>] <worker_node_id>
  {: pre}
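
When a cluster has many worker nodes, it can help to narrow the CLI output to the rows that need attention. The following is a minimal sketch under stated assumptions: `workers_needing_attention` is a hypothetical helper, it expects the table that `ibmcloud ks workers` prints on standard input, and it matches state names as whole words rather than relying on column positions, so a row whose Status text happens to contain one of these words is also printed.

```shell
#!/bin/sh
# Hypothetical helper: print only the worker rows whose state is one of the
# states that the worker node states table treats as needing investigation.
# Reads the CLI table on stdin. Exits non-zero when no row matches.
workers_needing_attention() {
  grep -Ew 'Critical|Warning|Unknown|Provision_failed|Reloading_failed'
}
```

A possible invocation is `ibmcloud ks workers <cluster_name_or_id> | workers_needing_attention`, followed by `ibmcloud ks worker-get` for each worker ID that is listed.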
{: #common_worker_nodes_issues}
Review common error messages and learn how to resolve them.
Common error messages

| Error message | Description and resolution |
|---|---|
| `{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Your account is currently prohibited from ordering 'Computing Instances'.` | Your IBM Cloud infrastructure (SoftLayer) account might be restricted from ordering compute resources. Contact {{site.data.keyword.Bluemix_notm}} support by opening an [{{site.data.keyword.Bluemix_notm}} support case](#ts_getting_help). |
| `{{site.data.keyword.Bluemix_notm}} infrastructure exception: Could not place order.`<br><br>`{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Could not place order. There are insufficient resources behind router 'router_name' to fulfill the request for the following guests: 'worker_id'.` | The zone that you selected might not have enough infrastructure capacity to provision your worker nodes. Or, you might have exceeded a limit in your IBM Cloud infrastructure (SoftLayer) account. To resolve, try one of the following options: <ul><li>Infrastructure resource availability in zones can fluctuate often. Wait a few minutes and try again.</li><li>For a single zone cluster, create the cluster in a different zone. For a multizone cluster, add a zone to the cluster.</li><li>Specify a different pair of public and private VLANs for your worker nodes in your IBM Cloud infrastructure (SoftLayer) account. For worker nodes that are in a worker pool, you can use the `ibmcloud ks zone-network-set` [command](cs_cli_reference.html#cs_zone_network_set).</li><li>Contact your IBM Cloud infrastructure (SoftLayer) account manager to verify that you do not exceed an account limit, such as a global quota.</li><li>Open an [IBM Cloud infrastructure (SoftLayer) support case](#ts_getting_help).</li></ul> |
| `{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: Could not obtain network VLAN with ID: <vlan id>.` | Your worker node could not be provisioned because the selected VLAN ID could not be found for one of the following reasons: <ul><li>You might have specified the VLAN number instead of the VLAN ID. The VLAN number is 3 or 4 digits long, whereas the VLAN ID is 7 digits long. Run `ibmcloud ks vlans <zone>` to retrieve the VLAN ID.</li><li>The VLAN ID might not be associated with the IBM Cloud infrastructure (SoftLayer) account that you use. Run `ibmcloud ks vlans <zone>` to list available VLAN IDs for your account. To change the IBM Cloud infrastructure (SoftLayer) account, see [`ibmcloud ks credential-set`](cs_cli_reference.html#cs_credentials_set).</li></ul> |
| `SoftLayer_Exception_Order_InvalidLocation: The location provided for this order is invalid. (HTTP 500)` | Your IBM Cloud infrastructure (SoftLayer) account is not set up to order compute resources in the selected data center. Contact [{{site.data.keyword.Bluemix_notm}} support](#ts_getting_help) to verify that your account is set up correctly. |
| `{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: The user does not have the necessary {{site.data.keyword.Bluemix_notm}} Infrastructure permissions to add servers`<br><br>`{{site.data.keyword.Bluemix_notm}} Infrastructure Exception: 'Item' must be ordered with permission.`<br><br>`The {{site.data.keyword.Bluemix_notm}} infrastructure credentials could not be validated.` | You might not have the required permissions to perform the action in your IBM Cloud infrastructure (SoftLayer) portfolio, or you are using the wrong infrastructure credentials. See [Setting up the API key to enable access to the infrastructure portfolio](cs_users.html#api_key). |
| `Worker unable to talk to {{site.data.keyword.containerlong_notm}} servers. Please verify your firewall setup is allowing traffic from this worker.` | <ul><li>If you have a firewall, [configure your firewall settings to allow outgoing traffic to the appropriate ports and IP addresses](cs_firewall.html#firewall_outbound).</li><li>Check whether your cluster has a public IP by running `ibmcloud ks workers <mycluster>`. If no public IP is listed, then your cluster has only private VLANs.<ul><li>If you want the cluster to have only private VLANs, set up your [VLAN connection](cs_clusters_planning.html#private_clusters) and your [firewall](cs_firewall.html#firewall_outbound).</li><li>If you want the cluster to have a public IP, [add new worker nodes](cs_cli_reference.html#cs_worker_add) with both public and private VLANs.</li></ul></li></ul> |
| `Cannot create IMS portal token, as no IMS account is linked to the selected BSS account`<br><br>`Provided user not found or active`<br><br>`SoftLayer_Exception_User_Customer_InvalidUserStatus: User account is currently cancel_pending.`<br><br>`Waiting for machine to be visible to the user` | The owner of the API key that is used to access the IBM Cloud infrastructure (SoftLayer) portfolio does not have the required permissions to perform the action, or might be pending deletion.<br><br>As the user, follow these steps:<ol><li>If you have access to multiple accounts, make sure that you are logged in to the account where you want to work with {{site.data.keyword.containerlong_notm}}.</li><li>Run `ibmcloud ks api-key-info` to view the current API key owner that is used to access the IBM Cloud infrastructure (SoftLayer) portfolio.</li><li>Run `ibmcloud account list` to view the owner of the {{site.data.keyword.Bluemix_notm}} account that you currently use.</li><li>Contact the owner of the {{site.data.keyword.Bluemix_notm}} account and report that the API key owner has insufficient permissions in IBM Cloud infrastructure (SoftLayer) or might be pending deletion.</li></ol>As the account owner, follow these steps:<ol><li>Review the [required permissions in IBM Cloud infrastructure (SoftLayer)](cs_users.html#infra_access) to perform the action that previously failed.</li><li>Fix the permissions of the API key owner or create a new API key by using the [`ibmcloud ks api-key-reset`](cs_cli_reference.html#cs_api_key_reset) command.</li><li>If you or another account admin manually set IBM Cloud infrastructure (SoftLayer) credentials in your account, run [`ibmcloud ks credential-unset`](cs_cli_reference.html#cs_credentials_unset) to remove the credentials from your account.</li></ol> |
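
The VLAN error above often comes from passing a VLAN number where a VLAN ID is expected. As a minimal sketch, the digit-count rule from the table (3 or 4 digits for a VLAN number, 7 digits for a VLAN ID) can be checked before you create worker nodes. `vlan_value_kind` is a hypothetical helper name, and the check is only a heuristic based on that rule; `ibmcloud ks vlans <zone>` remains the authoritative source.

```shell
#!/bin/sh
# Hypothetical helper: classify a value by digit count, following the rule
# that a VLAN number has 3 or 4 digits and a VLAN ID has 7 digits.
vlan_value_kind() {
  case "$1" in
    ''|*[!0-9]*)
      # Empty input or any non-digit character
      echo "invalid" ;;
    *)
      case ${#1} in
        3|4) echo "VLAN number" ;;
        7)   echo "VLAN ID" ;;
        *)   echo "unrecognized" ;;
      esac ;;
  esac
}
```

If `vlan_value_kind` reports `VLAN number` for a value that you planned to use as a VLAN ID, look up the matching ID with `ibmcloud ks vlans <zone>`.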
{: #debug_apps}
Review the options that you have to debug your app deployments and find the root causes for failures.
- Look for abnormalities in the service or deployment resources by running the `describe` command.

  Example:

      kubectl describe service <service_name>

- Check whether the containers are stuck in the ContainerCreating state.

- Check whether the cluster is in the `Critical` state. If the cluster is in a `Critical` state, check the firewall rules and verify that the master can communicate with the worker nodes.

- Verify that the service is listening on the correct port.
  - Get the name of a pod.

        kubectl get pods

  - Log in to a container.

        kubectl exec -it <pod_name> -- /bin/bash

  - Curl the app from within the container. If the port is not accessible, the service might not be listening on the correct port or the app might have issues. Update the configuration file for the service with the correct port and redeploy, or investigate potential issues with the app.

        curl localhost:<port>

- Verify that the service is linked correctly to the pods.
  - Get the name of a pod.

        kubectl get pods

  - Log in to a container.

        kubectl exec -it <pod_name> -- /bin/bash

  - Curl the cluster IP address and port of the service. If the IP address and port are not accessible, look at the endpoints for the service. If no endpoints are listed, then the selector for the service does not match the pods. If endpoints are listed, then look at the target port field on the service and make sure that the target port is the same as the port that the pods use.

        curl <cluster_IP>:<port>

- For Ingress services, verify that the service is accessible from within the cluster.
  - Get the name of a pod.

        kubectl get pods

  - Log in to a container.

        kubectl exec -it <pod_name> -- /bin/bash

  - Curl the URL specified for the Ingress service. If the URL is not accessible, check for a firewall issue between the cluster and the external endpoint.

        curl <host_name>.<domain>
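
The endpoint check from the steps above can be sketched as a small function. This is an illustration under stated assumptions, not a documented procedure: `service_has_endpoints` is a hypothetical helper, and it assumes that `kubectl` is already configured for your cluster and that the service is in the current namespace. It reads the endpoint IP addresses with a JSONPath query and treats an empty result as a selector that matches no pods.

```shell
#!/bin/sh
# Hypothetical helper: report whether a service has endpoints, which
# indicates whether its selector matches any running pods.
# Assumes kubectl targets the right cluster and namespace.
service_has_endpoints() {
  endpoints=$(kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$endpoints" ]; then
    echo "no endpoints: check that the service selector matches the pod labels"
    return 1
  fi
  echo "endpoints: $endpoints"
}
```

If endpoints are reported but the service is still unreachable, compare the target port of the service with the port that the pods use, as described in the steps above.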
{: #ts_getting_help}
Still having issues with your cluster?
{: shortdesc}
- In the terminal, you are notified when updates to the `ibmcloud` CLI and plug-ins are available. Be sure to keep your CLI up-to-date so that you can use all the available commands and flags.
- To see whether {{site.data.keyword.Bluemix_notm}} is available, check the {{site.data.keyword.Bluemix_notm}} status page.
- Post a question in the {{site.data.keyword.containerlong_notm}} Slack.

  If you are not using an IBM ID for your {{site.data.keyword.Bluemix_notm}} account, request an invitation to this Slack.
  {: tip}
- Review the forums to see whether other users ran into the same issue. When you use the forums to ask a question, tag your question so that it is seen by the {{site.data.keyword.Bluemix_notm}} development teams.
  - If you have technical questions about developing or deploying clusters or apps with {{site.data.keyword.containerlong_notm}}, post your question on Stack Overflow and tag your question with `ibm-cloud`, `kubernetes`, and `containers`.
  - For questions about the service and getting started instructions, use the IBM Developer Answers forum. Include the `ibm-cloud` and `containers` tags.

  See Getting help for more details about using the forums.
- Contact IBM Support by opening a case. To learn about opening an IBM support case, or about support levels and case severities, see Contacting support.

  When you report an issue, include your cluster ID. To get your cluster ID, run `ibmcloud ks clusters`.
  {: tip}