-
Notifications
You must be signed in to change notification settings - Fork 68
/
checklist.html.md.erb
373 lines (262 loc) · 17.5 KB
/
checklist.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
---
title: Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition
owner: TKGI
---
This topic describes the preparation steps to complete before upgrading
VMware Tanzu Kubernetes Grid Integrated Edition (TKGI) from <%= vars.product_version_prev %> to <%= vars.product_version %>.
##<a id='overview'></a> Overview
The following are the procedures that you must complete before beginning your TKGI upgrade.
<p class="note warning"><strong>Warning</strong>: Failure to follow these instructions
might jeopardize your existing deployment data and cause the TKGI upgrade to fail.
</p>
To prepare for a TKGI Upgrade:
1. [Back Up Your TKGI Deployment](#backup)
1. [Review What Happens During TKGI Upgrades](#understand-upgrades)
1. [Review Changes in TKGI](#review-changes)
1. [Determine Upgrade Order (vSphere Only)](#upgrade-order)
<% if vars.product_version == "v1.19" %>
1. [Migrate Workloads Off of Flannel Clusters](#off-flannel)
<% end %>
1. [Set User Expectations and Restrict Cluster Access](#expectations)
1. If you are upgrading a cluster that uses a public cloud CSI driver,
see [Limitations on Using a Public Cloud CSI Driver](release-notes.html#1-15-0-csi-driver-limits)
in _Release Notes_ for additional requirements.
1. [Upgrade All Clusters to <%= vars.product_version_prev %>](#upgrade-clusters)
1. [Verify Your Clusters Support Upgrading](#resource-usage)
1. [Verify Health of Kubernetes Environment](#verify-k8s-health)
1. [Verify Your Environment Configuration](#review-configurations)
1. [Clean Up or Fix Failed Kubernetes Clusters](#clean-up)
1. [Verify Kubernetes Clusters Have Unique External Hostnames](#unique-hostname)
1. [Verify TKGI Proxy Configuration](#tkgi-proxy)
1. [Check PodDisruptionBudget Value](#check-poddisruptionbudget-value)
1. [(Optional) Configure Node Drain Behavior]()
After completing the steps in this topic, continue to
[Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea Networking)](upgrade.html) or
[Upgrading Tanzu Kubernetes Grid Integrated Edition (NSX Networking)](upgrade-nsxt.html).
##<a id='backup'></a> Back Up Your Tanzu Kubernetes Grid Integrated Edition Deployment
<%= vars.recommended_by %> recommends backing up your Tanzu Kubernetes Grid Integrated Edition
deployment and workloads before upgrading.
To back up Tanzu Kubernetes Grid Integrated Edition, see
[Backing Up and Restoring Tanzu Kubernetes Grid Integrated Edition](backup-and-restore.html).
##<a id='understand-upgrades'></a> Review What Happens During Tanzu Kubernetes Grid Integrated Edition Upgrades
If you have not already done so, review [About Tanzu Kubernetes Grid Integrated Edition Upgrades](understanding-upgrades.html).
Plan your upgrade based on your workload capacity and uptime requirements.
##<a id='review-changes'></a> Review Changes in Tanzu Kubernetes Grid Integrated Edition <%= vars.product_version %>
Review the [Release Notes](release-notes.html) for Tanzu Kubernetes Grid Integrated Edition <%= vars.product_version %>.
##<a id='upgrade-order'></a> Determine Upgrade Order (vSphere Only)
To determine the upgrade order for your Tanzu Kubernetes Grid Integrated Edition environment,
review
[Upgrade Order for Tanzu Kubernetes Grid Integrated Edition Environments on vSphere](upgrade-scenarios.html).
<% if vars.product_version == "v1.19" %>
##<a id='off-flannel'></a> Migrate Workloads Off of Flannel Clusters
Antrea is the Container Networking Interface (CNI) for TKGI <%= vars.product_version %>. Flannel CNI is no longer supported.
If you have clusters running with Flannel CNI on TKGI 1.18 or earlier, create clusters with Antrea CNI and move all workloads from Flannel clusters to Antrea clusters before you upgrade TKGI to 1.19.
- Clusters with Flannel CNI continue to run after TKGI upgrade to 1.19, but cannot be updated or upgraded.
For more information about Flannel CNI removal, see <a href="https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid-Integrated-Edition/1.18/tkgi/GUID-understanding-upgrades.html#upgrade-the-cni">About Switching from the Flannel CNI to the Antrea CNI</a> in the TKGI 1.18 documentation.
<% end %>
##<a id='expectations'></a> Set User Expectations and Restrict Cluster Access
Coordinate the Tanzu Kubernetes Grid Integrated Edition upgrade with cluster admins and users.
During the upgrade:
* Their workloads will remain active and accessible.
* They will be unable to perform cluster management functions,
including creating, resizing, updating, and deleting clusters.
* They will be unable to log in to TKGI or use the TKGI CLI and other TKGI control plane services.
<p class="note"><strong>Note</strong>: Do not start any cluster management tasks right before an upgrade.
Wait for cluster operations to complete before upgrading.
</p>
##<a id='upgrade-clusters'></a> Upgrade All Clusters to <%= vars.product_version_prev %>
Tanzu Kubernetes Grid Integrated Edition <%= vars.product_version %> does not support clusters
running versions of TKGI earlier than <%= vars.product_version_prev %>.
Before you upgrade from Tanzu Kubernetes Grid Integrated Edition <%= vars.product_version_prev %> to <%= vars.product_version %>,
you must upgrade all of your TKGI-provisioned clusters
to <%= vars.product_version_prev %>.
To upgrade TKGI-provisioned clusters:
1. Check the version of your clusters:
```
tkgi clusters
```
1. If one or more of your clusters are running a version of TKGI
earlier than <%= vars.product_version_prev %>:
1. Verify these clusters support being upgraded to TKGI <%= vars.product_version_prev %>.
For more information, see [Verify Your Clusters Support Upgrading](#resource-usage) below.
1. Upgrade the clusters to TKGI <%= vars.product_version_prev %>.
For instructions,
see the [Upgrading Clusters](https://techdocs.broadcom.com/us/en/vmware-tanzu/standalone-components/tanzu-kubernetes-grid-integrated-edition/<%= vars.product_version_prev_raw %>/tkgi/upgrade-clusters.html) topic
in the TKGI <%= vars.product_version_prev %> documentation.
##<a id='resource-usage'></a> Verify Your Clusters Support Upgrading
It is critical that you confirm that a cluster's resource usage is within the
recommended maximum limits before upgrading the cluster.
VMware Tanzu Kubernetes Grid Integrated Edition upgrades a cluster by upgrading control plane and worker nodes individually.
The upgrade processes a control plane node by redistributing the node's workload, stopping the node, upgrading it and restoring its workload.
This redistribution of a node's workloads increases the resource usage on the remaining nodes during the upgrade process.
If a Kubernetes cluster control plane VM is operating too close to capacity, the upgrade can fail.
<p class="note warning"><strong>Warning</strong>: Downtime is required to repair a cluster failure
resulting from upgrading an overloaded Kubernetes cluster control plane VM.
</p>
To prevent workload downtime during a cluster upgrade, complete the following before upgrading a cluster:
1. Ensure none of the control plane VMs being upgraded will become overloaded during the cluster upgrade.
See [Control Plane Node VM Size](vm-sizing.html#master-sizing) for more information.
1. Review the cluster's workload resource usage.
1. Scale up the cluster if it is near capacity on its existing infrastructure.
1. If you are updating a cluster that uses a public cloud CSI driver,
see [Limitations on Using a Public Cloud CSI Driver](release-notes.html#1-15-0-csi-driver-limits)
in _Release Notes_ for additional requirements.
1. Scale up your cluster by running the command below or create a new cluster
using a larger plan. For more information, see [Changing Cluster Configurations](scale-clusters.html).
```
tkgi update-cluster CLUSTER-NAME --num-nodes NUMBER-OF-WORKER-NODES
```
Where:
* `CLUSTER-NAME` is the name of your cluster.
* `NUMBER-OF-WORKER-NODES` is the number of worker nodes that you want to set for the cluster.
<p class="note"><strong>Note</strong>: VMware recommends that you avoid using the
<code>tkgi resize</code> command to perform resizing operations.</p>
1. Run the cluster's workloads on at least three
worker VMs using multiple replicas of your workloads spread across those VMs.
For more information, see [Maintaining Workload Uptime](maintain-uptime.html).
##<a id='verify-k8s-health'></a> Verify Health of Kubernetes Environment
Verify that your Kubernetes environment is healthy.
To verify the health of your Kubernetes environment, see [Verifying Deployment Health](verify-health.html).
##<a id='review-configurations'></a> Verify Your Environment Configuration
If you are upgrading Tanzu Kubernetes Grid Integrated Edition,
verify the configuration of your environment supports the TKGI version you are installing:
* [Verify Your vSphere with NSX Configuration](#review-vsphere-nsxt)
* [Verify Your Antrea Environment Configuration](#review-non-nsxt)
###<a id='review-vsphere-nsxt'></a> Verify Your vSphere with NSX Configuration
If you are upgrading Tanzu Kubernetes Grid Integrated Edition for environments using vSphere with NSX, perform the following steps:
1. Verify that the vSphere datastores have enough space.
1. Verify that the vSphere hosts have enough memory.
1. Verify that there are no alarms in vSphere.
1. Verify that the vSphere hosts are in a good state.
1. Verify that NSX Edge is configured for high availability.
<p class="note"><strong>Note</strong>: Workloads in your Kubernetes cluster are unavailable while
the NSX Edge nodes run the upgrade unless you configure NSX Edge for high availability. For more
information, see the <a href="./nsxt-prepare-env.html#nsx-edge-ha">Configure NSX Edge for High Availability (HA)</a>
section of <em>Preparing NSX Before Deploying Tanzu Kubernetes Grid Integrated Edition</em>.</p>
###<a id='review-non-nsxt'></a> Verify Your Antrea Environment Configuration
If you are upgrading Tanzu Kubernetes Grid Integrated Edition in an environment using Antrea networking,
perform the following steps:
1. Verify the 6081 UDP port is open on all worker node VMs.
1. Verify the 8091 TCP port is open on all control plane node VMs.
1. Verify your environment configuration meets the Antrea networking requirements.
For more information, see [Network Requirements](https://github.com/antrea-io/antrea/blob/main/docs/network-requirements.md#network-requirements)
in the Antrea GitHub repository.
<p class="note"><strong>Note</strong>: Port 6081 must be open on all of the worker node VMs and
port 8091 must be open on all control plane node VMs in the clusters you create in an Antrea networking environment.</p>
##<a id='clean-up'></a> Clean Up or Fix Failed Kubernetes Clusters
Clean up or fix any previous failed attempts to create TKGI clusters
with the TKGI Command Line Interface
(TKGI CLI) by performing the following steps:
1. View your deployed clusters by running the following command:
```
tkgi clusters
```
If the `Status` of any cluster displays as `FAILED`, continue to the next step. If no cluster
displays as `FAILED`, no action is required. Continue to the next section.
1. To troubleshoot and fix failed clusters, perform the procedure in [Cluster Creation Fails](troubleshoot-issues.html#cluster-create-fail).
1. To clean up failed BOSH deployments related to failed clusters, perform the procedure in [Cannot Re-Create a Cluster that Failed to Deploy](troubleshoot-issues.html#cluster-recreate-fails).
1. After fixing and cleaning up any failed clusters, view your deployed clusters again by running `tkgi clusters`.
For more information about troubleshooting and fixing failed clusters, see the [Knowledge Base](https://support.broadcom.com/).
##<a id='unique-hostname'></a> Verify Kubernetes Clusters Have Unique External Hostnames
Verify that existing Kubernetes clusters have unique external hostnames by checking for multiple
Kubernetes clusters with the same external hostname. Perform the following steps:
1. Log in to the TKGI CLI. For more information, see
[Logging in to Tanzu Kubernetes Grid Integrated Edition](login.html). You must log in with an account that has the
UAA scope of `pks.clusters.admin`. For more information about UAA scopes, see
[Managing Tanzu Kubernetes Grid Integrated Edition Users with UAA](manage-users.html).
1. View your deployed TKGI clusters by running the following command:
```
tkgi clusters
```
1. For each deployed cluster, run `tkgi cluster CLUSTER-NAME` to view the details of the cluster.
For example:
```console
$ tkgi cluster my-cluster
```
Examine the output to verify that the `Kubernetes Master Host` is unique for each cluster.
##<a id='tkgi-proxy'></a> Verify TKGI Proxy Configuration
Verify your current TKGI proxy configuration by performing the following steps:
1. Check whether an existing proxy is enabled:
1. Log in to Ops Manager.
1. Click the **VMware Tanzu Kubernetes Grid Integrated Edition** tile.
1. Click **Networking**.
1. If **HTTP/HTTPS Proxy** is **Disabled**, no action is required. Continue to the next section.
If **HTTP/HTTPS Proxy** is **Enabled**, continue to the next step.
1. Verify the **No Proxy** field values do not contain an underscore character, for example, `my_host.mydomain.com`.
<p class="note warning"><strong>Warning</strong>: An underscore character in the <strong>No Proxy</strong> field can cause your upgrade to fail.
If an existing <strong>No Proxy</strong> field value contains an underscore character, or you plan to add a value containing an underscore, contact Support.</p>
##<a id='check-poddisruptionbudget-value'></a> Check PodDisruptionBudget Value
Tanzu Kubernetes Grid Integrated Edition upgrades can run without ever completing if any Kubernetes app has a `PodDisruptionBudget`
with `maxUnavailable` set to `0`.
To ensure that no apps have a `PodDisruptionBudget` with
`maxUnavailable` set to `0`:
1. Run the following `kubectl` command to verify the `PodDisruptionBudget` as the cluster administrator:
```
kubectl get poddisruptionbudgets --all-namespaces
```
1. Examine the output to verify that no app displays `0` in the `MAX UNAVAILABLE` column.
## <a id="configure-node-drain"></a> (Optional) Configure Node Drain Behavior
During the Tanzu Kubernetes Grid Integrated Edition upgrade process, worker nodes are cordoned and drained.
Workloads can prevent worker nodes from draining and cause the upgrade to fail or hang.
To prevent hanging cluster upgrades, you can configure default node drain behavior using the following methods:
* [Configure with the TKGI Tile](#node-drain-tile)
* [Configure with the TKGI CLI](#node-drain-cli)
The new default behavior takes effect during the next upgrade,
not immediately after configuring the behavior.
### <a id="node-drain-tile"></a> Configure with the TKGI Tile
To configure node drain behavior in the Tanzu Kubernetes Grid Integrated Edition tile,
see <a href="./troubleshoot-issues.html#upgrade-drain-hangs">Worker Node Hangs Indefinitely</a>
in <i>Troubleshooting</i>.</p>
### <a id="node-drain-cli"></a> Configure with the TKGI CLI
To configure default node drain behavior with the TKGI CLI:
1. View the current node drain behavior by running the following command:
```
tkgi cluster CLUSTER-NAME --details
```
Where `CLUSTER-NAME` is the name of your cluster.
<br><br>
For example:
```
$ tkgi cluster my-cluster --details
Name: my-cluster
Plan Name: small
UUID: f55ed6c4-c0a7-451d-b735-56c89fdb2ad7
Last Action: CREATE
Last Action State: succeeded
Last Action Description: Instance provisioning completed
Kubernetes Master Host: my-cluster.tkgi.local
Kubernetes Master Port: 8443
Worker Nodes: 3
Kubernetes Master IP(s): 10.196.219.88
Network Profile Name:
Kubernetes Settings Details:
Set by Cluster:
Kubelet Node Drain timeout (mins) (kubelet-drain-timeout): 10
Kubelet Node Drain grace-period (mins) (kubelet-drain-grace-period): 10
Kubelet Node Drain force (kubelet-drain-force): true
Set by Plan:
Kubelet Node Drain force-node (kubelet-drain-force-node): true
Kubelet Node Drain ignore-daemonsets (kubelet-drain-ignore-daemonsets): true
Kubelet Node Drain delete-local-data (kubelet-drain-delete-local-data): true
```
1. If you are updating a cluster that uses a public cloud CSI driver,
see [Limitations on Using a Public Cloud CSI Driver](release-notes.html#1-15-0-csi-driver-limits)
in _Release Notes_ for additional requirements.
1. Configure the default node drain behavior by running the following command:
```
tkgi update-cluster CLUSTER-NAME FLAG
```
Where:
* `CLUSTER-NAME` is the name of your cluster.
* `FLAG` is an action flag for updating the node drain behavior.
For example:
```console
$ tkgi update-cluster my-cluster --kubelet-drain-timeout 1 --kubelet-drain-grace-period 5
Update summary for cluster my-cluster:
Kubelet Drain Timeout: 1
Kubelet Drain Grace Period: 5
Are you sure you want to continue? (y/n): y
Use 'tkgi cluster my-cluster' to monitor the state of your cluster
```
For a list of the available action flags for setting node drain behavior,
see [tkgi update-cluster](cli/index.html#update-cluster) in _TKGI CLI_.