Here are some pointers.
The main workflow while upgrading the entire stack is:
- upgrade NSX-T
- upgrade vCenter Server
- upgrade ESXi nodes
- upgrade vSAN and vDS (if required)
- upgrade WCP
Note: Make sure you follow the VMware product interoperability/compatibility matrix while performing upgrades in a production environment.
NSX-T
While NSX-T is being upgraded, the host transport nodes also need to be upgraded. NSX components are installed/updated on the ESXi nodes, and the nodes may also need a reboot.
In NSX-T: System > Lifecycle Management > Upgrade > Upgrade NSX > Hosts
Plan
- Upgrade order across groups - Serial
- Pause upgrade condition - When an upgrade unit fails to upgrade
Host Groups
- Upgrade order within group - Serial
- Upgrade mode - Maintenance
Notes:
- There is an In-place upgrade mode for the host groups. In this mode the NSX components on the ESXi node get updated first, and only then is the node placed in maintenance mode (MM) and rebooted. However, we have had cases where the NSX component update on an ESXi node fails and the workload/TKC VMs running on that node lose network connectivity. Fixing this requires rebooting the ESXi node, but while trying to put the node into MM the TKC VMs running on it fail to migrate, leaving a forced reboot of the ESXi node as the only option, which affects the workloads running on it. To avoid this, the safest option is to choose the Maintenance upgrade mode, so that the TKC VMs get migrated off the ESXi node before the NSX components are updated; even if the component update then fails, the node can safely be rebooted because it is already in MM. (A quick way to confirm the MM state from the ESXi shell is included after this list.)
- We have also noticed cases where ESXi nodes in a WCP cluster fail to enter MM. Following are some common cases; hedged command sketches for each are included after this list:
- Orphaned or inaccessible VMs are one of the main causes. You can follow this article to find those and clean them up.
- I have noticed cases where some VMs (vSphere Pods) were present on the ESXi host in a powered-off state, and the ESXi host was stuck at 100% while entering maintenance mode. You will need to check those VMs and delete them if required to proceed further.
- There are cases where some TKC VMs present on the ESXi host fail to get migrated. Check whether these VMs still exist at the Kubernetes layer; if not, delete those VMs.
- Check whether any TKC nodes are still present on the ESXi host. There can be cases where the TKC node(s) are unable to migrate to other available hosts in the cluster due to resource constraints. Check whether the TKC nodes that cannot migrate are using a guaranteed vmclass. The solution is to find unused resources/TKCs in the cluster, check with the owner/user, and delete them if they are no longer needed. Alternatively, check with the user and change the guaranteed vmclass to besteffort if possible, or temporarily reduce the number of worker nodes if the user agrees to do so.
- Verify whether any pods are running on the respective ESXi worker node:
kubectl get pods -A -o wide | grep ESXi-FQDN
Then safely drain the node (see the sketch below).
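
A minimal drain sketch, run against the Supervisor cluster, assuming the node name is the ESXi FQDN shown in kubectl get nodes; the exact flags depend on your kubectl version (older releases use --delete-local-data instead of --delete-emptydir-data):

kubectl get nodes
kubectl drain ESXi-FQDN --ignore-daemonsets --delete-emptydir-data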
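
To confirm that an ESXi node is already in MM before rebooting it (as mentioned in the first note), you can check from the ESXi shell, assuming SSH access to the host:

esxcli system maintenanceMode get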
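
For the orphaned/inaccessible and powered-off VM cases, a rough sketch using govc, assuming it is installed and pointed at your vCenter via GOVC_URL; always confirm with the owner before destroying anything:

govc find / -type m -runtime.connectionState orphaned
govc find / -type m -runtime.connectionState inaccessible
govc find / -type m -runtime.powerState poweredOff
govc vm.info VM-NAME
govc vm.destroy VM-NAME

govc vm.info shows which ESXi host the VM is registered on; govc vm.destroy removes it once you have confirmed it is no longer needed.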
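
To check whether a TKC node VM still exists at the Kubernetes layer, query the Supervisor cluster; NAMESPACE below is a placeholder for the vSphere Namespace the TKC lives in:

kubectl get virtualmachines -n NAMESPACE
kubectl get machines -n NAMESPACE

If a VM seen in vCenter has no matching VirtualMachine/Machine object, it is a leftover and can be removed from the vSphere side.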
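
To check the vmclass and worker node count of a TKC, again on the Supervisor cluster; names are placeholders, and the field names vary with the TKC API version (class/count in v1alpha1, vmClass/replicas in later versions):

kubectl get tkc -A
kubectl get tkc TKC-NAME -n NAMESPACE -o yaml | grep -iE 'class|count|replicas'
kubectl get virtualmachineclasses

Changing the vmclass or reducing the worker count can then be done with kubectl edit tkc TKC-NAME -n NAMESPACE after agreeing with the user.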
Hope it was useful. Cheers!