vineethac.blogspot.com

Sunday, September 11, 2022

vSphere with Tanzu using NSX-T - Part19 - Troubleshooting TKC stuck at creating phase

This article provides basic troubleshooting steps for TKCs (Tanzu Kubernetes Cluster) stuck at creating phase.

Verify status of the TKC

Use the following commands to verify the TKC status.

kubectl get tkc -n <supervisor_namespace>
kubectl get tkc -n <supervisor_namespace> -o json
kubectl describe tkc <tkc_name> -n <supervisor_namespace>
kubectl get cluster-api -n <supervisor_namespace>
kubectl get vm,machine,wcpmachine -n <supervisor_namespace>

Cluster health

Verify health of the supervisor cluster.

❯ kubectl get node
NAME                               STATUS   ROLES                  AGE   VERSION
4201a7b2667b0f3b021efcf7c9d1726b   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
4201bead67e21a8813415642267cd54a   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
4201e0e8e29b0ddb4b59d3165dd40941   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
wxx-08-r02esx13.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx14.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx15.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx16.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx17.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx18.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx19.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx20.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx21.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx22.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx23.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx24.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
❯
❯ kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

Terminating namespaces

Check for namespaces stuck at terminating phase. If there are any, properly clean them up by removing all child objects.
You can use this kubectl get-all plugin to see all resources under a namespace. Then clean them up properly. Mostly you need to set finalizers of remaining child resources to null. Following is a sample case where 2 PVCs where stuck at terminating and they were cleaned up by setting its finalizers to null.

❯ kg ns | grep Terminating
rgettam-gettam                                  Terminating   226d
❯
❯ k get-all -n rgettam-gettam
NAME                                                                                             NAMESPACE       AGE
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d  rgettam-gettam  86d
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f  rgettam-gettam  86d
❯
❯ kg pvc -n rgettam-gettam
NAME                                                                        STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d   Terminating   pvc-bd4252fb-bfed-4ef3-ab5a-43718f9cbed5   8Gi        RWO            sxx-01-vcxx-wcp-mgmt   86d
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f   Terminating   pvc-8bc9daa1-21cf-4af2-973e-af28d66a7f5e   30Gi       RWO            sxx-01-vcxx-wcp-mgmt   86d
❯
❯ kg pvc -n rgettam-gettam --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n rgettam-gettam pvc {} -p '{"metadata":{"finalizers": null}}'

You can also do kubectl get namespace <namespace> -oyaml and the status section will show if there are resources/ content to be deleted or any finalizers remaining.
Verify vmop-controller pod logs, and restart them if required.

IP_BLOCK_EXHAUSTED

Check CIDR usage of the supervisor cluster.

❯ kg clusternetworkinfos
NAME                                                AGE
domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4   160d
❯
❯ kg clusternetworkinfos domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4 -o json | jq .usage
{
  "egressCIDRUsage": {
    "allocated": 33,
    "total": 1024
  },
  "ingressCIDRUsage": {
    "allocated": 42,
    "total": 1024
  },
  "subnetCIDRUsage": {
    "allocated": 832,
    "total": 1024
  }
}

When the IP blocks of supervisor cluster are exhausted, you will find the following warning when you describe the TKC.

 Conditions:
    Last Transition Time:  2022-10-05T18:34:35Z
    Message:               Cannot realize subnet
    Reason:                ClusterNetworkProvisionFailed
    Severity:              Warning
    Status:                False
    Type:                  Ready

Also when you check the namespace, you can see the following ncp error IP_BLOCK_EXHAUSTED.

 ❯ kg ns tsql-integration-test -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    calaxxxx.xxxyy.com/xxxrole-created: "1"
    ncp/error: IP_BLOCK_EXHAUSTED
    ncp/router_id: t1_d0a2af0f-8430-4250-9fcf-807a4afe51aa_rtr
    vmware-system-resource-pool: resgroup-307480
    vmware-system-vm-folder: group-v307481
  creationTimestamp: "2022-10-05T17:35:18Z"

Notes:

If the subnetCIDRUsage IP block is exhausted, you may need to remove some old/ unused namespaces, and that will release some IPs. If that is not possible, you may need to consider adding new subnet.

After removing the old/ unused namespaces, and even if IPs are available, sometimes the TKCs will be stuck at creating phase! In that case, check the ncp, vmop, and capw controller pods and you may need to restart them. What I observed is usually after restart of ncp pod, vmop-controller pods, and all pods under vmware-system-capw namespaces the VMs will start getting deployed and the TKC creation will progress and complete successfully.

Resource availability

Check whether there are enough resources available in the cluster.

LAST SEEN  TYPE   REASON       OBJECT                    MESSAGE
3m23s    Warning  UpdateFailure   virtualmachine/magna3-control-plane-9rhl4   The host does not have sufficient CPU resources to satisfy the reservation.
80s     Warning  ReconcileFailure  wcpmachine/magna3-control-plane-s5s9t-p2cxj  vm is not yet powered on: vmware-system-capw-controller-manager/WCPMachine//chakravartha-magna3/magna3/magna3-control-plane-s5s9t-p2cxj

Check for resource limits applied to the namespace.

Check whether storage policy is assigned to the namespace

27m         Warning   ReconcileFailure               wcpmachine/gc-pool-0-cv8vz-5snbc          admission webhook "default.validating.virtualmachine.vmoperator.xxxyy.com" denied the request: StorageClass wdc-10-vc21c01-wcp-pod is not assigned to any ResourceQuotas in namespace mpereiramaia-demo2

In this case, the storage policy wasnt assigned to the ns. I assigned the storage policy wdc-10-vc21c01-wcp-pod to the respective namespace, and the TKC deployment was successful.

Check Content library can sync properly

Sometimes issues related to CL can cause TKCs to get stuck at creating phase! Check this blog post for more details.

KCP can't remediate

Message:               KCP can't remediate if current replicas are less or equal then 1
Reason:                WaitingForRemediation @ Machine/gc-control-plane-zpssc
Severity:              Warning

In this case, you can just edit the TKC spec, change the control plane vmclass to a different class and save. Once the deployment is complete and TKC is running, edit the TKC spec again and revert the vmclass that you modified earlier to its original class. This process will re-provision the control plane.

TKC VMs waiting for IP

In this case, take a look at NSXT and check whether all Edge nodes are healthy. If there are mismatch errors, resolve them.
You may also check ncp pod logs and restart ncp pod if required.

VirtualMachineClassBindingNotFound

Conditions:
    Last Transition Time:  2021-05-05T18:19:10Z
    Message:               1 of 2 completed
    Reason:                VirtualMachineClassBindingNotFound @ Machine/tkc-dev-control-plane-wxd57
    Severity:              Error
    Status:                False
    Message:               0/1 Control Plane Node(s) healthy. 0/2 Worker Node(s) healthy
Events:
  Normal  PhaseChanged  7m22s  vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller  cluster changes from creating phase to failed phase

This happens when the virtualmachineclassbindings are missing and can be resolved by adding all/ required VM Class to the Namespace using the vSphere Client. Following are the steps to add VM Classes to a namespace:

Log into vCenter web UI
From Hosts and Clusters > Select the namespace > Summary tab > VM Service tile > Click Manage VM Classes
Select all required VM Classes and click OK

Verify NSX-T objects

Issues at the NSX-T side can also cause the TKC to be stuck at creating phase. Following is a sample case and you can see these logs when you describe the TKC:

Message: 2 errors occurred:

   * failed to configure DNS for /, Kind= namespace-test-01/gc: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found
   * failed to configure kube-proxy for /, Kind= namespace-test-01/gc: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found

In this case, these were some issues with the virtual servers in loadbalancer. Some stale entries of virtual servers were still present and their IP didn't get removed properly and it was causing some intermittent connectivity issues to some of the other services of type loadbalancer. And, new TKC deployment within that affected namespace also gets stuck due to this. In our case we deleted the affected namespace, and recreated it, that cleaned up all those virtual server state entries and the load balancer, and new TKC deployments were successful. So it will be worth to check on the health and staus of NSX-T objects in case you have TKC deployment issues.

Check for broken TKCs in the cluster

Sometimes the TKC deployments are very slow and takes more than 30 minutes. In this case, you may notice that the first control plane VM will get deployed in like 30-45 minutes after the TKC creation has started. Look for vmop controller logs. Following is sample log:

❯ kail -n vmware-system-vmop
vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:44.725620       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.9.212:6443: connect: connection refused" "vmName"="ciroscosta-cartographer/kontinue-control-plane-svlk4" "result"=-1

vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:49.888653       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.2.66:6443: connect: connection refused" "vmName"="whaozhe-platform/gc-control-plane-mf4p5" "result"=-1

In the above case, two of the TKCs were broken/ stuck at updating phase and we were unable to connect to its control plane.

ciroscosta-cartographer    kontinue    updating       2021-10-29T18:47:46Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     2
whaozhe-platform           gc          updating       2022-01-27T03:59:31Z   v1.20.12+vmware.1-tkg.1.b9a42f3   1     10

After removing the namespaces with broken TKCs, new deployments were completing succesfully.

Restart system pods

Sometimes restart of some of the system controller pods resoves the issue. I usually delete all the pods of the following namespaces and they will get restarted in a few seconds.

k delete pod --all --namespace=vmware-system-vmop
k delete pod --all --namespace=vmware-system-capw
k delete pod --all --namespace=vmware-system-tkg
k delete pod --all --namespace=vmware-system-csi
k delete pod --all --namespace=vmware-system-nsx

Hope this was useful. Cheers!

Saturday, August 13, 2022

vSphere with Tanzu using NSX-T - Part18 - Troubleshooting vSphere pods with ProviderFailed status

In this article, we will take a look at fixing vSphere pods with ProviderFailed status. Following is an example:

svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-5gn2n                    0/1     ProviderFailed     0          2d14h
svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-5jtvj                    0/1     ProviderFailed     0          2d13h
svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-5whtt                    0/1     ProviderFailed     0          2d14h
svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-6p2zv                    0/1     ProviderFailed     0          2d13h
svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-7r92p                    0/1     ProviderFailed     0          2d14h

When describing the pod, you can see the message "Unable to find backing for logical switch".

❯ 
❯ kd po gatekeeper-controller-manager-5ccbc7fd79-5gn2n -n svc-opa-gatekeeper-domain-c61
Name:                 gatekeeper-controller-manager-5ccbc7fd79-5gn2n
Namespace:            svc-opa-gatekeeper-domain-c61
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 esx-1.sddc-35-82-xxxxx.xxxxxxx.com/
Labels:               control-plane=controller-manager
                      gatekeeper.sh/operation=webhook
                      gatekeeper.sh/system=yes
                      pod-template-hash=5ccbc7fd79
Annotations:          attachment_id: 668b681b-fef6-43e5-8009-5ac8deb6da11
                      kubernetes.io/psp: wcp-default-psp
                      mac: 04:50:56:00:08:1e
                      vlan: None
                      vmware-system-ephemeral-disk-uuid: 6000C297-d1ba-ce8c-97ba-683a3c8f5321
                      vmware-system-image-references: {"manager":"gatekeeper-111fd0f684141bdad12c811b4f954ae3d60a6c27-v52049"}
                      vmware-system-vm-moid: vm-89777:750f38c6-3b0e-41b7-a94f-4d4aef08e19b
                      vmware-system-vm-uuid: 500c9c37-7055-1708-92d4-8ffdf932c8f9
Status:               Failed
Reason:               ProviderFailed
Message:              Unable to find backing for logical switch 03f0dcd4-a5d9-431e-ae9e-d796ddca0131: timed out waiting for the condition Unable to find backing for logical switch: 03f0dcd4-a5d9-431e-ae9e-d796ddca0131
IP:
IPs:                  <none>

A workaround for this is to restart the spherelet service on the ESXi host where you see this issue. If there are multiple ESXi nodes having same issue, you could consider restarting the spherelet service on all ESXi worker nodes. In a production setup you may want to consider placing the ESXi in maintenance mode before restarting the spherelet service. In my case, we usually restart the spherelet service directly without placing the ESXi in MM. Following is the PowerCLI way to check/ restart spherelet service on ESXi worker nodes:

> Connect-VIServer wdc-10-vc21

> Get-VMHost | Get-VMHostService | where {$_.Key -eq "spherelet"}  | select VMHost,Key,Running | ft

VMHost                        Key       Running
------                        ---       -------
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True

> $sphereletservice = Get-VMHost wdc-10-r0xxxxxxxxxxxxxxxxxxxx | Get-VMHostService | where {$_.Key -eq "spherelet"}
> Stop-VMHostService -HostService $sphereletservice

Perform operation?
Perform operation Stop host service. on spherelet?
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y

Key                  Label                          Policy     Running  Required
---                  -----                          ------     -------  --------
spherelet            spherelet                      on         False    False

> Get-VMHost wdc-10-r0xxxxxxxxxxxxxxxxxxxx | Get-VMHostService | where {$_.Key -eq "spherelet"}

Key                  Label                          Policy     Running  Required
---                  -----                          ------     -------  --------
spherelet            spherelet                      on         False    False

> Start-VMHostService -HostService $sphereletservice

Key                  Label                          Policy     Running  Required
---                  -----                          ------     -------  --------
spherelet            spherelet                      on         True     False

To restart spherelet service on all ESXi worker nodes of a cluster:

> Get-Cluster

Name                           HAEnabled  HAFailover DrsEnabled DrsAutomationLevel
                                          Level
----                           ---------  ---------- ---------- ------------------
wdc-10-vcxxc01                 True       1          True       FullyAutomated

> Get-Cluster -Name wdc-10-vcxxc01 | Get-VMHost | foreach { Restart-VMHostService -HostService ($_ | Get-VMHostService | where {$_.Key -eq "spherelet"}) }

After restarting the spherelet service, new pods will come up fine and be in Running status. But you may need to clean up all those pods with ProviderFailed status using kubectl.

kubectl get pods -A | grep ProviderFailed | awk '{print $2 " --namespace=" $1}' | xargs kubectl delete pod

Hope it was useful. Cheers!

Saturday, July 30, 2022

vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase

Ideally if everything goes well the TKCs (Tanzu Kubernetes Cluster aka Guest Cluster) should be in running phase. But sometimes due to several reasons it may be stuck at updating phase. In this article, we will take a sample case and look at troubleshooting/ fixing it.

Following is an example:

NAMESPACE              NAME                    PHASE      CREATIONTIME           VERSION                           CP    WORKER
karvea-vc17ns11        sc201vc17pace           updating   2021-11-19T12:17:24Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     4

Lets connect to this TKC. Here I have a small plugin (kubectl-gckc) that generates the TKC kubeconfig and gcc is alias to KUBECONFIG=gckubeconfig, where gckubeconfig is the TKC admin kubeconfig file.

❯ k gckc karvea-vc17ns11 sc201vc17pace
❯ gcc kg no
NAME                                           STATUS                     ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                      control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    Ready,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    Ready,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                      <none>                 139d   v1.20.9+vmware.1

❯ kg vm -n karvea-vc17ns11
NAME                                           POWERSTATE   AGE
sc201vc17pace-control-plane-zt99l              poweredOn    139d
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    poweredOn    189d
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    poweredOn    189d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   poweredOn    139d



❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    vsphere://42010982-8b25-ad7b-2a1d-bb949def4834   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1

As you can see above, there are two worker machines that are stuck at Deleting phase. It is because the corresponding two worker nodes are at Ready, SchedulingDisabled status. The nodes are not drained yet due to some reason. Once they get drained properly, its status will be changed to NotReady, SchedulingDisabled. Now lets try to drain those worker nodes manually.

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz", aborting command...

There are pending nodes to be drained:
 sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
cannot delete Pods with local storage (use --delete-emptydir-data to override): nsxi-platform/kafka-2
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
❯ gcc kg pdb
No resources found in default namespace.
❯ gcc kg pdb -A
NAMESPACE       NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nsxi-platform   kafka       N/A             1                 0                     188d
nsxi-platform   zookeeper   N/A             1                 1                     188d

Here this worker node sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz is not getting drained because of the presence of a pod disruption budget (pdb). So, in-order to drain the node, I am taking a back up of the pdb yaml file and delete it. And once the nodes are drained, I will apply the pdb yaml back on to the cluster.

❯ gcc kg pdb -n nsxi-platform kafka -oyaml > pdb-nsxi-platform-kafka.yaml
❯ code pdb-nsxi-platform-kafka.yaml
❯ gcc kg pdb -n nsxi-platform zookeeper -oyaml > pdb-nsxi-platform-zookeeper.yaml
❯ code pdb-nsxi-platform-zookeeper.yaml

❯ gcc k delete pdb kafka -n nsxi-platform
poddisruptionbudget.policy "kafka" deleted
❯ gcc kg pdb -A
NAMESPACE       NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nsxi-platform   zookeeper   N/A             1                 1                     188d
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
pod/kafka-2 evicted
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz evicted
❯

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz drained


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw", aborting command...

There are pending nodes to be drained:
 sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw --ignore-daemonsets
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw drained

The worker nodes are now drained.

❯ gcc kg no
NAME                                           STATUS                        ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                         control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                         <none>                 139d   v1.20.9+vmware.1

❯ gcc kg no
NAME                                           STATUS                        ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                         control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                         <none>                 139d   v1.20.9+vmware.1

As soon as the worker nodes are drained, one of them got successfully removed/ deleted, but the other worker node is still present. When we look at the machine resource, you can still see one of the worker machine is still stuck at Deleting phase. In this case I've manually deleted the worker node, still the corresponding worker machine is stuck at Deleting phase.

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1


❯ gcc k delete node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw" deleted
❯
❯ gcc kg no
NAME                                           STATUS   ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready    control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready    <none>                 139d   v1.20.9+vmware.1

Now lets describe the worker machine stuck at Deleting. In this case you can see that there are two PVCs stuck at Terminating status. So I just edited those two PVCs yaml and set finalizer to null.

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1



❯ kg vm -n karvea-vc17ns11
NAME                                           POWERSTATE   AGE
sc201vc17pace-control-plane-zt99l              poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   poweredOn    139d


❯ kd machine sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw -n karvea-vc17ns11

Events:
  Type    Reason                  Age                   From                           Message
  ----    ------                  ----                  ----                           -------
  Normal  DetectedUnhealthy       13m (x2 over 17m)     machinehealthcheck-controller  Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has unhealthy node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
  Normal  SuccessfulDrainNode     13m (x2 over 19m)     machine-controller             success draining Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
  Normal  NodeVolumesDetached     12m (x2 over 19m)     machine-controller             success waiting for node volumes detach Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
  Normal  MachineMarkedUnhealthy  106s (x4 over 9m58s)  machinehealthcheck-controller  Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has been marked as unhealthy
❯
❯ kg pvc -n karvea-vc17ns11
NAME                                                                        STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb   Bound         pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6   Bound         pvc-97e6e063-9a9e-4837-9999-284523379453   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624   Bound         pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811   Bound         pvc-faa7798e-c045-420f-9d09-44674d9d2326   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950   Bound         pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447   Bound         pvc-49fca2f0-3402-429f-884f-7db9012934d6   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0   Bound         pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c   Bound         pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa   Bound         pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009   Bound         pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c   Bound         pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
sc201vc17pace-workers-wswdh-2hz8w-containerd                                Bound         pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-5pjrc-containerd                                Bound         pvc-fb162388-4347-4f48-825e-c2c2d62ceb90   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-755m6-containerd                                Terminating   pvc-da2e4866-bb41-4f74-a4b7-0f74bc7061a1   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   189d
sc201vc17pace-workers-wswdh-dgmjs-containerd                                Terminating   pvc-64eac528-f160-444c-9a0f-0ed9f6393e06   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   189d
sc201vc17pace-workers-wswdh-djp2m-containerd                                Bound         pvc-a7542552-de13-4670-ac45-84ed39c3c916   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-flwtt-containerd                                Bound         pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
❯

As soon as the PVCs are removed, you can see the worker machine that was stuck at Deleting got removed, and the TKC chaged its status to running.

❯ kg pvc -n karvea-vc17ns11
NAME                                                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb   Bound    pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6   Bound    pvc-97e6e063-9a9e-4837-9999-284523379453   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624   Bound    pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811   Bound    pvc-faa7798e-c045-420f-9d09-44674d9d2326   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950   Bound    pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447   Bound    pvc-49fca2f0-3402-429f-884f-7db9012934d6   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0   Bound    pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c   Bound    pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa   Bound    pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009   Bound    pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c   Bound    pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
sc201vc17pace-workers-wswdh-2hz8w-containerd                                Bound    pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-5pjrc-containerd                                Bound    pvc-fb162388-4347-4f48-825e-c2c2d62ceb90   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-djp2m-containerd                                Bound    pvc-a7542552-de13-4670-ac45-84ed39c3c916   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-flwtt-containerd                                Bound    pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE     AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running   139d   v1.20.9+vmware.1

❯ kgtkca | grep karvea
karvea-vc17ns11                             sc201vc17pace           running    2021-11-19T12:17:24Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     4

Note: The above case is a sample scenario and the reasons why the TKC is stuck at updating may vary based on several conditions. This is a generic method one can follow while approaching these kind of issues.

Hope it was useful. Cheers!

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

In this article, we will take a look at troubleshooting some of the content library related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.

Case 1:

TKC (guest K8s cluster) deployments failing as VMs were not getting deployed. You can see Failed to deploy OVF package error in the VC UI. This was due to error A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com while syncing content library.

Following is a sample log for this issue from the vmop-controller-manger:

Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error

This can be resolved by just editing the content library and accepting new certificate thumbprint.

Case 2:

Missing TKRs. Even though CL is present in the VC and will have all required OVF Templates, on the supervisor cluster TKR resources will be missing/ not found.

❯ kubectl get tkr
No resources found

This could happen if there are duplicate content libraries present in the VC with same Subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronize the CL.

If this doesn't resolve the issue, try to delete and recreate the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.

You may also verify the vmware-system-vmop-controller-manager pod logs and capw-controller-manager pod logs. Check if those pods are running, or getting continuously restarted. If required you may restart those pods.

Case 3:

TKC deployments failing as VMs were not getting deployed. Sample vmop-controller-manger logs given below:

E0803 18:51:30.638787       1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638821       1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638851       1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.639301       1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"

This could be resolved by restarting the cm-inventory service on all nsx-t manager nodes. Following are the commands to restart cm-inventory service on NSX-T manager nodes:

get service cm-inventory  
restart service cm-inventory

Case 4:

Sometimes in the WCP K8s layer you will notice some stale contentsources object entries. Contentsources are the corresponding objects of content libraries in K8s layer. Due to some reasons/ requirements you might have created multiple content libraries, and you may have delete some of them at later point of time from the vCenter, but they may not be removed properly from the WCP K8s layer and thats how these stale contentsources objects are found. You can use PowerCLI to list the current content libraries present in the VC, compare it with the contentsources and remove the stale entries.

> Get-ContentLibrary | select Name,Id | fl

Name : wdc-01-vc18c01-wcp
Id   : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d

> kg contentsources
NAME                                   AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db   173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d   321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662   242d
75e0668c-0cdc-421e-965d-fd736187cc57   173d
818c8700-efa4-416b-b78f-5f22e9555952   173d
9abbd108-aeb3-4b50-b074-9e6c00473b02   173d
a6cd1685-49bf-455f-a316-65bcdefac7cf   173d
acff9a91-0966-4793-9c3a-eb5272b802bd   242d
fcc08a43-1555-4794-a1ae-551753af9c03   173d

In the above sample case you can see multiple contentsource objects, but there is only one content library. So you can delete all the contentsource objects, except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.

Hope it was useful. Cheers!

Saturday, June 25, 2022

GitOps using Argo CD - Part1

In this article we will see how to use Git and Argo CD to deploy/ manage applications on your Kubernetes cluster. Before that, what is GitOps? Simply put, Operations driven using Git and CD tools! Following are the major components of GitOps:

Infrastructure as code (IaC)
Merge/ pull requests as change agent
Continuous Delivery tool (Example: Argo CD)

Basically you keep all your application deployment manifests in a Git repository. And, if you like to make changes to your application, you create a merge/ pull request. Once it is approved and merged to the main branch a continuous delivery/ deployment tool like Argo CD will identify that and deploys the latest change to the target Kubernetes cluster. Here I am using a Tanzu Kubernetes Cluster with 1 control plane node and 3 worker nodes. I will be deploying Argo CD as well as my application on this K8s cluster.

❯ kubectl get node
NAME                                STATUS   ROLES                  AGE   VERSION
gc-control-plane-rhpmq              Ready    control-plane,master   34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-692n5   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-jfzrs   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-xvjsh   Ready    <none>                 34d   v1.21.6+vmware.1

Lets create a namespace first.

❯ kubectl create namespace argocd
namespace/argocd created

❯ kubectl apply -f https://gist.githubusercontent.com/vineethac/dafa5b47afd674a1a9f7be2ce773a2bd/raw/4591e837098043b8095a5c48614e1c94b5ca2b44/tkg-psp.yml
clusterrole.rbac.authorization.k8s.io/psp:privileged created
clusterrolebinding.rbac.authorization.k8s.io/all:psp:privileged created

Following yaml file will install Argo CD:

❯ kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

❯ kubectl get all -n argocd
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/argocd-application-controller-0                     1/1     Running   0          20h
pod/argocd-applicationset-controller-5f7d8fffb7-82xgp   1/1     Running   0          20h
pod/argocd-dex-server-75f7cff9cd-7vc64                  1/1     Running   0          20h
pod/argocd-notifications-controller-69bf646f87-8bt5n    1/1     Running   0          20h
pod/argocd-redis-748569f956-bskfw                       1/1     Running   0          19h
pod/argocd-repo-server-8699756b5d-7qmx2                 1/1     Running   0          20h
pod/argocd-server-6dd9cd7964-gbfm4                      1/1     Running   0          20h

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-applicationset-controller          ClusterIP   10.106.103.231   <none>        7000/TCP,8080/TCP            20h
service/argocd-dex-server                         ClusterIP   10.103.79.207    <none>        5556/TCP,5557/TCP,5558/TCP   20h
service/argocd-metrics                            ClusterIP   10.105.254.212   <none>        8082/TCP                     20h
service/argocd-notifications-controller-metrics   ClusterIP   10.97.254.140    <none>        9001/TCP                     20h
service/argocd-redis                              ClusterIP   10.97.244.161    <none>        6379/TCP                     20h
service/argocd-repo-server                        ClusterIP   10.101.181.242   <none>        8081/TCP,8084/TCP            20h
service/argocd-server                             ClusterIP   10.105.76.149    <none>        80/TCP,443/TCP               20h
service/argocd-server-metrics                     ClusterIP   10.100.168.241   <none>        8083/TCP                     20h

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/argocd-applicationset-controller   1/1     1            1           20h
deployment.apps/argocd-dex-server                  1/1     1            1           20h
deployment.apps/argocd-notifications-controller    1/1     1            1           20h
deployment.apps/argocd-redis                       1/1     1            1           20h
deployment.apps/argocd-repo-server                 1/1     1            1           20h
deployment.apps/argocd-server                      1/1     1            1           20h

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/argocd-applicationset-controller-5f7d8fffb7   1         1         1       20h
replicaset.apps/argocd-dex-server-75f7cff9cd                  1         1         1       20h
replicaset.apps/argocd-notifications-controller-69bf646f87    1         1         1       20h
replicaset.apps/argocd-redis-748569f956                       1         1         1       19h
replicaset.apps/argocd-repo-server-8699756b5d                 1         1         1       20h
replicaset.apps/argocd-server-6dd9cd7964                      1         1         1       20h

NAME                                             READY   AGE
statefulset.apps/argocd-application-controller   1/1     20h

Lets port forward, so that you can access the Argo CD web UI.

❯ kubectl port-forward -n argocd service/argocd-server 8080:443
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Now you can access Argo CD in your web browser at https://localhost:8080/

username: admin
password: you can decode it from the following secret

❯ kubectl get secret argocd-initial-admin-secret -n argocd -o yaml
apiVersion: v1
data:
password: TTFSRXJJZ0NKbHc2Y2JINA==
kind: Secret
metadata:
creationTimestamp: "2022-06-22T13:00:14Z"
name: argocd-initial-admin-secret
namespace: argocd
resourceVersion: "39224285"
uid: d3b7c82e-0c92-4418-95eb-95a73fe674b6
type: Opaque

❯ echo TTFSRXJJZ0NKbHc2Y2JINA== | base64 --decode

Next step is to create a repository and add your application yaml files to it. You will also need to create a Argo CD application yaml file, push it to the repository, and then apply the same application yaml file to your cluster.

Following is a screenshot of my repo:

We have the application.yaml file and then inside the dev folder I have two yaml files.

Now, we also have the application yaml file. Make sure to paste your git repo url in the repoURL field.

Lets apply the application yaml manifest on to the Kubernetes cluster and that will connect Argo CD with your git repo.

Note: If you using a private repo, then you need to add your repository to Argo CD first and connect with respective credentials.

> kubectl apply -f /Users/vineetha/myrepo/argocd/application.yaml

Once the application yaml is applied, after few seconds you can see nginx pod and svc getting deployed automatically under myapp namespace.

❯ kubectl get pods,deployment,svc -n myapp
NAME                            READY   STATUS    RESTARTS   AGE
pod/my-nginx-74d7c6cb98-tzdfp   1/1     Running   0          2d

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-nginx   1/1     1            1           2d

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/my-nginx   ClusterIP   10.109.67.35   <none>        80/TCP    2d

From now on if you like to make changes to your application, say, you want to scale the replicas from 1 to 3, you have to create a merge request or pull request to the repository with the respective change, and once its approved and merged to the main branch Argo CD will automatically detect it and pull it and apply it to your Kubernetes cluster.

Hope it was useful. Cheers!

Reference video by Nana https://twitter.com/Njuchi_

Saturday, June 11, 2022

Working with Kubernetes using Python - Part 04 - Get namespaces

Following code snipet uses Python client for the kubernetes API to get namespace details from a given context:

from kubernetes import client, config
import argparse


def load_kubeconfig(context_name):
    config.load_kube_config(context=f"{context_name}")
    v1 = client.CoreV1Api()
    return v1


def get_all_namespace(v1):
    print("Listing namespaces with their creation timestamp, and status:")
    ret = v1.list_namespace()
    for i in ret.items:
        print(i.metadata.name, i.metadata.creation_timestamp, i.status.phase)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    args = parser.parse_args()

    context = args.context
    v1 = load_kubeconfig(context)
    get_all_namespace(v1)


if __name__ == "__main__":
    main()

Pages

Sunday, September 11, 2022

vSphere with Tanzu using NSX-T - Part19 - Troubleshooting TKC stuck at creating phase

Verify status of the TKC

Cluster health

Terminating namespaces

IP_BLOCK_EXHAUSTED

Resource availability

Check whether storage policy is assigned to the namespace

Check Content library can sync properly

KCP can't remediate

TKC VMs waiting for IP

VirtualMachineClassBindingNotFound

Verify NSX-T objects

Check for broken TKCs in the cluster

Restart system pods

Saturday, August 13, 2022

vSphere with Tanzu using NSX-T - Part18 - Troubleshooting vSphere pods with ProviderFailed status

Saturday, July 30, 2022

vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

Following is a sample log for this issue from the vmop-controller-manger:

Case 3:

Saturday, June 25, 2022

GitOps using Argo CD - Part1

Saturday, June 11, 2022

Working with Kubernetes using Python - Part 04 - Get namespaces