vineethac.blogspot.com: vmop

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article, we will look at a TKC that is stuck in the updating phase and has multiple Kubernetes nodes in NotReady state.

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
       5

❯ gcc kg no
NAME                                STATUS                        ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready                         control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-tnhv9              NotReady                      control-plane,master   63d    v1.20.9+vmware.1
gc-control-plane-tqvnk              NotReady                      control-plane,master   50d    v1.20.9+vmware.1
gc-control-plane-wsclb              NotReady                      <none>                 8d     v1.20.9+vmware.1
gc-control-plane-wt6sx              NotReady                      <none>                 30d    v1.20.9+vmware.1
gc-control-plane-zthnq              NotReady                      control-plane,master   49d    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                         <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                         <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                         <none>                 458d   v1.20.9+vmware.1

Note: gcc is an alias that I am using for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Let's verify where the etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see that all three etcd pods are on nodes that are in Ready status, and none of them is on a NotReady node. So we can go ahead and safely drain and delete the nodes that are NotReady.
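The comparison above is done by eye. As a minimal sketch, it could also be scripted. The function name etcd_on_notready and the file name notready.txt are hypothetical, and the sketch assumes two-column input (pod name, node name) such as kubectl's custom-columns output produces.

```shell
# Hypothetical helper: list etcd pods scheduled on NotReady nodes.
#   $1     file with one NotReady node name per line
#   stdin  "POD NODE" pairs, one per line
etcd_on_notready() {
  awk '
    NR == FNR { bad[$1] = 1; next }   # first input: NotReady node names
    $1 ~ /^etcd-/ && ($2 in bad) { print $1 " is on NotReady node " $2 }
  ' "$1" -
}

# Usage sketch (gckubeconfig is the TKC kubeconfig mentioned in the note):
#   KUBECONFIG=gckubeconfig kubectl get nodes --no-headers |
#     awk '$2 ~ /^NotReady/ {print $1}' > notready.txt
#   KUBECONFIG=gckubeconfig kubectl get pods -n kube-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
#     etcd_on_notready notready.txt
```

No output means that no etcd pod is placed on a NotReady node.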

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----
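The one-liner above deletes each node even if its drain fails. A slightly more defensive sketch is given below. It is not what was run here. The function name drain_and_delete and the KUBECTL override variable are hypothetical, and the 120s drain timeout is an arbitrary choice.

```shell
# Hypothetical helper: drain, then delete, each node named on stdin.
# KUBECTL is an illustrative override so the loop can be dry-run (KUBECTL=echo).
drain_and_delete() {
  while IFS= read -r node; do
    [ -n "$node" ] || continue          # skip empty lines
    echo "$node"
    if "${KUBECTL:-kubectl}" drain "$node" --ignore-daemonsets --timeout=120s; then
      "${KUBECTL:-kubectl}" delete node "$node"
    else
      echo "drain failed for $node, not deleting" >&2
    fi
    echo "----"
  done
}

# Usage sketch, with KUBECONFIG exported to point at the TKC kubeconfig:
#   echo "$notreadynodes" | drain_and_delete
```

Setting KUBECTL=echo first prints the commands that would be run without touching the cluster.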

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready    control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯
❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01    gc       updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

I waited a few minutes to see whether the reconciliation process would proceed and change the status of the TKC from updating to running, but it was still stuck at the updating phase. So I described the TKC.

Conditions:
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-01-01T19:19:45Z
    Status:                True
    Type:                  AddonsReady
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-07-24T15:53:06Z
    Status:                True
    Type:                  NodePoolsReady
    Last Transition Time:  2022-09-01T09:02:26Z
    Message:               3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
    Status:                True
    Type:                  NodesHealthy
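When the describe output is long, it can help to print only the conditions that are not satisfied. The following is a minimal sketch with a hypothetical function name, failing_conditions. It assumes the layout shown above, where each condition block ends with its Type line.

```shell
# Hypothetical helper: from `kubectl describe tkc` output on stdin, print the
# conditions whose Status is False as "Type: Reason - Message".
failing_conditions() {
  awk '
    $1 == "Message:" { sub(/^[ \t]*Message:[ \t]*/, ""); msg = $0; next }
    $1 == "Reason:"  { sub(/^[ \t]*Reason:[ \t]*/, ""); reason = $0; next }
    $1 == "Status:"  { status = $2; next }
    $1 == "Type:" {
      if (status == "False") print $2 ": " reason " - " msg
      msg = reason = status = ""        # reset for the next condition block
    }
  '
}

# Usage sketch:
#   kubectl describe tkc gc -n jtimothy-napp01 | failing_conditions
```

For the conditions shown above, this would list only Ready and ControlPlaneReady, both with reason RollingUpdateInProgress.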

I then checked the vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

The readiness probe against 172.31.14.6:6443 fails with connection refused, which means something is wrong with the CP node gc-control-plane-2rbsb.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see that the etcd pod on the first control plane node is not ready (0/1) and is getting restarted continuously (811 restarts). So let's try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
 gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME                                STATUS                     ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready,SchedulingDisabled   control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                      <none>                 458d   v1.20.9+vmware.1

Now let's delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted
❯
❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯

After a few minutes, you can see that a new machine and the corresponding node got provisioned, and the TKC changed from the updating phase to the running phase.

❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE          AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running        124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running        123d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc                                                                                             Provisioning   13s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running        458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running        456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running        458d   v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   124d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc        gc-control-plane-dnr66              vsphere://42011228-b156-3338-752a-e7233c9258dd   Running   2m2s   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS     ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready      control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              NotReady   control-plane,master   35s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready      <none>                 458d   v1.20.9+vmware.1


❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              Ready    control-plane,master   53s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01     gc     running      2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

Hope it was useful. Cheers!

Sunday, September 11, 2022

vSphere with Tanzu using NSX-T - Part19 - Troubleshooting TKC stuck at creating phase

This article provides basic troubleshooting steps for TKCs (Tanzu Kubernetes Clusters) stuck at the creating phase.

Verify status of the TKC

Use the following commands to verify the TKC status.

kubectl get tkc -n <supervisor_namespace>
kubectl get tkc -n <supervisor_namespace> -o json
kubectl describe tkc <tkc_name> -n <supervisor_namespace>
kubectl get cluster-api -n <supervisor_namespace>
kubectl get vm,machine,wcpmachine -n <supervisor_namespace>

Cluster health

Verify health of the supervisor cluster.

❯ kubectl get node
NAME                               STATUS   ROLES                  AGE   VERSION
4201a7b2667b0f3b021efcf7c9d1726b   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
4201bead67e21a8813415642267cd54a   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
4201e0e8e29b0ddb4b59d3165dd40941   Ready    control-plane,master   86d   v1.22.6+vmware.wcp.2
wxx-08-r02esx13.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx14.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx15.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx16.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx17.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx18.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx19.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx20.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx21.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx22.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx23.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
wxx-08-r02esx24.xxxxxyyyy.com      Ready    agent                  85d   v1.22.6-sph-db56d46
❯
❯ kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

Terminating namespaces

Check for namespaces stuck at the terminating phase. If there are any, clean them up properly by removing all child objects.
You can use the kubectl get-all plugin to see all resources under a namespace, and then clean them up. In most cases, you need to set the finalizers of the remaining child resources to null. Following is a sample case where 2 PVCs were stuck at terminating, and they were cleaned up by setting their finalizers to null.

❯ kg ns | grep Terminating
rgettam-gettam                                  Terminating   226d
❯
❯ k get-all -n rgettam-gettam
NAME                                                                                             NAMESPACE       AGE
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d  rgettam-gettam  86d
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f  rgettam-gettam  86d
❯
❯ kg pvc -n rgettam-gettam
NAME                                                                        STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d   Terminating   pvc-bd4252fb-bfed-4ef3-ab5a-43718f9cbed5   8Gi        RWO            sxx-01-vcxx-wcp-mgmt   86d
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f   Terminating   pvc-8bc9daa1-21cf-4af2-973e-af28d66a7f5e   30Gi       RWO            sxx-01-vcxx-wcp-mgmt   86d
❯
❯ kg pvc -n rgettam-gettam --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n rgettam-gettam pvc {} -p '{"metadata":{"finalizers": null}}'
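The command above patches every PVC in the namespace. If only the stuck ones should be touched, the list can be filtered first. This is a minimal sketch with a hypothetical function name, terminating_pvcs. It assumes the default table output, where STATUS is the second column.

```shell
# Hypothetical helper: from `kubectl get pvc` table output on stdin (with header),
# print the names of the PVCs whose STATUS column is Terminating.
terminating_pvcs() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Usage sketch:
#   kubectl get pvc -n rgettam-gettam | terminating_pvcs |
#     xargs -I{} kubectl patch pvc {} -n rgettam-gettam -p '{"metadata":{"finalizers":null}}'
```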

You can also run kubectl get namespace <namespace> -oyaml; the status section will show whether there are resources/content still to be deleted or any finalizers remaining.
Also verify the vmop-controller pod logs, and restart the pods if required.

IP_BLOCK_EXHAUSTED

Check CIDR usage of the supervisor cluster.

❯ kg clusternetworkinfos
NAME                                                AGE
domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4   160d
❯
❯ kg clusternetworkinfos domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4 -o json | jq .usage
{
  "egressCIDRUsage": {
    "allocated": 33,
    "total": 1024
  },
  "ingressCIDRUsage": {
    "allocated": 42,
    "total": 1024
  },
  "subnetCIDRUsage": {
    "allocated": 832,
    "total": 1024
  }
}
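To see at a glance how close each block is to exhaustion, the allocated/total pairs can be turned into percentages. The following is only a rough sketch. The function name cidr_usage_percent is hypothetical, and it parses the pretty-printed jq output line by line, exactly as shown above, instead of doing real JSON parsing.

```shell
# Hypothetical helper: read the pretty-printed `jq .usage` output on stdin and
# print "name allocated/total (percent%)" for every *CIDRUsage block.
cidr_usage_percent() {
  awk '
    /CIDRUsage"/  { gsub(/[^A-Za-z]/, ""); name = $0; next }
    /"allocated"/ { gsub(/[^0-9]/, ""); allocated = $0 + 0; next }
    /"total"/     { gsub(/[^0-9]/, ""); total = $0 + 0; next }
    /}/ && name != "" {
      if (total > 0)
        printf "%s %d/%d (%d%%)\n", name, allocated, total, int(100 * allocated / total)
      name = ""; allocated = 0; total = 0
    }
  '
}

# Usage sketch:
#   kubectl get clusternetworkinfos <name> -o json | jq .usage | cidr_usage_percent
```

For the sample above, subnetCIDRUsage comes out at 832/1024 (81%), while the egress and ingress blocks are at 3% and 4%.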

When the IP blocks of supervisor cluster are exhausted, you will find the following warning when you describe the TKC.

 Conditions:
    Last Transition Time:  2022-10-05T18:34:35Z
    Message:               Cannot realize subnet
    Reason:                ClusterNetworkProvisionFailed
    Severity:              Warning
    Status:                False
    Type:                  Ready

Also when you check the namespace, you can see the following ncp error IP_BLOCK_EXHAUSTED.

 ❯ kg ns tsql-integration-test -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    calaxxxx.xxxyy.com/xxxrole-created: "1"
    ncp/error: IP_BLOCK_EXHAUSTED
    ncp/router_id: t1_d0a2af0f-8430-4250-9fcf-807a4afe51aa_rtr
    vmware-system-resource-pool: resgroup-307480
    vmware-system-vm-folder: group-v307481
  creationTimestamp: "2022-10-05T17:35:18Z"

Notes:

If the subnetCIDRUsage IP block is exhausted, you may need to remove some old/unused namespaces, which will release some IPs. If that is not possible, you may need to consider adding a new subnet.

Sometimes, even after removing the old/unused namespaces and with IPs available, the TKCs stay stuck at the creating phase. In that case, check the ncp, vmop, and capw controller pods; you may need to restart them. What I have observed is that, usually, after restarting the ncp pod, the vmop-controller pods, and all pods under the vmware-system-capw namespace, the VMs start getting deployed and the TKC creation progresses and completes successfully.

Resource availability

Check whether there are enough resources available in the cluster.

LAST SEEN  TYPE   REASON       OBJECT                    MESSAGE
3m23s    Warning  UpdateFailure   virtualmachine/magna3-control-plane-9rhl4   The host does not have sufficient CPU resources to satisfy the reservation.
80s     Warning  ReconcileFailure  wcpmachine/magna3-control-plane-s5s9t-p2cxj  vm is not yet powered on: vmware-system-capw-controller-manager/WCPMachine//chakravartha-magna3/magna3/magna3-control-plane-s5s9t-p2cxj

Check for resource limits applied to the namespace.

Check whether storage policy is assigned to the namespace

27m         Warning   ReconcileFailure               wcpmachine/gc-pool-0-cv8vz-5snbc          admission webhook "default.validating.virtualmachine.vmoperator.xxxyy.com" denied the request: StorageClass wdc-10-vc21c01-wcp-pod is not assigned to any ResourceQuotas in namespace mpereiramaia-demo2

In this case, the storage policy wasn't assigned to the namespace. I assigned the storage policy wdc-10-vc21c01-wcp-pod to the respective namespace, and the TKC deployment was successful.

Check Content library can sync properly

Sometimes issues related to CL can cause TKCs to get stuck at creating phase! Check this blog post for more details.

KCP can't remediate

Message:               KCP can't remediate if current replicas are less or equal then 1
Reason:                WaitingForRemediation @ Machine/gc-control-plane-zpssc
Severity:              Warning

In this case, you can just edit the TKC spec, change the control plane vmclass to a different class and save. Once the deployment is complete and TKC is running, edit the TKC spec again and revert the vmclass that you modified earlier to its original class. This process will re-provision the control plane.

TKC VMs waiting for IP

In this case, take a look at NSX-T and check whether all Edge nodes are healthy. If there are mismatch errors, resolve them.
You may also check the ncp pod logs and restart the ncp pod if required.

VirtualMachineClassBindingNotFound

Conditions:
    Last Transition Time:  2021-05-05T18:19:10Z
    Message:               1 of 2 completed
    Reason:                VirtualMachineClassBindingNotFound @ Machine/tkc-dev-control-plane-wxd57
    Severity:              Error
    Status:                False
    Message:               0/1 Control Plane Node(s) healthy. 0/2 Worker Node(s) healthy
Events:
  Normal  PhaseChanged  7m22s  vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller  cluster changes from creating phase to failed phase

This happens when the virtualmachineclassbindings are missing. It can be resolved by adding all the required VM Classes to the namespace using the vSphere Client. Following are the steps to add VM Classes to a namespace:

Log into vCenter web UI
From Hosts and Clusters > Select the namespace > Summary tab > VM Service tile > Click Manage VM Classes
Select all required VM Classes and click OK

Verify NSX-T objects

Issues at the NSX-T side can also cause the TKC to be stuck at creating phase. Following is a sample case and you can see these logs when you describe the TKC:

Message: 2 errors occurred:

   * failed to configure DNS for /, Kind= namespace-test-01/gc: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found
   * failed to configure kube-proxy for /, Kind= namespace-test-01/gc: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found

In this case, there were some issues with the virtual servers in the load balancer. Some stale virtual server entries were still present and their IPs had not been removed properly, which was causing intermittent connectivity issues to some of the other services of type LoadBalancer. New TKC deployments within the affected namespace were also getting stuck because of this. In our case, we deleted the affected namespace and recreated it; that cleaned up all the stale virtual server entries and the load balancer, and new TKC deployments were successful. So it is worth checking the health and status of the NSX-T objects if you have TKC deployment issues.

Check for broken TKCs in the cluster

Sometimes TKC deployments are very slow and take more than 30 minutes. In this case, you may notice that the first control plane VM gets deployed only 30-45 minutes after the TKC creation started. Look at the vmop controller logs. Following is a sample log:

❯ kail -n vmware-system-vmop
vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:44.725620       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.9.212:6443: connect: connection refused" "vmName"="ciroscosta-cartographer/kontinue-control-plane-svlk4" "result"=-1

vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:49.888653       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.2.66:6443: connect: connection refused" "vmName"="whaozhe-platform/gc-control-plane-mf4p5" "result"=-1

In the above case, two of the TKCs were broken/stuck at the updating phase, and we were unable to connect to their control planes.

ciroscosta-cartographer    kontinue    updating       2021-10-29T18:47:46Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     2
whaozhe-platform           gc          updating       2022-01-27T03:59:31Z   v1.20.12+vmware.1-tkg.1.b9a42f3   1     10

After removing the namespaces with the broken TKCs, new deployments were completing successfully.
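To spot which TKCs are behind these probe failures, the vmop log lines can be summarized per VM. This is a minimal sketch with a hypothetical function name, failing_probe_vms. It assumes the log format shown above, where each failing probe line carries a "vmName"="namespace/vm" field.

```shell
# Hypothetical helper: count "readiness probe fails" lines per vmName
# in vmop controller logs read from stdin, most frequent first.
failing_probe_vms() {
  sed -n '/readiness probe fails/ s/.*"vmName"="\([^"]*\)".*/\1/p' |
    awk '{ count[$0]++ } END { for (vm in count) print count[vm], vm }' |
    sort -rn
}

# Usage sketch (deployment and container names are inferred from the pod name
# in the log lines above, so treat them as assumptions):
#   kubectl logs -n vmware-system-vmop deploy/vmware-system-vmop-controller-manager \
#     -c manager --since=1h | failing_probe_vms
```

The namespace part of each vmName points at the supervisor namespace that holds the affected TKC.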

Restart system pods

Sometimes restarting some of the system controller pods resolves the issue. I usually delete all the pods in the following namespaces, and they get restarted in a few seconds.

k delete pod --all --namespace=vmware-system-vmop
k delete pod --all --namespace=vmware-system-capw
k delete pod --all --namespace=vmware-system-tkg
k delete pod --all --namespace=vmware-system-csi
k delete pod --all --namespace=vmware-system-nsx

Hope this was useful. Cheers!

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

In this article, we will take a look at troubleshooting some of the content library related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.

Case 1:

TKC (guest K8s cluster) deployments were failing because VMs were not getting deployed. You can see a Failed to deploy OVF package error in the VC UI. This was caused by the following error while syncing the content library: A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com.

Following is a sample log for this issue from the vmop-controller-manager:

Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error

This can be resolved by just editing the content library and accepting the new certificate thumbprint.

Case 2:

Missing TKRs. Even though the CL is present in the VC and has all the required OVF templates, the TKR resources are missing/not found on the supervisor cluster.

❯ kubectl get tkr
No resources found

This could happen if there are duplicate content libraries present in the VC with the same Subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronizing the CL.

If this doesn't resolve the issue, try to delete and recreate the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.

You may also verify the vmware-system-vmop-controller-manager pod logs and capw-controller-manager pod logs. Check if those pods are running, or getting continuously restarted. If required you may restart those pods.

Case 3:

TKC deployments were failing because VMs were not getting deployed. Sample vmop-controller-manager logs are given below:

E0803 18:51:30.638787       1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638821       1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638851       1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.639301       1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"

This could be resolved by restarting the cm-inventory service on all NSX-T Manager nodes. Following are the commands to check and restart the cm-inventory service on an NSX-T Manager node:

get service cm-inventory  
restart service cm-inventory

Case 4:

Sometimes you will notice stale contentsources object entries in the WCP K8s layer. Contentsources are the K8s-layer objects that correspond to content libraries. For various reasons/requirements, you might have created multiple content libraries and deleted some of them from the vCenter at a later point in time. They may not have been removed properly from the WCP K8s layer, and that's how these stale contentsources objects come about. You can use PowerCLI to list the content libraries currently present in the VC, compare them with the contentsources, and remove the stale entries.

> Get-ContentLibrary | select Name,Id | fl

Name : wdc-01-vc18c01-wcp
Id   : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d

> kg contentsources
NAME                                   AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db   173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d   321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662   242d
75e0668c-0cdc-421e-965d-fd736187cc57   173d
818c8700-efa4-416b-b78f-5f22e9555952   173d
9abbd108-aeb3-4b50-b074-9e6c00473b02   173d
a6cd1685-49bf-455f-a316-65bcdefac7cf   173d
acff9a91-0966-4793-9c3a-eb5272b802bd   242d
fcc08a43-1555-4794-a1ae-551753af9c03   173d

In the above sample case you can see multiple contentsource objects, but there is only one content library. So you can delete all the contentsource objects, except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.
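With many entries, the comparison can be scripted. This is a minimal sketch, assuming the content library IDs from PowerCLI have been saved one per line in a file. The function name stale_contentsources and the file name library_ids.txt are hypothetical.

```shell
# Hypothetical helper: print contentsource names that match no known library ID.
#   $1     file with one valid content library ID per line (must not be empty)
#   stdin  contentsource names, one per line
stale_contentsources() {
  grep -vxF -f "$1"
}

# Usage sketch:
#   kubectl get contentsources --no-headers -o custom-columns=NAME:.metadata.name |
#     stale_contentsources library_ids.txt
```

Review the printed names before deleting anything. If the ID file is empty or has Windows line endings, every contentsource will look stale.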