vineethac.blogspot.com: TKC


Saturday, July 30, 2022

vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase

Ideally, a TKC (Tanzu Kubernetes Cluster, also known as a Guest Cluster) should be in the running phase. Sometimes, for a variety of reasons, it gets stuck in the updating phase. In this article, we will take a sample case and look at how to troubleshoot and fix it.

Following is an example:

NAMESPACE              NAME                    PHASE      CREATIONTIME           VERSION                           CP    WORKER
karvea-vc17ns11        sc201vc17pace           updating   2021-11-19T12:17:24Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     4

Let's connect to this TKC. Here I have a small plugin (kubectl-gckc) that generates the TKC kubeconfig, and gcc is an alias for KUBECONFIG=gckubeconfig, where gckubeconfig is the TKC admin kubeconfig file.
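The transcripts in this post rely on a few shell shorthands. Only gcc is described in the text; the expansions of k, kg and kd below are assumed from common kubectl alias conventions. A minimal sketch, written as functions rather than aliases so that gcc can wrap the other helpers:

```shell
# Assumed expansions: only `gcc` is described in the post itself.
k()  { kubectl "$@"; }
kg() { kubectl get "$@"; }
kd() { kubectl describe "$@"; }

# Run any command (including the helpers above) against the guest cluster.
# gckubeconfig is the TKC admin kubeconfig file mentioned above.
# The subshell keeps KUBECONFIG from leaking into the calling shell.
gcc() { ( export KUBECONFIG=gckubeconfig; "$@" ); }
```

With these definitions, commands prefixed with gcc go to the TKC, while commands without the prefix (for example kg vm or kg machine) go to whatever context is currently active, which in this post appears to be the Supervisor Cluster.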

❯ k gckc karvea-vc17ns11 sc201vc17pace
❯ gcc kg no
NAME                                           STATUS                     ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                      control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    Ready,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    Ready,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                      <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                      <none>                 139d   v1.20.9+vmware.1

❯ kg vm -n karvea-vc17ns11
NAME                                           POWERSTATE   AGE
sc201vc17pace-control-plane-zt99l              poweredOn    139d
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    poweredOn    189d
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    poweredOn    189d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   poweredOn    139d



❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    vsphere://42010982-8b25-ad7b-2a1d-bb949def4834   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1
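On a larger cluster it can be handy to pull out just the machines that are being deleted. The following is a small sketch, not part of the original workflow: the helper name stuck_machines is hypothetical, and it assumes the default column layout shown above, where PHASE is the fifth whitespace-separated field and no earlier column is empty.

```shell
# Print the names of Machines whose PHASE column reads "Deleting".
# Reads `kubectl get machine` table output on stdin.
# Assumption: default column order (NAME CLUSTER NODENAME PROVIDERID PHASE ...).
stuck_machines() {
  awk 'NR > 1 && $5 == "Deleting" { print $1 }'
}

# Usage (against the Supervisor namespace):
#   kubectl get machine -n karvea-vc17ns11 | stuck_machines
```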

As you can see above, two worker machines are stuck in the Deleting phase. This is because the corresponding two worker nodes are still in Ready,SchedulingDisabled status: they are cordoned but, for some reason, not yet drained. Once a node is drained properly, its status changes to NotReady,SchedulingDisabled. Now let's try to drain those worker nodes manually.

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz", aborting command...

There are pending nodes to be drained:
 sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
cannot delete Pods with local storage (use --delete-emptydir-data to override): nsxi-platform/kafka-2
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
❯ gcc kg pdb
No resources found in default namespace.
❯ gcc kg pdb -A
NAMESPACE       NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nsxi-platform   kafka       N/A             1                 0                     188d
nsxi-platform   zookeeper   N/A             1                 1                     188d
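The column to watch in this output is ALLOWED DISRUPTIONS. A value of 0 means the eviction API will refuse to evict any pod covered by that PDB, which is what the drain ran into with kafka-2. A short sketch for spotting such PDBs, assuming the `-A` table layout shown above (the helper name blocking_pdbs is hypothetical):

```shell
# Print namespace/name of every PDB that currently allows zero disruptions.
# Reads `kubectl get pdb -A` table output on stdin.
# Data rows have 6 fields; the header has more because three column titles
# contain a space, so it is skipped by line number rather than parsed.
blocking_pdbs() {
  awk 'NR > 1 && $5 == 0 { print $1 "/" $2 }'
}

# Usage (against the guest cluster):
#   kubectl get pdb -A | blocking_pdbs
```

Without `-A` the NAMESPACE column is missing and the field positions shift by one, so the filter would need adjusting.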

This worker node, sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz, is not getting drained because of a pod disruption budget (PDB): the kafka PDB shows 0 allowed disruptions, so the eviction of kafka-2 is refused. To drain the node, I take a backup of the PDB YAML and then delete the PDB. Once the nodes are drained, I will apply the PDB YAML back to the cluster.

❯ gcc kg pdb -n nsxi-platform kafka -oyaml > pdb-nsxi-platform-kafka.yaml
❯ code pdb-nsxi-platform-kafka.yaml
❯ gcc kg pdb -n nsxi-platform zookeeper -oyaml > pdb-nsxi-platform-zookeeper.yaml
❯ code pdb-nsxi-platform-zookeeper.yaml

❯ gcc k delete pdb kafka -n nsxi-platform
poddisruptionbudget.policy "kafka" deleted
❯ gcc kg pdb -A
NAMESPACE       NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nsxi-platform   zookeeper   N/A             1                 1                     188d
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
pod/kafka-2 evicted
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz evicted
❯

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz drained


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw", aborting command...

There are pending nodes to be drained:
 sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw --ignore-daemonsets
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw drained

The worker nodes are now drained.

❯ gcc kg no
NAME                                           STATUS                        ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                         control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                         <none>                 139d   v1.20.9+vmware.1

❯ gcc kg no
NAME                                           STATUS                        ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready                         control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    NotReady,SchedulingDisabled   <none>                 189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready                         <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready                         <none>                 139d   v1.20.9+vmware.1
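With both old workers drained, the kafka PDB that was deleted earlier can be put back. The post does not show this step, so the following is only a sketch under the assumption that the backup file from the earlier `-oyaml` export is applied as-is; the helper name restore_pdb is hypothetical.

```shell
# Re-apply a backed-up PDB and show its current state.
#   $1: backup file   $2: namespace   $3: PDB name
restore_pdb() {
  kubectl apply -f "$1" &&
    kubectl get pdb "$3" -n "$2"
}

# Usage (with KUBECONFIG pointing at the guest cluster):
#   restore_pdb pdb-nsxi-platform-kafka.yaml nsxi-platform kafka
```

If the apply is rejected because of server-populated metadata in the export (for example `resourceVersion` or `uid`), removing those fields from the file first is a common workaround.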

Once the worker nodes were drained, one of them was removed successfully, but the other worker node is still present. Looking at the machine resources, one worker machine is still stuck in the Deleting phase. In this case I manually deleted the worker node, but the corresponding worker machine remained stuck in the Deleting phase.

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1


❯ gcc k delete node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw" deleted
❯
❯ gcc kg no
NAME                                           STATUS   ROLES                  AGE    VERSION
sc201vc17pace-control-plane-zt99l              Ready    control-plane,master   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   Ready    <none>                 139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   Ready    <none>                 139d   v1.20.9+vmware.1

Now let's describe the worker machine stuck at Deleting and look at the PVCs in the same namespace. In this case you can see that two PVCs are stuck in Terminating status, so I edited the YAML of those two PVCs and set their finalizers to null.

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE      AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    sc201vc17pace   sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw    vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5   Deleting   189d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running    139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running    139d   v1.20.9+vmware.1



❯ kg vm -n karvea-vc17ns11
NAME                                           POWERSTATE   AGE
sc201vc17pace-control-plane-zt99l              poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   poweredOn    139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   poweredOn    139d


❯ kd machine sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw -n karvea-vc17ns11

Events:
  Type    Reason                  Age                   From                           Message
  ----    ------                  ----                  ----                           -------
  Normal  DetectedUnhealthy       13m (x2 over 17m)     machinehealthcheck-controller  Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has unhealthy node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
  Normal  SuccessfulDrainNode     13m (x2 over 19m)     machine-controller             success draining Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
  Normal  NodeVolumesDetached     12m (x2 over 19m)     machine-controller             success waiting for node volumes detach Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
  Normal  MachineMarkedUnhealthy  106s (x4 over 9m58s)  machinehealthcheck-controller  Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has been marked as unhealthy
❯
❯ kg pvc -n karvea-vc17ns11
NAME                                                                        STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb   Bound         pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6   Bound         pvc-97e6e063-9a9e-4837-9999-284523379453   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624   Bound         pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811   Bound         pvc-faa7798e-c045-420f-9d09-44674d9d2326   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950   Bound         pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447   Bound         pvc-49fca2f0-3402-429f-884f-7db9012934d6   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0   Bound         pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c   Bound         pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa   Bound         pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009   Bound         pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c   Bound         pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
sc201vc17pace-workers-wswdh-2hz8w-containerd                                Bound         pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-5pjrc-containerd                                Bound         pvc-fb162388-4347-4f48-825e-c2c2d62ceb90   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-755m6-containerd                                Terminating   pvc-da2e4866-bb41-4f74-a4b7-0f74bc7061a1   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   189d
sc201vc17pace-workers-wswdh-dgmjs-containerd                                Terminating   pvc-64eac528-f160-444c-9a0f-0ed9f6393e06   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   189d
sc201vc17pace-workers-wswdh-djp2m-containerd                                Bound         pvc-a7542552-de13-4670-ac45-84ed39c3c916   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-flwtt-containerd                                Bound         pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
❯
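The finalizer edit described above can also be done without opening an editor. A hedged sketch: the helper names are hypothetical, the filter assumes STATUS is the second column as in the listing above, and the patch uses a JSON merge patch to null out `metadata.finalizers`. Clearing finalizers skips whatever cleanup they were guarding, so it is worth confirming the volumes are really no longer in use first.

```shell
# Print the names of PVCs whose STATUS column reads "Terminating".
# Reads `kubectl get pvc -n <namespace>` table output on stdin.
terminating_pvcs() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Remove the finalizers from each PVC named on stdin (one name per line).
#   $1: namespace
clear_pvc_finalizers() {
  while IFS= read -r pvc; do
    # </dev/null keeps kubectl from consuming the loop's input.
    kubectl patch pvc "$pvc" -n "$1" --type=merge \
      -p '{"metadata":{"finalizers":null}}' </dev/null
  done
}

# Usage (against the Supervisor namespace):
#   kubectl get pvc -n karvea-vc17ns11 | terminating_pvcs
#   kubectl get pvc -n karvea-vc17ns11 | terminating_pvcs | clear_pvc_finalizers karvea-vc17ns11
```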

As soon as the PVCs were removed, the worker machine that was stuck at Deleting was removed as well, and the TKC changed its status to running.

❯ kg pvc -n karvea-vc17ns11
NAME                                                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb   Bound    pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6   Bound    pvc-97e6e063-9a9e-4837-9999-284523379453   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624   Bound    pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811   Bound    pvc-faa7798e-c045-420f-9d09-44674d9d2326   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950   Bound    pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447   Bound    pvc-49fca2f0-3402-429f-884f-7db9012934d6   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0   Bound    pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c   Bound    pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e   10Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa   Bound    pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1   8Gi        RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009   Bound    pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40   128Gi      RWO            sc2-01-vc17c01-wcp-mgmt   188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c   Bound    pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256   20Gi       RWO            sc2-01-vc17c01-wcp-mgmt   188d
sc201vc17pace-workers-wswdh-2hz8w-containerd                                Bound    pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-5pjrc-containerd                                Bound    pvc-fb162388-4347-4f48-825e-c2c2d62ceb90   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-djp2m-containerd                                Bound    pvc-a7542552-de13-4670-ac45-84ed39c3c916   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d
sc201vc17pace-workers-wswdh-flwtt-containerd                                Bound    pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73   42Gi       RWO            sc2-01-vc17c01-wcp-mgmt   139d

❯ kg machine -n karvea-vc17ns11
NAME                                           CLUSTER         NODENAME                                       PROVIDERID                                       PHASE     AGE    VERSION
sc201vc17pace-control-plane-zt99l              sc201vc17pace   sc201vc17pace-control-plane-zt99l              vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt   vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp   vsphere://4201160b-21c9-ccc2-6826-e3545e34b490   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5   vsphere://420125a8-e45c-04b7-5612-ce3149e86d74   Running   139d   v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   sc201vc17pace   sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv   vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30   Running   139d   v1.20.9+vmware.1

❯ kgtkca | grep karvea
karvea-vc17ns11                             sc201vc17pace           running    2021-11-19T12:17:24Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     4

Note: The above case is a sample scenario, and the reasons why a TKC is stuck at updating may vary based on several conditions. This is a generic method you can follow when approaching this kind of issue.

Hope it was useful. Cheers!

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

In this article, we will look at troubleshooting some of the content library related issues that you may encounter while administering vSphere with Tanzu clusters.

Case 1:

TKC (guest K8s cluster) deployments were failing because the VMs were not getting deployed. You can see a "Failed to deploy OVF package" error in the VC UI. The cause was the following error while syncing the content library: "A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com".

Following is a sample log for this issue from the vmop-controller-manager:

Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error

This can be resolved by editing the content library and accepting the new certificate thumbprint.

Case 2:

Missing TKRs. Even though the content library (CL) is present in the VC and has all the required OVF templates, the TKR resources are missing (not found) on the Supervisor Cluster.

❯ kubectl get tkr
No resources found

This could happen if there are duplicate content libraries present in the VC with the same subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronizing the CL.

If this doesn't resolve the issue, try to delete and recreate the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.

You may also check the vmware-system-vmop-controller-manager pod logs and the capw-controller-manager pod logs. Check whether those pods are running or are being restarted continuously. If required, you may restart those pods.
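A quick way to see whether a controller pod keeps restarting is to filter the RESTARTS column. This is only a sketch: the helper name restarting_pods is hypothetical, the namespace vmware-system-capw and the deployment name in the usage lines are assumptions inferred from the pod names, and the filter expects the output of `kubectl get pods -n <namespace>` (RESTARTS as the fourth field).

```shell
# Print name, status and restart count for pods at or above a restart threshold.
# Reads `kubectl get pods -n <namespace>` table output on stdin.
#   $1: minimum restart count to report (default 1)
restarting_pods() {
  awk -v min="${1:-1}" 'NR > 1 && $4 + 0 >= min { print $1, $3, $4 }'
}

# Usage (on the Supervisor Cluster; names below are assumptions):
#   kubectl get pods -n vmware-system-vmop | restarting_pods
#   kubectl get pods -n vmware-system-capw | restarting_pods 5
#   kubectl rollout restart deployment vmware-system-vmop-controller-manager -n vmware-system-vmop
```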

Case 3:

TKC deployments were failing because the VMs were not getting deployed. Sample vmop-controller-manager logs are given below:

E0803 18:51:30.638787       1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638821       1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638851       1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.639301       1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"

This could be resolved by restarting the cm-inventory service on all NSX-T Manager nodes. Following are the commands to check and restart the cm-inventory service on an NSX-T Manager node:

get service cm-inventory  
restart service cm-inventory

Case 4:

Sometimes you will notice stale contentsources objects in the WCP K8s layer. Contentsources are the K8s-layer objects that correspond to content libraries. For various reasons you might have created multiple content libraries and later deleted some of them from the vCenter, but they may not have been removed properly from the WCP K8s layer, and that is how these stale contentsources objects come about. You can use PowerCLI to list the content libraries currently present in the VC, compare them with the contentsources, and remove the stale entries.

> Get-ContentLibrary | select Name,Id | fl

Name : wdc-01-vc18c01-wcp
Id   : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d

> kg contentsources
NAME                                   AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db   173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d   321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662   242d
75e0668c-0cdc-421e-965d-fd736187cc57   173d
818c8700-efa4-416b-b78f-5f22e9555952   173d
9abbd108-aeb3-4b50-b074-9e6c00473b02   173d
a6cd1685-49bf-455f-a316-65bcdefac7cf   173d
acff9a91-0966-4793-9c3a-eb5272b802bd   242d
fcc08a43-1555-4794-a1ae-551753af9c03   173d

In the above sample case you can see multiple contentsource objects, but there is only one content library. So you can delete all the contentsource objects, except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.
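The comparison can be scripted when the list is long. A minimal sketch, assuming the content library Ids from PowerCLI have been saved to a plain text file with one Id per line and no trailing whitespace or carriage returns (the helper name stale_contentsources and the file name are hypothetical):

```shell
# Print contentsources that have no matching content library Id.
#   $1: file with one content library Id per line
# stdin: `kubectl get contentsources` table output
stale_contentsources() {
  awk 'NR > 1 { print $1 }' | grep -vxF -f "$1"
}

# Usage:
#   kubectl get contentsources | stale_contentsources library-ids.txt
```

It is safer to review the printed names before deleting anything, since an incomplete Id file would make live contentsources look stale.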

Hope it was useful. Cheers!

Saturday, May 21, 2022

vSphere with Tanzu using NSX-T - Part15 - Working with etcd on TKC with one control plane

In this article, we will see how to work with the etcd database of a Tanzu Kubernetes Cluster (TKC) with one control plane node and perform some basic operations. Following is a TKC with one control plane node and three worker nodes:

Get K8s cluster nodes

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE     VERSION
gc-control-plane-6g9gk              Ready    control-plane,master   3d7h    v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-n5qp8   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-wds2m   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-z2wvt   Ready    <none>                 7d19h   v1.21.6+vmware.1

Get the etcd pod and describe it

❯ gcc kg pod -A | grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          3d7h
❯
❯ gcc kd pod etcd-gc-control-plane-6g9gk -n kube-system
Name:                 etcd-gc-control-plane-6g9gk
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gc-control-plane-6g9gk/100.68.36.38
Start Time:           Tue, 19 Jul 2022 11:14:25 +0530
Labels:               component=etcd
                      tier=control-plane
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
                      kubernetes.io/config.hash: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.mirror: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.seen: 2022-07-19T05:44:19.416549595Z
                      kubernetes.io/config.source: file
                      kubernetes.io/psp: vmware-system-privileged
Status:               Running
IP:                   100.68.36.38
IPs:
  IP:           100.68.36.38
Controlled By:  Node/gc-control-plane-6g9gk
Containers:
  etcd:
    Container ID:  containerd://253c7b25bd60ea78dfccad52d03534785f0d7b7a1fa7105dbd55d7727f8785c3
    Image:         localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    Image ID:      sha256:78661ebbe1adaee60336a0f8ff031c4537ff309ef51feab6e840e7dbb3cbf47d
    Port:          <none>
    Host Port:     <none>
    Command:
      etcd
      --advertise-client-urls=https://100.68.36.38:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://100.68.36.38:2380
      --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
      --initial-cluster-state=existing
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://100.68.36.38:2380
      --name=gc-control-plane-6g9gk
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Running
      Started:      Tue, 19 Jul 2022 11:14:27 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     100Mi
    Liveness:     http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:      http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment:  <none>
    Mounts:
      /etc/kubernetes/pki/etcd from etcd-certs (rw)
      /var/lib/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate
  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute op=Exists
Events:            <none>

Exec into the etcd pod and run etcdctl commands

To run etcdctl you need to provide the cacert, cert, and key files. You can get all of these paths from the output of describing the etcd pod (see the --trusted-ca-file, --cert-file, and --key-file flags).
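
Since the same endpoint and certificate flags are repeated in every etcdctl call in this post, they can be wrapped in a small helper. This is a minimal sketch: etcdctl_cmd and ETCD_PKI are hypothetical names, and the function only prints the command string, which you then pass to sh -c inside the pod.

```shell
# Directory holding the etcd certificates, as shown in the pod description.
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

# Hypothetical helper: print the full etcdctl command line for the given
# subcommand and arguments. Nothing is executed here.
etcdctl_cmd() {
  printf 'ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert %s/ca.crt --cert %s/server.crt --key %s/server.key %s\n' \
    "$ETCD_PKI" "$ETCD_PKI" "$ETCD_PKI" "$*"
}

# Print the command that would check endpoint health:
etcdctl_cmd endpoint health
```

It could then be used as kubectl exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "$(etcdctl_cmd member list)", with the kubeconfig of the TKC selected as in the commands that follow.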

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
❯
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint health --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --write-out json"
[{"endpoint":"127.0.0.1:2379","health":true,"took":"9.689387ms"}]
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key -w json"
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":4073335150581888229,"member_id":14250600431682432472,"revision":2153804,"raft_term":11},"version":"3.4.13","dbSize":24719360,"leader":14250600431682432472,"raftIndex":2429139,"raftTerm":11,"raftAppliedIndex":2429139,"dbSizeInUse":2678784}}]
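
In the endpoint status output, dbSize is the size of the database file on disk and dbSizeInUse is the part that is logically in use. As a rough sketch (db_unused_pct is a hypothetical helper, not part of etcdctl), the share of the file that is currently unused can be computed with plain shell arithmetic:

```shell
# Hypothetical helper: integer percentage of the etcd db file that is
# allocated but not in use. Arguments: dbSize and dbSizeInUse, in bytes.
db_unused_pct() {
  db_size="$1"
  in_use="$2"
  echo $(( (db_size - in_use) * 100 / db_size ))
}

# Values taken from the endpoint status output above:
db_unused_pct 24719360 2678784
```

A large gap like this is usually what etcdctl defrag reclaims, but I have not run defrag as part of this walkthrough, so treat that as a pointer rather than a tested step.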

Snapshot etcd

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot save snapshotdb-$(date +%d-%m-%y)"
{"level":"info","ts":1658651541.2698474,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb-24-07-22.part"}
{"level":"info","ts":"2022-07-24T08:32:21.277Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1658651541.2771788,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-07-24T08:32:21.594Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1658651541.621639,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"25 MB","took":0.344746859}
{"level":"info","ts":1658651541.621852,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb-24-07-22"}
Snapshot saved at snapshotdb-24-07-22
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh
# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	proc  root  run  sbin  snapshotdb-24-07-22  srv  sys  tmp  usr	var
#
# exit
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot status snapshotdb-24-07-22 -w table"
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+
❯

Copy snapshot file from etcd pod to local machine

Note: Even though kubectl cp reported an error while copying the snapshot file from the pod to the local machine, the file was copied successfully. I verified the status of the local copy using etcdctl, and every field (hash, revision, total keys, total size) matches that of the source file.

❯ gcc kubectl cp kube-system/etcd-gc-control-plane-6g9gk:/snapshotdb-24-07-22 snapshotdb-24-07-22
tar: Removing leading `/' from member names
error: unexpected EOF
❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 -w table
Deprecated: Use `etcdutl snapshot status` instead.

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+       
❯
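
Because kubectl cp ended with "unexpected EOF", an extra byte-level check on top of etcdctl snapshot status may be reassuring. The following is a sketch only: cksum_matches is a hypothetical helper, and it assumes you can obtain the CRC and size of the source file, for example by running sh -c "cksum < /snapshotdb-24-07-22" inside the etcd pod (whether cksum exists in that container image is an assumption I have not verified).

```shell
# Hypothetical helper: succeed when FILE's POSIX cksum output ("CRC SIZE")
# equals EXPECTED, a string captured from the source copy.
cksum_matches() {
  expected="$1"
  file="$2"
  [ "$(cksum < "$file")" = "$expected" ]
}

# Self-contained demonstration with a throwaway file:
tmp="$(mktemp)"
printf 'snapshot-bytes' > "$tmp"
expected="$(cksum < "$tmp")"
if cksum_matches "$expected" "$tmp"; then echo "match"; else echo "MISMATCH"; fi
rm -f "$tmp"
```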

Restore etcd

We can restore the etcd snapshot using etcdctl from the TKC control plane node. In order to connect to the control plane VM, we need to create a jumpbox pod under the corresponding supervisor namespace.

So, first copy the snapshot file from the local machine to the jumpbox pod.

❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ kubectl cp snapshotdb-24-07-22 vineetha-test04-deploy/jumpbox01:/
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- sh
sh-4.4# su
root [ / ]# ls
bin   dev  home  lib64  mnt   root  sbin                 srv  tmp  var
boot  etc  lib   media  proc  run   snapshotdb-24-07-22  sys  usr
root [ / ]#

Copy the snapshot file from the jumpbox pod to the control plane node.

❯ gcc kg po -A -o wide| grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          166m   100.68.36.38   gc-control-plane-6g9gk              <none>           <none>
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- scp /snapshotdb-24-07-22 vmware-system-user@100.68.36.38:/tmp
snapshotdb-24-07-22                           100%   20MB 126.1MB/s   00:00
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- /usr/bin/ssh vmware-system-user@100.68.36.38
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Sun Jul 24 13:02:39 2022 from 100.68.35.210
 13:14:29 up 5 days,  7:38,  0 users,  load average: 0.98, 0.53, 0.31

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# cd /tmp/

Install etcd on the control plane node so that we get access to the etcdctl utility.

root [ /tmp ]# tdnf install etcd
root [ /tmp ]# ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
root [ /tmp ]#
root [ /tmp ]# ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
Deprecated: Use `etcdutl snapshot status` instead.

b0910e83, 2434362, 1580, 25 MB
root [ /tmp ]# hostname
gc-control-plane-6g9gk
root [ /tmp ]# ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot restore /tmp/snapshotdb-24-07-22 --data-dir=/var/lib/etcd-backup --skip-hash-check=true
Deprecated: Use `etcdutl snapshot restore` instead.

2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:251	restoring snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/snapshot/v3_snapshot.go:257\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/command/snapshot_command.go:128\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/main.go:59\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
2022-07-24T13:20:44Z	info	membership/store.go:141	Trimming membership information from the backend...
2022-07-24T13:20:44Z	info	membership/cluster.go:421	added member	{"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]}
2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:272	restored snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap"}
root [ /var/lib ]#

We have restored the database snapshot to a new location: --data-dir=/var/lib/etcd-backup. So we need to change the etcd-data hostPath to path: /var/lib/etcd-backup in the etcd static pod manifest file (etcd.yaml). First, copy the contents of the etcd.yaml file.

root [ /var/lib ]# cd /etc/kubernetes/manifests/
root [ /etc/kubernetes/manifests ]# ls
etcd.yaml            kube-controller-manager.yaml  registry.yaml
kube-apiserver.yaml  kube-scheduler.yaml
root [ /etc/kubernetes/manifests ]# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://100.68.36.38:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://100.68.36.38:2380
    - --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://100.68.36.38:2380
    - --name=gc-control-plane-6g9gk
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}

I had difficulty modifying the file in the terminal. So I copied the contents of etcd.yaml to my local machine, modified the path, removed the existing etcd.yaml file, created a new etcd.yaml file, and pasted the modified content into it.

root [ /etc/kubernetes/manifests ]# rm etcd.yaml
root [ /etc/kubernetes/manifests ]# vi etcd.yaml

<paste the above etcd.yaml file contents with the modified etcd-data hostPath; the last part of the yaml will look like below:

  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd-backup
      type: DirectoryOrCreate
    name: etcd-data
status: {}

>
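
As an alternative to editing by hand, the same one-line change could be made with sed. This is a sketch I have not run on a TKC node: retarget_etcd_data is a hypothetical helper, and it relies on the manifest layout shown above, where the only line of the form "path: /var/lib/etcd" is the etcd-data hostPath. The --data-dir flag and the mountPath line are left untouched, because the container still sees the data at /var/lib/etcd.

```shell
# Hypothetical helper: read an etcd static pod manifest on stdin and print it
# with the etcd-data hostPath pointed at NEW_DIR. Only lines that consist of
# indentation followed by exactly "path: /var/lib/etcd" are rewritten.
# NEW_DIR must not contain "|" or "&", which are special to this sed call.
retarget_etcd_data() {
  new_dir="$1"
  sed "s|^\( *path: \)/var/lib/etcd\$|\1${new_dir}|"
}

# Demonstration on the relevant lines of the manifest:
printf '%s\n' \
  '    - --data-dir=/var/lib/etcd' \
  '    - mountPath: /var/lib/etcd' \
  '      path: /etc/kubernetes/pki/etcd' \
  '      path: /var/lib/etcd' \
  | retarget_etcd_data /var/lib/etcd-backup
```

On the node this might be run with the manifest as input and a file outside /etc/kubernetes/manifests as output, so that the kubelet only sees the new version once you have reviewed it and moved it into place.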

Once etcd.yaml is saved, after a few seconds you can see that the etcd pod is running.

root [ /etc/kubernetes/manifests ]# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
92dad43a85ebc       78661ebbe1ada       1 second ago        Running             etcd                           0                   8258804cf17bb
5c704092eb4bb       25605c4ab20fe       10 seconds ago      Running             csi-resizer                    10                  1fa9feb732df5
495b5ff250cb4       05cfd9e3c3f22       10 seconds ago      Running             csi-provisioner                22                  1fa9feb732df5
c0da43af7f1d0       a145efcc3afb4       11 seconds ago      Running             vsphere-syncer                 10                  1fa9feb732df5
4e6d67dc16f4a       5cb2119a4d797       11 seconds ago      Running             kube-controller-manager        11                  3d1c266aa5e24
8710a7b6a8563       fa70d7ee973ad       11 seconds ago      Running             guest-cluster-cloud-provider   35                  3f3d5eb0929e5
4e0c1e2e72682       f18cde23836f5       11 seconds ago      Running             csi-attacher                   10                  1fa9feb732df5
bf6771ca4dc4d       a609b91a17410       11 seconds ago      Running             kube-scheduler                 11                  f50eada3f8127
05fa3f2f587e8       fa70d7ee973ad       3 hours ago         Exited              guest-cluster-cloud-provider   34                  3f3d5eb0929e5
888ba7ce34d92       3f6d2884f8105       3 hours ago         Running             kube-apiserver                 28                  c368bd9937f8b
6579ffffa5f53       382a8821c56e0       3 hours ago         Running             metrics-server                 21                  622c835648008
bdd0c760d0bc7       382a8821c56e0       3 hours ago         Exited              metrics-server                 20                  622c835648008
6063485d73e38       78661ebbe1ada       3 hours ago         Exited              etcd                           0                   71e6ae04726ad
259595a330d26       3f6d2884f8105       3 hours ago         Exited              kube-apiserver                 27                  c368bd9937f8b
5034f4ea18f1f       25605c4ab20fe       4 hours ago         Exited              csi-resizer                    9                   1fa9feb732df5
21de3c4850dc3       05cfd9e3c3f22       4 hours ago         Exited              csi-provisioner                21                  1fa9feb732df5
d623946b1d270       a145efcc3afb4       4 hours ago         Exited              vsphere-syncer                 9                   1fa9feb732df5
5f50b9e93d287       a609b91a17410       4 hours ago         Exited              kube-scheduler                 10                  f50eada3f8127
ec4e066f54fd6       5cb2119a4d797       4 hours ago         Exited              kube-controller-manager        10                  3d1c266aa5e24
f05fee251a700       f18cde23836f5       4 hours ago         Exited              csi-attacher                   9                   1fa9feb732df5
d3577bf8477d0       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   1fa9feb732df5
d30ba30b8c203       4251b7012fd43       5 days ago          Running             vsphere-csi-controller         0                   1fa9feb732df5
1816ed6aada5f       02abc4bd595a0       5 days ago          Running             guest-cluster-auth-service     0                   bff70bcd389be
3e014a6745f5a       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   66e5ca3abfe5b
ba82f29cf8939       4251b7012fd43       5 days ago          Running             vsphere-csi-node               0                   66e5ca3abfe5b
9c29960956718       f3fe18dd8cea2       5 days ago          Running             node-driver-registrar          0                   66e5ca3abfe5b
92aa3a904d72e       0515f8357a522       5 days ago          Running             antrea-agent                   3                   4ae63a9f8e4cb
9e0f32fa88663       0515f8357a522       5 days ago          Exited              antrea-agent                   2                   4ae63a9f8e4cb
ac5573a4d93f8       0515f8357a522       5 days ago          Running             antrea-ovs                     0                   4ae63a9f8e4cb
e14fea2c37c21       0515f8357a522       5 days ago          Exited              install-cni                    0                   4ae63a9f8e4cb
166627360a434       7fde82047d4f6       5 days ago          Running             docker-registry                0                   c84c550a9ab90
ecfbdcd23858d       f31127f4a3471       5 days ago          Running             kube-proxy                     0                   07f5b9be02414
root [ /etc/kubernetes/manifests ]# exit
exit
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ exit
logout

Verify

In my case I had a namespace vineethac-testing with two nginx pods running under it while the snapshot was taken. After the snapshot was taken, I deleted the two nginx pods and the namespace vineethac-testing. After restoring the etcd snapshot, I can see that the namespace vineethac-testing is active with two nginx pods under it.

❯ gcc kg ns
NAME                           STATUS   AGE
default                        Active   9d
kube-node-lease                Active   9d
kube-public                    Active   9d
kube-system                    Active   9d
vineethac-testing              Active   4h28m
vmware-system-auth             Active   9d
vmware-system-cloud-provider   Active   9d
vmware-system-csi              Active   9d
❯
❯ gcc kg pods -n vineethac-testing
NAME     READY   STATUS    RESTARTS   AGE
nginx1   1/1     Running   0          4h25m
nginx2   1/1     Running   0          4h23m

Hope it was useful. Cheers!

Note: I've tested this in a lab. This may not be the best practice procedure and may slightly vary in a real world environment.

Saturday, February 19, 2022

Kubernetes 101 - Part5 - Removing namespaces stuck in terminating state

Namespaces getting stuck in the Terminating state is one of the common issues I have seen while working with K8s. Here is an example namespace in Terminating state, and you can see there are no resources under it. In this case we remove the namespace by setting its finalizers to an empty list.

% kg ns rohitgu-intelligence-cluster-6
NAME                             STATUS        AGE
rohitgu-intelligence-cluster-6   Terminating   188d

% kg pods,tkc,all,cluster-api -n rohitgu-intelligence-cluster-6
No resources found in rohitgu-intelligence-cluster-6 namespace.

% kg ns rohitgu-intelligence-cluster-6 -o json > rohitgu-intelligence-cluster-6-json

% jq '.spec.finalizers = [] | .metadata.finalizers = []' rohitgu-intelligence-cluster-6-json > rohitgu-intelligence-cluster-6-json-nofinalizer

% cat rohitgu-intelligence-cluster-6-json-nofinalizer
{
"apiVersion": "v1",
"kind": "Namespace",
"metadata": {
    "annotations": {
      "calaxxxxx.xxxx.com/ccsrole-created": "true",
      "calaxxxxx.xxxx.com/owner": "rohitgu",
      "calaxxxxx.xxxx.com/user-namespace": "rohitgu",
      "ls_id-0": "e28442b5-ace0-4e20-b5a0-c32bc72427d9",
      "ncp/extpoolid": "domain-c1034:02cde809-99d1-423e-aac9-014889740308-ippool-10-186-120-1-10-186-123-254",
      "ncp/router_id": "t1_87d44fc8-ac60-441a-8e35-509ff31a4eba_rtr",
      "ncp/snat_ip": "10.186.120.40",
      "ncp/subnet-0": "172.29.1.144/28",
      "vmware-system-resource-pool": "resgroup-663809",
      "vmware-system-vm-folder": "group-v663810"
    },
    "creationTimestamp": "2021-08-26T09:11:39Z",
    "deletionTimestamp": "2022-03-02T07:16:27Z",
    "labels": {
      "kubernetes.io/metadata.name": "rohitgu-intelligence-cluster-6",
      "vSphereClusterID": "domain-c1034"
    },
    "name": "rohitgu-intelligence-cluster-6",
    "resourceVersion": "1133900371",
    "selfLink": "/api/v1/namespaces/rohitgu-intelligence-cluster-6",
    "uid": "87d44fc8-ac60-441a-8e35-509ff31a4eba",
    "finalizers": []
},
"spec": {
    "finalizers": []
},
"status": {
    "conditions": [
      {
        "lastTransitionTime": "2022-03-02T07:16:32Z",
        "message": "Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: data.packaging.carvel.dev/v1alpha1: the server is currently unable to handle the request",
        "reason": "DiscoveryFailed",
        "status": "True",
        "type": "NamespaceDeletionDiscoveryFailure"
      },
      {
        "lastTransitionTime": "2022-03-02T07:16:56Z",
        "message": "All legacy kube types successfully parsed",
        "reason": "ParsedGroupVersions",
        "status": "False",
        "type": "NamespaceDeletionGroupVersionParsingFailure"
      },
      {
        "lastTransitionTime": "2022-03-02T07:16:56Z",
        "message": "All content successfully deleted, may be waiting on finalization",
        "reason": "ContentDeleted",
        "status": "False",
        "type": "NamespaceDeletionContentFailure"
      },
      {
        "lastTransitionTime": "2022-03-02T07:23:22Z",
        "message": "All content successfully removed",
        "reason": "ContentRemoved",
        "status": "False",
        "type": "NamespaceContentRemaining"
      },
      {
        "lastTransitionTime": "2022-03-02T07:23:22Z",
        "message": "All content-preserving finalizers finished",
        "reason": "ContentHasNoFinalizers",
        "status": "False",
        "type": "NamespaceFinalizersRemaining"
      }
    ],
    "phase": "Terminating"
}
}

% kubectl replace --raw "/api/v1/namespaces/rohitgu-intelligence-cluster-6/finalize" -f rohitgu-intelligence-cluster-6-json-nofinalizer

% kg ns rohitgu-intelligence-cluster-6
Error from server (NotFound): namespaces "rohitgu-intelligence-cluster-6" not found
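
To reuse the same steps for another namespace, they can be generated from the namespace name. This is a minimal sketch: ns_finalize_steps is a hypothetical helper that only prints the three commands used above (with kubectl spelled out instead of my kg alias) so that you can review them before running anything.

```shell
# Hypothetical helper: print the commands used in this post for the given
# namespace. Nothing is executed against the cluster.
ns_finalize_steps() {
  ns="$1"
  cat <<EOF
kubectl get ns ${ns} -o json > ${ns}-json
jq '.spec.finalizers = [] | .metadata.finalizers = []' ${ns}-json > ${ns}-json-nofinalizer
kubectl replace --raw "/api/v1/namespaces/${ns}/finalize" -f ${ns}-json-nofinalizer
EOF
}

# Print the steps for the namespace from the example above:
ns_finalize_steps rohitgu-intelligence-cluster-6
```

Removing finalizers skips whatever cleanup they were waiting for, so it is worth confirming first that the namespace is really empty, as was done above.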

Hope it was useful. Cheers!

Friday, October 15, 2021

Kubernetes 101 - Part4 - Kubectl autocomplete and alias

You can use the following to enable auto-completion for kubectl on macOS.

ZSH

Run the following on your terminal:
% source <(kubectl completion zsh)

If you are getting the below error:
/dev/fd/11:2: command not found: compdef

You might need to activate the completion system. Run the following on your terminal:
% autoload -Uz compinit
% compinit

% source <(kubectl completion zsh)
% echo '[[ $commands[kubectl] ]] && source <(kubectl completion zsh)' >> ~/.zshrc

Now you can use tab for auto-completion of kubectl commands.

Alias

You can create aliases and add them to your ~/.zshrc file. Following are the aliases I use:

alias k="kubectl"
alias kg="kubectl get"
alias kge="kubectl get events"
alias kd="kubectl describe"
alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'

After adding and saving the above in your ~/.zshrc file, make sure you relaunch the terminal. Now you are ready to use the aliases.

Example:

% kg tkc -n vineetha-test-node-dns
NAME   CONTROL PLANE   WORKER   TKR NAME                            AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.19.14---vmware.1-tkg.1.8753786   68d   False   True             [1.20.9+vmware.1-tkg.1.a4cee5b]

% kgtkc | grep vineetha
vineetha-test-node-dns   gc   running   2021-09-28T08:39:56Z   v1.19.14+vmware.1-tkg.1.8753786   1   3

References

https://kubernetes.io/docs/reference/kubectl/cheatsheet/
https://unix.stackexchange.com/questions/339954/zsh-command-not-found-compinstall-compinit-compdef

Saturday, September 25, 2021

vSphere with Tanzu using NSX-T - Part11 - Troubleshooting Tanzu Kubernetes Clusters

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster

In this article, we will go through some basic kubectl commands that may help you in troubleshooting Tanzu Kubernetes clusters. I have noticed there are cases where the guest TKCs are getting stuck at creating or updating phases.

List all TKCs that are stuck at creating/ updating:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating

On the newer versions of WCP, you may not see the TKC phase (creating/ updating/ running) in the kubectl output. I am using the following custom alias for it.

alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'

You can add it to your ~/.zshrc file and relaunch the terminal. Example usage:

% kgtkc | grep updating
c1nsxtest1-sla   gc   updating   2021-01-21T08:23:37Z   v1.19.7+vmware.1-tkg.2.f52f85a   3   3
w2cei-sep20      gc   updating   2021-09-16T17:48:07Z   v1.20.9+vmware.1-tkg.1.a4cee5b   1   4
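
When there are many TKCs, a count per phase gives a quicker overview than grepping for one phase at a time. The following is a sketch: tkc_phase_counts is a hypothetical helper that reads the full kgtkc output (header included) on stdin and assumes PHASE is the third column, as in the alias above.

```shell
# Hypothetical helper: count TKCs per phase from kgtkc-style output on stdin.
# The first line is treated as the header and skipped.
tkc_phase_counts() {
  awk 'NR > 1 { count[$3]++ } END { for (p in count) print p, count[p] }' | sort
}

# Demonstration with the two clusters above plus one made-up running cluster:
printf '%s\n' \
  'NAMESPACE NAME PHASE CREATIONTIME VERSION CP WORKER' \
  'c1nsxtest1-sla gc updating 2021-01-21T08:23:37Z v1.19.7+vmware.1-tkg.2.f52f85a 3 3' \
  'w2cei-sep20 gc updating 2021-09-16T17:48:07Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4' \
  'example-ns gc running 2021-09-28T08:39:56Z v1.19.14+vmware.1-tkg.1.8753786 1 3' \
  | tkc_phase_counts
```

With the alias in place, the live equivalent would be kgtkc | tkc_phase_counts.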

For TKCs that are stuck in the creating phase, the most common reasons are a lack of sufficient resources to provision the nodes, or nodes waiting for IP allocation. For TKCs that are stuck in the updating phase, it may be due to reconciliation issues: newly provisioned nodes might be waiting for an IP address, old nodes may be stuck draining, nodes might be in NotReady state, the specific OVA version may not be available in the content library, etc. You can try the following kubectl commands to get more insight into what's happening:

See events in a namespace:
kubectl get events -n <namespace>

See all events:
kubectl get events -A

Watch events in a namespace:
kubectl get events -n <namespace> -w

List the Cluster API resources supporting the clusters in the current namespace:
kubectl get cluster-api -n <namespace>

Describe TKC:
kubectl describe tkc <tkc_name> -n <namespace>

List TKC virtual machines in a namespace:
kubectl get vm -n <namespace>

List TKC virtual machines in a namespace with its IP:
kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'

List all nodes of a cluster:
kubectl get nodes -o wide

List all pods that are not running:
kubectl get pods -A | grep -vi running

List health status of different cluster components:
kubectl get --raw '/healthz?verbose'

% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

List all API resources (including CRDs) available in your cluster and their API versions:
kubectl api-resources -o wide --sort-by="name"

List available Tanzu Kubernetes releases:
kubectl get tanzukubernetesreleases

List available virtual machine images:
kubectl get virtualmachineimages

List terminating namespaces:
kubectl get ns --field-selector status.phase=Terminating
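
The grep -vi running filter in the list above also prints the header line and any Completed pods. If that gets noisy, a variation that filters on the STATUS column can be sketched as below. not_running_pods is a hypothetical helper, and it assumes the default column layout of kubectl get pods -A, where STATUS is the fourth column.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# only pods whose STATUS is neither Running nor Completed. Header is skipped.
not_running_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Demonstration with made-up sample rows:
printf '%s\n' \
  'NAMESPACE     NAME    READY   STATUS             RESTARTS   AGE' \
  'kube-system   etcd    1/1     Running            0          3d' \
  'default       job-x   0/1     Completed          0          1d' \
  'default       web     0/1     CrashLoopBackOff   12         2h' \
  | not_running_pods
```

Against a cluster this would be used as kubectl get pods -A | not_running_pods.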

You can SSH to the Tanzu Kubernetes cluster nodes as the system user by following this guide:
https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html

Here is an example where I have a TKC under namespace: vineetha-test05-deploy

% kubectl get tkc -n vineetha-test05-deploy
NAME   CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.20.9---vmware.1-tkg.1.a4cee5b   4d5h   True    True             [1.21.2+vmware.1-tkg.1.ee25d55]

% kubectl get vm -n vineetha-test05-deploy -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
[
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-control-plane-ttkmt",
    "internalIP": "172.29.4.194"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-d286z",
    "internalIP": "172.29.4.195"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-hwr8b",
    "internalIP": "172.29.4.197"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-r99x7",
    "internalIP": "172.29.4.196"
  }
]
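
If jq is not available, kubectl's own custom-columns output can produce a similar listing. Below is a sketch that prints only the control plane IPs. It assumes the VM names follow the <tkc_name>-control-plane-<suffix> pattern seen above and reads the same .status.vmIp field. The function name control_plane_ips is my own.

```shell
# Print the IP of every control plane VM in the given supervisor namespace.
# Assumption: control plane VM names contain "-control-plane-".
control_plane_ips() {
  kubectl get vm -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,IP:.status.vmIp |
    awk '$1 ~ /-control-plane-/ { print $2 }'
}

# Usage:
#   control_plane_ips vineetha-test05-deploy
```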

The following YAML deploys a pod named jumpbox in the supervisor namespace vineetha-test05-deploy; from there you can SSH to the TKC nodes.

apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: vineetha-test05-deploy           #REPLACE
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
    - name: ssh-key
      secret:
        secretName: gc-ssh     #REPLACE
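
The two values marked #REPLACE can also be filled in from the shell. Here is a minimal sketch of a helper that prints the same manifest for a given namespace and TKC name. It assumes the SSH secret is named <tkc_name>-ssh, as in the gc and gc-ssh pair above, so verify that with kubectl get secrets -n <namespace> first. The function name render_jumpbox is hypothetical.

```shell
# Print the jumpbox manifest for a supervisor namespace ($1) and TKC name ($2).
# Assumption: the SSH private key secret is called "<tkc_name>-ssh".
render_jumpbox() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: ${1}
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
    - name: ssh-key
      secret:
        secretName: ${2}-ssh
EOF
}

# Usage:
#   render_jumpbox vineetha-test05-deploy gc | kubectl apply -f -
```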

Once you apply the above yaml, you can see the jumpbox pod.

% kubectl get pod -n vineetha-test05-deploy
NAME      READY   STATUS    RESTARTS   AGE
jumpbox   1/1     Running   0          22m

Now, you can connect to the TKC node with its internal IP.

% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
16:50:34 up 4 days, 5:49, 0 users, load average: 2.14, 0.97, 0.65

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt

You can check the status of the control plane containers using crictl ps.

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
bde228417c55a       9000c334d9197       4 days ago          Running             guest-cluster-auth-service     0                   d7abf3db8670d
bc4b8c1bf0e33       a294c1cf07bd6       4 days ago          Running             metrics-server                 0                   2665876cf939e
46a94dcf02f3e       92cb72974660c       4 days ago          Running             coredns                        0                   7497cdf3269ab
f7d32016d6fb7       f48f23686df21       4 days ago          Running             csi-resizer                    0                   b887d394d4f80
ef80f62f3ed65       2cba51b244f27       4 days ago          Running             csi-provisioner                0                   b887d394d4f80
64b570add2859       4d2e937854849       4 days ago          Running             liveness-probe                 0                   b887d394d4f80
c0c1db3aac161       d032188289eb5       4 days ago          Running             vsphere-syncer                 0                   b887d394d4f80
e4df023ada129       e75228f70c0d6       4 days ago          Running             vsphere-csi-controller         0                   b887d394d4f80
e79b3cfdb4143       8a857a48ee57f       4 days ago          Running             csi-attacher                   0                   b887d394d4f80
96e4af8792cd0       b8bffc9e5af52       4 days ago          Running             calico-kube-controllers        0                   b5e467a43b34a
23791d5648ebb       92cb72974660c       4 days ago          Running             coredns                        0                   9bde50bbfb914
0f47d11dc211b       ab1e2f4eb3589       4 days ago          Running             guest-cluster-cloud-provider   0                   fde68175c5d95
5ddfd46647e80       4d2e937854849       4 days ago          Running             liveness-probe                 0                   1a88f26173762
578ddeeef5bdd       e75228f70c0d6       4 days ago          Running             vsphere-csi-node               0                   1a88f26173762
3fcb8a287ea48       9a3d9174ac1e7       4 days ago          Running             node-driver-registrar          0                   1a88f26173762
91b490c14d085       dc02a60cdbe40       4 days ago          Running             calico-node                    0                   35cf458eb80f8
68dbbdb779484       f7ad2965f3ac0       4 days ago          Running             kube-proxy                     0                   79f129c96e6e1
ef423f4aeb128       75bfe47a404bb       4 days ago          Running             docker-registry                0                   752724fbbcd6a
26dd8e1f521f5       9358496e81774       4 days ago          Running             kube-apiserver                 0                   814e5d2be5eab
62745db4234e2       ab8fb8e444396       4 days ago          Running             kube-controller-manager        0                   94543f93f7563
f2fc30c2854bd       9aa6da547b7eb       4 days ago          Running             etcd                           0                   f0a756a4cdc09
b8038e9f90e15       212d4c357a28e       4 days ago          Running             kube-scheduler                 0                   533a44c70e86c
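
With this many containers, it is easy to overlook one that is missing. The sketch below reads crictl ps output and prints any core control plane container that has no Running row. The list of expected names (kube-apiserver, kube-controller-manager, kube-scheduler, etcd) is my assumption based on the output above, and the function name is hypothetical.

```shell
# Read `crictl ps` output on stdin and print each expected control plane
# container that does not appear in a row with state Running.
missing_control_plane_containers() {
  awk '
    BEGIN { n = split("kube-apiserver kube-controller-manager kube-scheduler etcd", want, " ") }
    / Running / {
      for (i = 1; i <= n; i++)
        if (index($0, " " want[i] " ")) seen[want[i]] = 1
    }
    END {
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }
  '
}

# Usage on a control plane node:
#   sudo crictl ps | missing_control_plane_containers
```

No output means all four containers were found in Running state.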

You can check the status of kubelet and containerd services:
sudo systemctl status kubelet.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status kubelet.service
WARNING: terminal is not fully functional
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
     Docs: http://kubernetes.io/docs/
 Main PID: 2234 (kubelet)
    Tasks: 16 (limit: 4728)
   Memory: 88.6M
   CGroup: /system.slice/kubelet.service
           └─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>

Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785    >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045    >

sudo systemctl status containerd.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status containerd.service
WARNING: terminal is not fully functional
● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
   Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
     Docs: https://containerd.io
 Main PID: 1783 (containerd)
    Tasks: 386 (limit: 4728)
   Memory: 639.3M
   CGroup: /system.slice/containerd.service
           ├─ 1783 /usr/local/bin/containerd
           ├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
           ├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
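
The "terminal is not fully functional" warning and the truncated lines come from systemctl paging its output inside the kubectl exec session. When you only need the state of each service, systemctl is-active gives a one-word answer. Here is a small sketch; the function name service_states is my own.

```shell
# Print "<unit>: <state>" for each systemd unit passed as an argument.
service_states() {
  for svc in "$@"; do
    # `systemctl is-active` prints the state (active, inactive, failed, ...)
    # and exits non-zero unless the unit is active, hence the `|| true`.
    state="$(systemctl is-active "$svc" 2>/dev/null || true)"
    printf '%s: %s\n' "$svc" "${state:-unknown}"
  done
}

# Usage on a TKC node:
#   service_states kubelet.service containerd.service
# For the full, unpaged status output:
#   sudo systemctl --no-pager --full status kubelet.service
```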

If you run into issues with the provisioning/deployment of a TKC, you can check the logs under /var/log on the control plane (CP) node:

vmware-system-user@gc-control-plane-ttkmt [ /var/log ]$ ls
audit                  devicelist  sa                  vmware-vgauthsvc.log.0
auth.log               journal     sgidlist            vmware-vmsvc-root.log
btmp                   kubernetes  stigreport.log      vmware-vmtoolsd-root.log
cloud-init.log         lastlog     suidlist            wtmp
cloud-init-output.log  pods        tallylog
containers             private     vmware-imc
cron                   rpmcheck    vmware-network.log
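
For provisioning problems, the cloud-init logs are usually a reasonable place to start. The sketch below pulls the most recent lines that mention an error or failure out of a log read from stdin. The search pattern and the 20-line limit are arbitrary choices of mine, and the function name is hypothetical.

```shell
# Read a log on stdin and print up to 20 of the most recent lines that
# mention "error" or "fail" (case-insensitive), prefixed with line numbers.
recent_log_errors() {
  grep -inE 'error|fail' | tail -n 20
}

# Usage on the CP node:
#   sudo cat /var/log/cloud-init-output.log | recent_log_errors
```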

The following VMware blog series (with videos) covers the different resources involved in the deployment process and the troubleshooting of TKCs provisioned using the TKG service running on the supervisor cluster.

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-1

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-2

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-3

Hope it was useful. Cheers!
