Saturday, July 30, 2022

vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase

Ideally, if everything goes well, the TKCs (Tanzu Kubernetes Clusters, aka Guest Clusters) should be in the running phase. But sometimes, due to several reasons, a TKC may get stuck in the updating phase. In this article, we will take a sample case and look at troubleshooting/ fixing it.

Following is an example:

NAMESPACE              NAME                    PHASE      CREATIONTIME           VERSION                           CP    WORKER
karvea-vc17ns11 sc201vc17pace updating 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4

Let's connect to this TKC. Here I have a small plugin (kubectl-gckc) that generates the TKC kubeconfig, and gcc is an alias for KUBECONFIG=gckubeconfig, where gckubeconfig is the TKC admin kubeconfig file.
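If you don't have such a plugin, the TKC admin kubeconfig can usually be pulled straight from the corresponding secret in the supervisor namespace; a rough sketch (the <TKC-name>-kubeconfig secret naming is an assumption, verify it in your environment):
❯ kubectl get secret sc201vc17pace-kubeconfig -n karvea-vc17ns11 -o jsonpath='{.data.value}' | base64 -d > gckubeconfig
❯ alias gcc='KUBECONFIG=gckubeconfig '   # trailing space lets the next word (like kg) also expand as an alias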
❯ k gckc karvea-vc17ns11 sc201vc17pace
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1

❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz poweredOn 189d
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw poweredOn 189d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d



❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz vsphere://42010982-8b25-ad7b-2a1d-bb949def4834 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1


As you can see above, there are two worker machines stuck in the Deleting phase. This is because the corresponding two worker nodes are in Ready,SchedulingDisabled status. The nodes have not been drained yet for some reason. Once they get drained properly, their status will change to NotReady,SchedulingDisabled. Now let's try to drain those worker nodes manually.
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz", aborting command...

There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
cannot delete Pods with local storage (use --delete-emptydir-data to override): nsxi-platform/kafka-2

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
❯ gcc kg pdb
No resources found in default namespace.
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform kafka N/A 1 0 188d
nsxi-platform zookeeper N/A 1 1 188d


Here the worker node sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz is not getting drained because of the presence of a pod disruption budget (pdb). So, in order to drain the node, I am taking a backup of the pdb yaml and then deleting it. Once the nodes are drained, I will apply the pdb yaml back onto the cluster.
❯ gcc kg pdb -n nsxi-platform kafka -oyaml > pdb-nsxi-platform-kafka.yaml
❯ code pdb-nsxi-platform-kafka.yaml
❯ gcc kg pdb -n nsxi-platform zookeeper -oyaml > pdb-nsxi-platform-zookeeper.yaml
❯ code pdb-nsxi-platform-zookeeper.yaml

❯ gcc k delete pdb kafka -n nsxi-platform
poddisruptionbudget.policy "kafka" deleted
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform zookeeper N/A 1 1 188d

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
pod/kafka-2 evicted
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz evicted


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz drained


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw", aborting command...

There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw --ignore-daemonsets
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw drained
The worker nodes are now drained.
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
As soon as the worker nodes were drained, one of them got removed/ deleted successfully, but the other worker node is still present. Looking at the machine resources, you can see that one of the worker machines is still stuck in the Deleting phase. In this case I manually deleted the worker node, but the corresponding worker machine remained stuck in Deleting.
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1


❯ gcc k delete node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw" deleted

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
Now let's describe the worker machine stuck in Deleting. In this case you can see that there are two PVCs stuck in Terminating status. So I edited those two PVCs and set their finalizers to null (a sample patch command is shown after the PVC listing below).
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1



❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d


❯ kd machine sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw -n karvea-vc17ns11

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DetectedUnhealthy 13m (x2 over 17m) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has unhealthy node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
Normal SuccessfulDrainNode 13m (x2 over 19m) machine-controller success draining Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal NodeVolumesDetached 12m (x2 over 19m) machine-controller success waiting for node volumes detach Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal MachineMarkedUnhealthy 106s (x4 over 9m58s) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has been marked as unhealthy

❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-755m6-containerd Terminating pvc-da2e4866-bb41-4f74-a4b7-0f74bc7061a1 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-dgmjs-containerd Terminating pvc-64eac528-f160-444c-9a0f-0ed9f6393e06 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
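Instead of editing each PVC's yaml by hand, the finalizers can also be cleared with a merge patch; a sample using the two Terminating PVCs above:
❯ kubectl patch pvc sc201vc17pace-workers-wswdh-755m6-containerd -n karvea-vc17ns11 --type=merge -p '{"metadata":{"finalizers":null}}'
❯ kubectl patch pvc sc201vc17pace-workers-wswdh-dgmjs-containerd -n karvea-vc17ns11 --type=merge -p '{"metadata":{"finalizers":null}}'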

As soon as the PVCs are removed, the worker machine that was stuck in Deleting gets removed as well, and the TKC changes its status to running.
❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d

❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1

❯ kgtkca | grep karvea
karvea-vc17ns11 sc201vc17pace running 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4

Note: The above case is a sample scenario, and the reasons why a TKC is stuck in updating may vary based on several conditions. This is a generic method one can follow while approaching these kinds of issues.
 
Hope it was useful. Cheers!

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

In this article, we will take a look at troubleshooting some of the content library (CL) related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.


Case 1:
 
TKC (guest K8s cluster) deployments were failing as VMs were not getting deployed. You can see a Failed to deploy OVF package error in the VC UI. This was due to the error A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com while syncing the content library.
 
 

Following is a sample log for this issue from the vmop-controller-manager:

Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error
This can be resolved by simply editing the content library and accepting the new certificate thumbprint.
 

Case 2:
 
Missing TKRs. Even though the CL is present in the VC and has all the required OVF templates, the TKR resources will be missing/ not found on the supervisor cluster.
❯ kubectl get tkr
No resources found

This could happen if there are duplicate content libraries present in the VC with the same Subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronizing the CL.

If this doesn't resolve the issue, try deleting and recreating the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.


You may also verify the vmware-system-vmop-controller-manager and capw-controller-manager pod logs. Check whether those pods are running or getting continuously restarted. If required, you may restart those pods.
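For example, something along these lines on the supervisor cluster (the vmop pod name is taken from the Case 1 log sample above; the vmware-system-capw namespace is an assumption, verify it in your environment):
❯ kubectl get pods -n vmware-system-vmop
❯ kubectl logs vmware-system-vmop-controller-manager-85484c67b7-9jncl -n vmware-system-vmop --all-containers
❯ kubectl delete pod vmware-system-vmop-controller-manager-85484c67b7-9jncl -n vmware-system-vmop   # gets recreated by its Deployment
❯ kubectl get pods -n vmware-system-capw
❯ kubectl logs <capw-controller-manager-pod-name> -n vmware-system-capw --all-containers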



Case 3:
 

TKC deployments were failing as VMs were not getting deployed. Sample vmop-controller-manager logs are given below:
E0803 18:51:30.638787       1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638821 1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638851 1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.639301 1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"

This could be resolved by restarting the cm-inventory service on all NSX-T manager nodes. Following are the commands (run on each NSX-T manager node's CLI) to restart the cm-inventory service:
get service cm-inventory  
restart service cm-inventory

Case 4: 
Sometimes in the WCP K8s layer you will notice stale contentsources object entries. Contentsources are the K8s-layer counterparts of content libraries. Due to various reasons/ requirements you might have created multiple content libraries and later deleted some of them from the vCenter, but they may not get removed properly from the WCP K8s layer, and that is how these stale contentsources objects are left behind. You can use PowerCLI to list the content libraries currently present in the VC, compare them with the contentsources objects, and remove the stale entries.
> Get-ContentLibrary | select Name,Id | fl

Name : wdc-01-vc18c01-wcp
Id   : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d

> kg contentsources
NAME                                   AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db   173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d   321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662   242d
75e0668c-0cdc-421e-965d-fd736187cc57   173d
818c8700-efa4-416b-b78f-5f22e9555952   173d
9abbd108-aeb3-4b50-b074-9e6c00473b02   173d
a6cd1685-49bf-455f-a316-65bcdefac7cf   173d
acff9a91-0966-4793-9c3a-eb5272b802bd   242d
fcc08a43-1555-4794-a1ae-551753af9c03   173d

In the above sample case you can see multiple contentsources objects, but there is only one content library. So you can delete all the contentsources objects except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.
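For example, the stale entries could be removed one by one (contentsources appear to be cluster-scoped objects here; verify before deleting):
❯ kubectl delete contentsource 0f00d3fa-de54-4630-bc99-aa13ccbe93db
❯ kubectl delete contentsource 451ce3f3-49d7-47d3-9a04-2839c5e5c662
(repeat for the remaining stale IDs, leaving 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d in place)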

Hope it was useful. Cheers!

Saturday, June 25, 2022

GitOps using Argo CD - Part1

In this article we will see how to use Git and Argo CD to deploy/ manage applications on your Kubernetes cluster. Before that, what is GitOps? Simply put, Operations driven using Git and CD tools! Following are the major components of GitOps:
  • Infrastructure as code (IaC)
  • Merge/ pull requests as change agent
  • Continuous Delivery tool (Example: Argo CD)

Basically, you keep all your application deployment manifests in a Git repository. If you'd like to make changes to your application, you create a merge/ pull request. Once it is approved and merged to the main branch, a continuous delivery/ deployment tool like Argo CD will identify that and deploy the latest change to the target Kubernetes cluster. Here I am using a Tanzu Kubernetes Cluster with 1 control plane node and 3 worker nodes. I will be deploying Argo CD as well as my application on this K8s cluster.

❯ kubectl get node
NAME                                STATUS   ROLES                  AGE   VERSION
gc-control-plane-rhpmq              Ready    control-plane,master   34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-692n5   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-jfzrs   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-xvjsh   Ready    <none>                 34d   v1.21.6+vmware.1


Let's create a namespace first.

❯ kubectl create namespace argocd
namespace/argocd created

❯ kubectl apply -f https://gist.githubusercontent.com/vineethac/dafa5b47afd674a1a9f7be2ce773a2bd/raw/4591e837098043b8095a5c48614e1c94b5ca2b44/tkg-psp.yml
clusterrole.rbac.authorization.k8s.io/psp:privileged created
clusterrolebinding.rbac.authorization.k8s.io/all:psp:privileged created


Applying the following yaml manifest will install Argo CD:

❯ kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

❯ kubectl get all -n argocd
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/argocd-application-controller-0                     1/1     Running   0          20h
pod/argocd-applicationset-controller-5f7d8fffb7-82xgp   1/1     Running   0          20h
pod/argocd-dex-server-75f7cff9cd-7vc64                  1/1     Running   0          20h
pod/argocd-notifications-controller-69bf646f87-8bt5n    1/1     Running   0          20h
pod/argocd-redis-748569f956-bskfw                       1/1     Running   0          19h
pod/argocd-repo-server-8699756b5d-7qmx2                 1/1     Running   0          20h
pod/argocd-server-6dd9cd7964-gbfm4                      1/1     Running   0          20h

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-applicationset-controller          ClusterIP   10.106.103.231   <none>        7000/TCP,8080/TCP            20h
service/argocd-dex-server                         ClusterIP   10.103.79.207    <none>        5556/TCP,5557/TCP,5558/TCP   20h
service/argocd-metrics                            ClusterIP   10.105.254.212   <none>        8082/TCP                     20h
service/argocd-notifications-controller-metrics   ClusterIP   10.97.254.140    <none>        9001/TCP                     20h
service/argocd-redis                              ClusterIP   10.97.244.161    <none>        6379/TCP                     20h
service/argocd-repo-server                        ClusterIP   10.101.181.242   <none>        8081/TCP,8084/TCP            20h
service/argocd-server                             ClusterIP   10.105.76.149    <none>        80/TCP,443/TCP               20h
service/argocd-server-metrics                     ClusterIP   10.100.168.241   <none>        8083/TCP                     20h

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/argocd-applicationset-controller   1/1     1            1           20h
deployment.apps/argocd-dex-server                  1/1     1            1           20h
deployment.apps/argocd-notifications-controller    1/1     1            1           20h
deployment.apps/argocd-redis                       1/1     1            1           20h
deployment.apps/argocd-repo-server                 1/1     1            1           20h
deployment.apps/argocd-server                      1/1     1            1           20h

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/argocd-applicationset-controller-5f7d8fffb7   1         1         1       20h
replicaset.apps/argocd-dex-server-75f7cff9cd                  1         1         1       20h
replicaset.apps/argocd-notifications-controller-69bf646f87    1         1         1       20h
replicaset.apps/argocd-redis-748569f956                       1         1         1       19h
replicaset.apps/argocd-repo-server-8699756b5d                 1         1         1       20h
replicaset.apps/argocd-server-6dd9cd7964                      1         1         1       20h

NAME                                             READY   AGE
statefulset.apps/argocd-application-controller   1/1     20h


Let's port forward so that you can access the Argo CD web UI.

❯ kubectl port-forward -n argocd service/argocd-server 8080:443
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080


Now you can access Argo CD in your web browser at https://localhost:8080/


username: admin
password: you can decode it from the following secret

❯ kubectl get secret argocd-initial-admin-secret -n argocd -o yaml
apiVersion: v1
data:
  password: TTFSRXJJZ0NKbHc2Y2JINA==
kind: Secret
metadata:
  creationTimestamp: "2022-06-22T13:00:14Z"
  name: argocd-initial-admin-secret
  namespace: argocd
  resourceVersion: "39224285"
  uid: d3b7c82e-0c92-4418-95eb-95a73fe674b6
type: Opaque

❯ echo TTFSRXJJZ0NKbHc2Y2JINA== | base64 --decode
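The same can be done in one line (same secret, just fetched and decoded together):

❯ kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 --decode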

The next step is to create a repository and add your application yaml files to it. You will also need to create an Argo CD application yaml file, push it to the repository, and then apply the same application yaml file to your cluster.

Following is a screenshot of my repo:


We have the application.yaml file and then inside the dev folder I have two yaml files.

 
 
 

Now, we also have the application yaml file. Make sure to paste your Git repo URL in the repoURL field.
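The screenshot of application.yaml isn't reproduced here, but a manifest along these lines is what Argo CD expects (the repo URL is a placeholder, the path points to the dev folder mentioned above, and the application name and destination namespace are assumptions):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-argo-application
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<your-user>/<your-repo>.git
    targetRevision: HEAD
    path: dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    automated:
      selfHeal: true
      prune: true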

 

Let's apply the application yaml manifest to the Kubernetes cluster; that will connect Argo CD with your Git repo.

Note: If you are using a private repo, then you need to add your repository to Argo CD first and connect with the respective credentials.

 

> kubectl apply -f /Users/vineetha/myrepo/argocd/application.yaml 

Once the application yaml is applied, after a few seconds you can see the nginx pod and svc getting deployed automatically under the myapp namespace.

❯ kubectl get pods,deployment,svc -n myapp
NAME                            READY   STATUS    RESTARTS   AGE
pod/my-nginx-74d7c6cb98-tzdfp   1/1     Running   0          2d

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-nginx   1/1     1            1           2d

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/my-nginx   ClusterIP   10.109.67.35   <none>        80/TCP    2d

 

 
From now on, if you'd like to make changes to your application, say, you want to scale the replicas from 1 to 3, you create a merge/ pull request to the repository with the respective change, and once it's approved and merged to the main branch, Argo CD will automatically detect it, pull it, and apply it to your Kubernetes cluster.

Hope it was useful. Cheers!

Reference video by Nana https://twitter.com/Njuchi_


Saturday, June 11, 2022

Working with Kubernetes using Python - Part 04 - Get namespaces

The following code snippet uses the Python client for the Kubernetes API to get namespace details from a given context:
from kubernetes import client, config
import argparse


def load_kubeconfig(context_name):
    config.load_kube_config(context=f"{context_name}")
    v1 = client.CoreV1Api()
    return v1


def get_all_namespace(v1):
    print("Listing namespaces with their creation timestamp, and status:")
    ret = v1.list_namespace()
    for i in ret.items:
        print(i.metadata.name, i.metadata.creation_timestamp, i.status.phase)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    args = parser.parse_args()

    context = args.context
    v1 = load_kubeconfig(context)
    get_all_namespace(v1)


if __name__ == "__main__":
    main()
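Assuming the snippet above is saved as get_namespaces.py (the filename is only for illustration), it can be run against a specific context like this:

❯ python3 get_namespaces.py --context <context-name>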


Saturday, May 21, 2022

vSphere with Tanzu using NSX-T - Part15 - Working with etcd on TKC with one control plane

In this article, we will see how to work with the etcd database of a Tanzu Kubernetes Cluster (TKC) that has one control plane node and perform some basic operations. Following is a TKC with one control plane node and three worker nodes:

Get K8s cluster nodes
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-6g9gk Ready control-plane,master 3d7h v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-n5qp8 Ready <none> 7d19h v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-wds2m Ready <none> 7d19h v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-z2wvt Ready <none> 7d19h v1.21.6+vmware.1
Get the etcd pod and describe it
❯ gcc kg pod -A | grep etcd
kube-system etcd-gc-control-plane-6g9gk 1/1 Running 0 3d7h

❯ gcc kd pod etcd-gc-control-plane-6g9gk -n kube-system
Name: etcd-gc-control-plane-6g9gk
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: gc-control-plane-6g9gk/100.68.36.38
Start Time: Tue, 19 Jul 2022 11:14:25 +0530
Labels: component=etcd
tier=control-plane
Annotations: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
kubernetes.io/config.hash: 6e7bc05d35060112913f78af2043683f
kubernetes.io/config.mirror: 6e7bc05d35060112913f78af2043683f
kubernetes.io/config.seen: 2022-07-19T05:44:19.416549595Z
kubernetes.io/config.source: file
kubernetes.io/psp: vmware-system-privileged
Status: Running
IP: 100.68.36.38
IPs:
IP: 100.68.36.38
Controlled By: Node/gc-control-plane-6g9gk
Containers:
etcd:
Container ID: containerd://253c7b25bd60ea78dfccad52d03534785f0d7b7a1fa7105dbd55d7727f8785c3
Image: localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
Image ID: sha256:78661ebbe1adaee60336a0f8ff031c4537ff309ef51feab6e840e7dbb3cbf47d
Port: <none>
Host Port: <none>
Command:
etcd
--advertise-client-urls=https://100.68.36.38:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--initial-advertise-peer-urls=https://100.68.36.38:2380
--initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
--initial-cluster-state=existing
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
--listen-metrics-urls=http://127.0.0.1:2381
--listen-peer-urls=https://100.68.36.38:2380
--name=gc-control-plane-6g9gk
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
State: Running
Started: Tue, 19 Jul 2022 11:14:27 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 100Mi
Liveness: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
Startup: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
Environment: <none>
Mounts:
/etc/kubernetes/pki/etcd from etcd-certs (rw)
/var/lib/etcd from etcd-data (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
etcd-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki/etcd
HostPathType: DirectoryOrCreate
etcd-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/etcd
HostPathType: DirectoryOrCreate
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
Events: <none>

Exec into the etcd pod and run etcdctl commands

You can use etcdctl, and you need to provide the cacert, cert, and key details. All of this info is available in the etcd pod description above.

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false


❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint health --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --write-out json"
[{"endpoint":"127.0.0.1:2379","health":true,"took":"9.689387ms"}]

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl
endpoint status --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key -w json"
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":4073335150581888229,"member_id":14250600431682432472,"revision":2153804,"raft_term":11},"version":"3.4.13","dbSize":24719360,"leader":14250600431682432472,"raftIndex":2429139,"raftTerm":11,"raftAppliedIndex":2429139,"dbSizeInUse":2678784}}]


Snapshot etcd
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot save snapshotdb-$(date +%d-%m-%y)"
{"level":"info","ts":1658651541.2698474,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb-24-07-22.part"}
{"level":"info","ts":"2022-07-24T08:32:21.277Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1658651541.2771788,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-07-24T08:32:21.594Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1658651541.621639,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"25 MB","took":0.344746859}
{"level":"info","ts":1658651541.621852,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb-24-07-22"}
Snapshot saved at snapshotdb-24-07-22

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh
# ls
bin boot dev etc home lib lib64 media mnt opt proc root run sbin snapshotdb-24-07-22 srv sys tmp usr var
#
# exit

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot status snapshotdb-24-07-22 -w table"
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 | 2434362 | 1580 | 25 MB |
+----------+----------+------------+------------+

Copy snapshot file from etcd pod to local machine

Note: Even though there was an error while copying the snapshot file from the pod to the local machine, you can see the file was successfully copied, and I also verified the snapshot file status using etcdctl. Every field (hash, total keys, etc.) matches that of the source file.

❯ gcc kubectl cp kube-system/etcd-gc-control-plane-6g9gk:/snapshotdb-24-07-22 snapshotdb-24-07-22
tar: Removing leading `/' from member names
error: unexpected EOF
❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 -w table
Deprecated: Use `etcdutl snapshot status` instead.

+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 | 2434362 | 1580 | 25 MB |
+----------+----------+------------+------------+   
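If the kubectl cp EOF error is a concern, streaming the file out with kubectl exec is an alternative worth trying (just a sketch; not what was used above):

❯ gcc kubectl exec -n kube-system etcd-gc-control-plane-6g9gk -- cat /snapshotdb-24-07-22 > snapshotdb-24-07-22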

Restore etcd 

We can restore the etcd snapshot using etcdctl from the TKC control plane node. In order to connect to the control plane VM, we need to create a jumpbox pod under the corresponding supervisor namespace.
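A minimal jumpbox pod sketch, loosely following the pattern in the VMware documentation (the SSH secret name <TKC-name>-ssh, here gc-ssh, is an assumption; ssh/scp clients may need to be installed inside the pod, for example with tdnf):

apiVersion: v1
kind: Pod
metadata:
  name: jumpbox01
  namespace: vineetha-test04-deploy
spec:
  containers:
  - name: jumpbox
    image: "photon:3.0"
    command: [ "/bin/bash", "-c", "--" ]
    # copy the TKC node SSH private key into place and keep the pod running
    args: [ "mkdir -p /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
    - mountPath: "/root/ssh"
      name: ssh-key
      readOnly: true
  volumes:
  - name: ssh-key
    secret:
      secretName: gc-ssh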

So, first copy the snapshot file from local machine to jumpbox pod. 

❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ kubectl cp snapshotdb-24-07-22 vineetha-test04-deploy/jumpbox01:/

❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- sh
sh-4.4# su
root [ / ]# ls
bin dev home lib64 mnt root sbin srv tmp var
boot etc lib media proc run snapshotdb-24-07-22 sys usr
root [ / ]#
Copy the snapshot file from jumpbox pod to control plane node.
❯ gcc kg po -A -o wide| grep etcd
kube-system etcd-gc-control-plane-6g9gk 1/1 Running 0 166m 100.68.36.38 gc-control-plane-6g9gk <none> <none>

❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- scp /snapshotdb-24-07-22 vmware-system-user@100.68.36.38:/tmp
snapshotdb-24-07-22 100% 20MB 126.1MB/s 00:00

❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- /usr/bin/ssh vmware-system-user@100.68.36.38
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Sun Jul 24 13:02:39 2022 from 100.68.35.210
13:14:29 up 5 days, 7:38, 0 users, load average: 0.98, 0.53, 0.31

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# cd /tmp/
 
Install etcd on the control plane node so that we get access to the etcdctl utility.
root [ /tmp ]# tdnf install etcd
root [ /tmp ]# ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
root [ /tmp ]#
root [ /tmp ]# ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
Deprecated: Use `etcdutl snapshot status` instead.

b0910e83, 2434362, 1580, 25 MB
root [ /tmp ]# hostname
gc-control-plane-6g9gk
root [ /tmp ]# ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot restore /tmp/snapshotdb-24-07-22 --data-dir=/var/lib/etcd-backup --skip-hash-check=true
Deprecated: Use `etcdutl snapshot restore` instead.

2022-07-24T13:20:44Z info snapshot/v3_snapshot.go:251 restoring snapshot {"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/snapshot/v3_snapshot.go:257\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/command/snapshot_command.go:128\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/main.go:59\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
2022-07-24T13:20:44Z info membership/store.go:141 Trimming membership information from the backend...
2022-07-24T13:20:44Z info membership/cluster.go:421 added member {"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]}
2022-07-24T13:20:44Z info snapshot/v3_snapshot.go:272 restored snapshot {"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap"}
root [ /var/lib ]#
 
We have restored the database snapshot to a new location: --data-dir=/var/lib/etcd-backup. So we need to modify the etcd-data hostPath to path: /var/lib/etcd-backup in the etcd static pod manifest file (etcd.yaml). Copy the contents of the etcd.yaml file.
root [ /var/lib ]# cd /etc/kubernetes/manifests/
root [ /etc/kubernetes/manifests ]# ls
etcd.yaml kube-controller-manager.yaml registry.yaml
kube-apiserver.yaml kube-scheduler.yaml
root [ /etc/kubernetes/manifests ]# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://100.68.36.38:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://100.68.36.38:2380
    - --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://100.68.36.38:2380
    - --name=gc-control-plane-6g9gk
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}
I was having difficulties modifying it in the terminal. So I copied the contents of the etcd.yaml file locally, modified the path, removed the existing etcd.yaml file, created a new etcd.yaml file, and pasted the modified content into it.
root [ /etc/kubernetes/manifests ]# rm etcd.yaml
root [ /etc/kubernetes/manifests ]# vi etcd.yaml

<paste the above etcd.yaml file contents with the modified etcd-data hostPath; the last part of the yaml will look like below:

  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd-backup
      type: DirectoryOrCreate
    name: etcd-data
status: {}

>
Once the etcd.yaml is saved, after a few seconds you can see that the etcd pod is running.
root [ /etc/kubernetes/manifests ]# crictl ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
92dad43a85ebc 78661ebbe1ada 1 second ago Running etcd 0 8258804cf17bb
5c704092eb4bb 25605c4ab20fe 10 seconds ago Running csi-resizer 10 1fa9feb732df5
495b5ff250cb4 05cfd9e3c3f22 10 seconds ago Running csi-provisioner 22 1fa9feb732df5
c0da43af7f1d0 a145efcc3afb4 11 seconds ago Running vsphere-syncer 10 1fa9feb732df5
4e6d67dc16f4a 5cb2119a4d797 11 seconds ago Running kube-controller-manager 11 3d1c266aa5e24
8710a7b6a8563 fa70d7ee973ad 11 seconds ago Running guest-cluster-cloud-provider 35 3f3d5eb0929e5
4e0c1e2e72682 f18cde23836f5 11 seconds ago Running csi-attacher 10 1fa9feb732df5
bf6771ca4dc4d a609b91a17410 11 seconds ago Running kube-scheduler 11 f50eada3f8127
05fa3f2f587e8 fa70d7ee973ad 3 hours ago Exited guest-cluster-cloud-provider 34 3f3d5eb0929e5
888ba7ce34d92 3f6d2884f8105 3 hours ago Running kube-apiserver 28 c368bd9937f8b
6579ffffa5f53 382a8821c56e0 3 hours ago Running metrics-server 21 622c835648008
bdd0c760d0bc7 382a8821c56e0 3 hours ago Exited metrics-server 20 622c835648008
6063485d73e38 78661ebbe1ada 3 hours ago Exited etcd 0 71e6ae04726ad
259595a330d26 3f6d2884f8105 3 hours ago Exited kube-apiserver 27 c368bd9937f8b
5034f4ea18f1f 25605c4ab20fe 4 hours ago Exited csi-resizer 9 1fa9feb732df5
21de3c4850dc3 05cfd9e3c3f22 4 hours ago Exited csi-provisioner 21 1fa9feb732df5
d623946b1d270 a145efcc3afb4 4 hours ago Exited vsphere-syncer 9 1fa9feb732df5
5f50b9e93d287 a609b91a17410 4 hours ago Exited kube-scheduler 10 f50eada3f8127
ec4e066f54fd6 5cb2119a4d797 4 hours ago Exited kube-controller-manager 10 3d1c266aa5e24
f05fee251a700 f18cde23836f5 4 hours ago Exited csi-attacher 9 1fa9feb732df5
d3577bf8477d0 b0f879c3b53ce 5 days ago Running liveness-probe 0 1fa9feb732df5
d30ba30b8c203 4251b7012fd43 5 days ago Running vsphere-csi-controller 0 1fa9feb732df5
1816ed6aada5f 02abc4bd595a0 5 days ago Running guest-cluster-auth-service 0 bff70bcd389be
3e014a6745f5a b0f879c3b53ce 5 days ago Running liveness-probe 0 66e5ca3abfe5b
ba82f29cf8939 4251b7012fd43 5 days ago Running vsphere-csi-node 0 66e5ca3abfe5b
9c29960956718 f3fe18dd8cea2 5 days ago Running node-driver-registrar 0 66e5ca3abfe5b
92aa3a904d72e 0515f8357a522 5 days ago Running antrea-agent 3 4ae63a9f8e4cb
9e0f32fa88663 0515f8357a522 5 days ago Exited antrea-agent 2 4ae63a9f8e4cb
ac5573a4d93f8 0515f8357a522 5 days ago Running antrea-ovs 0 4ae63a9f8e4cb
e14fea2c37c21 0515f8357a522 5 days ago Exited install-cni 0 4ae63a9f8e4cb
166627360a434 7fde82047d4f6 5 days ago Running docker-registry 0 c84c550a9ab90
ecfbdcd23858d f31127f4a3471 5 days ago Running kube-proxy 0 07f5b9be02414
root [ /etc/kubernetes/manifests ]# exit
exit
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ exit
logout

Verify

In my case, I had a namespace vineethac-testing with two nginx pods running under it when the snapshot was taken. After the snapshot was taken, I deleted the two nginx pods and the namespace vineethac-testing. After restoring the etcd snapshot, I can see that the namespace vineethac-testing is active with the two nginx pods under it.

❯ gcc kg ns
NAME STATUS AGE
default Active 9d
kube-node-lease Active 9d
kube-public Active 9d
kube-system Active 9d
vineethac-testing Active 4h28m
vmware-system-auth Active 9d
vmware-system-cloud-provider Active 9d
vmware-system-csi Active 9d

❯ gcc kg pods -n vineethac-testing
NAME READY STATUS RESTARTS AGE
nginx1 1/1 Running 0 4h25m
nginx2 1/1 Running 0 4h23m 

Hope it was useful. Cheers!

Note: I've tested this in a lab. This may not be the best practice procedure and may slightly vary in a real world environment.

Friday, April 8, 2022

Working with Kubernetes using Python - Part 03 - Get nodes

The following code snippet uses the kubeconfig Python module to switch context and the Python client for the Kubernetes API to get cluster node details. It takes the default kubeconfig file, switches to the required context, and gets the node info of the respective cluster.

kubectl commands: 

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>
kubectl get nodes -o json

Code: 
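The embedded code is not reproduced here, but a sketch of what it could look like is below, mirroring the Part 04 snippet further up and using the kubeconfig package to switch context (the KubeConfig().use_context() call is per that package's documentation; treat this as an illustration rather than the original code):

from kubeconfig import KubeConfig
from kubernetes import client, config
import argparse


def switch_context(context_name):
    # switch the current-context in the default kubeconfig file
    conf = KubeConfig()
    conf.use_context(context_name)


def get_all_nodes():
    # load the (now switched) default kubeconfig and list the cluster nodes
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        print(node.metadata.name, node.status.node_info.kubelet_version)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    args = parser.parse_args()
    switch_context(args.context)
    get_all_nodes()


if __name__ == "__main__":
    main()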

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/
https://github.com/kubernetes-client/python

Hope it was useful. Cheers!