Showing posts with label PVC. Show all posts
Showing posts with label PVC. Show all posts

Sunday, July 9, 2023

vSphere with Tanzu using NSX-T - Part27 - nullfinalizer kubectl plugin

I have seen many cases where the supervisor namespace gets stuck at Terminating phase waiting on finalization on some of its child resources. This plugin can be used for setting finalizer to null for all objects of a specified api resource under a supervisor namespace. It will be helpful in cleaning up supervisor namespaces stuck terminating phase and can be also used to clean up stale resources under a supervisor namespace.

kubectl-nullfinalizer

#!/bin/bash

Help()
{
   # Display Help
   echo "This plugin sets finalizer to null for specified resource in a namespace."
   echo "Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME"
   echo "Example: kubectl nullfinalizer vineetha-svns01 pvc"
}

# Get the options
while getopts ":h" option; do
   case $option in
      h) # display Help
         Help
         exit;;
     \?) # incorrect option
         echo "Error: Invalid option"
         exit;;
   esac
done

kubectl get -n $1 $2 --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n $1 $2 {} -p '{"metadata":{"finalizers": null}}' --type=merge

Usage

  • Place the plugin in the system executable path.
  • I placed it in $HOME/.krew/bin in my laptop.
  • Once you copied the plugin to the proper path, you can make it executable by: chmod 755 kubectl-nullfinalizer .
  • After that you should be able to run the plugin as: kubectl nullfinalizer SUPERVISORNAMESPACE RESOURCENAME .


Example

Following is an exmaple of a supervisor namespace stuck at Terminating phase. While describe you can see that it is waiting on finalization. 

❯ k config current-context
wdc-08-vc07
❯ kg ns svc-sct-bot-dogfooding
NAME                     STATUS        AGE
svc-sct-bot-dogfooding   Terminating   584d

❯ kg ns svc-sct-bot-dogfooding -oyaml

status:
  conditions:
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: 'Some resources are remaining: clusters.cluster.x-k8s.io has 1 resource
      instances, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io has 1 resource
      instances, machines.cluster.x-k8s.io has 4 resource instances, persistentvolumeclaims.
      has 9 resource instances, projects.registryagent.vmware.com has 1 resource instances,
      tanzukubernetesclusters.run.tanzu.vmware.com has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: 'Some content in the namespace has finalizers remaining: cluster.cluster.x-k8s.io
      in 1 resource instances, cns.vmware.com/pvc-protection in 9 resource instances,
      controller-finalizer in 1 resource instances, kubeadm.controlplane.cluster.x-k8s.io
      in 1 resource instances, machine.cluster.x-k8s.io in 4 resource instances, tanzukubernetescluster.run.tanzu.vmware.com
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

❯ kg pvc -n svc-sct-bot-dogfooding
NAME                                 STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
gc1-workers-r9jvb-4sfjc-containerd   Terminating   pvc-0d9f4a38-86ad-41d8-ab11-08707780fd85   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc1-workers-r9jvb-szg9r-containerd   Terminating   pvc-ca6b6ec4-85fa-464c-abc6-683358994f3f   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc1-workers-r9jvb-zbdt8-containerd   Terminating   pvc-8f2b0683-ebba-46cb-a691-f79a0e94d0e2   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc2-workers-vpzl2-ffkgx-containerd   Terminating   pvc-69e64099-42c8-44b5-bef2-2737eca49c36   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc2-workers-vpzl2-hww5v-containerd   Terminating   pvc-5a909482-4c95-42c7-b55a-57372f72e75f   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc2-workers-vpzl2-stsnh-containerd   Terminating   pvc-ed7de540-72f4-4832-8439-da471bf4c892   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc3-workers-2qr4c-64xpz-containerd   Terminating   pvc-38478f19-8180-4b9b-b5a9-8c06f17d0fbc   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc3-workers-2qr4c-dpng5-containerd   Terminating   pvc-a8b12657-10bd-4993-b08e-51b7e9b259f9   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc3-workers-2qr4c-wfvvd-containerd   Terminating   pvc-01c6b224-9dc0-4e03-b87e-641d4a4d0d95   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d

❯ k nullfinalizer -h
This plugin sets finalizer to null for specified resource in a namespace.
Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME
Example: kubectl nullfinalizer vineetha-svns01 pvc


❯ k nullfinalizer svc-sct-bot-dogfooding pvc
persistentvolumeclaim/gc1-workers-r9jvb-4sfjc-containerd patched
persistentvolumeclaim/gc1-workers-r9jvb-szg9r-containerd patched
persistentvolumeclaim/gc1-workers-r9jvb-zbdt8-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-ffkgx-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-hww5v-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-stsnh-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-64xpz-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-dpng5-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-wfvvd-containerd patched


❯ kg projects.registryagent.vmware.com -n svc-sct-bot-dogfooding
NAME                     AGE
svc-sct-bot-dogfooding   584d

❯ k nullfinalizer -h
This plugin sets finalizer to null for specified resource in a namespace.
Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME
Example: kubectl nullfinalizer vineetha-svns01 pvc

❯ k nullfinalizer svc-sct-bot-dogfooding projects.registryagent.vmware.com
project.registryagent.vmware.com/svc-sct-bot-dogfooding patched


❯ kg ns svc-sct-bot-dogfooding
Error from server (NotFound): namespaces "svc-sct-bot-dogfooding" not found

 

Hope it was useful. Cheers!

Saturday, July 30, 2022

vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase

Ideally if everything goes well the TKCs (Tanzu Kubernetes Cluster aka Guest Cluster)  should be in running phase. But sometimes due to several reasons it may be stuck at updating phase. In this article, we will take a sample case and look at troubleshooting/ fixing it. 

Following is an example:

NAMESPACE              NAME                    PHASE      CREATIONTIME           VERSION                           CP    WORKER
karvea-vc17ns11 sc201vc17pace updating 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4

Lets connect to this TKC. Here I have a small plugin (kubectl-gckc) that generates the TKC kubeconfig and gcc is alias to KUBECONFIG=gckubeconfig, where gckubeconfig is the TKC admin kubeconfig file.
❯ k gckc karvea-vc17ns11 sc201vc17pace
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1

❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz poweredOn 189d
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw poweredOn 189d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d



❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz vsphere://42010982-8b25-ad7b-2a1d-bb949def4834 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1


As you can see above, there are two worker machines that are stuck at Deleting phase. It is because the corresponding two worker nodes are at Ready, SchedulingDisabled status. The nodes are not drained yet due to some reason. Once they get drained properly, its status will be changed to NotReady, SchedulingDisabled. Now lets try to drain those worker nodes manually.
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz", aborting command...

There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
cannot delete Pods with local storage (use --delete-emptydir-data to override): nsxi-platform/kafka-2

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
❯ gcc kg pdb
No resources found in default namespace.
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform kafka N/A 1 0 188d
nsxi-platform zookeeper N/A 1 1 188d


Here this worker  node sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz is not getting drained because of the presence of a pod disruption budget (pdb). So, in-order to drain the node, I am taking a back up of the pdb yaml file and delete it. And once the nodes are drained, I will apply the pdb yaml back on to the cluster.
❯ gcc kg pdb -n nsxi-platform kafka -oyaml > pdb-nsxi-platform-kafka.yaml
❯ code pdb-nsxi-platform-kafka.yaml
❯ gcc kg pdb -n nsxi-platform zookeeper -oyaml > pdb-nsxi-platform-zookeeper.yaml
❯ code pdb-nsxi-platform-zookeeper.yaml

❯ gcc k delete pdb kafka -n nsxi-platform
poddisruptionbudget.policy "kafka" deleted
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform zookeeper N/A 1 1 188d

❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
pod/kafka-2 evicted
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz evicted


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz drained


❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw", aborting command...

There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw --ignore-daemonsets
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw drained
The worker nodes are now drained.
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
As soon as the worker nodes are drained, one of them got successfully removed/ deleted, but the other worker node is still present. When we look at the machine resource, you can still see one of the worker machine is still stuck at Deleting phase. In this case I've manually deleted the worker node, still the corresponding worker machine is stuck at Deleting phase.
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1


❯ gcc k delete node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw" deleted

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
Now lets describe the worker machine stuck at Deleting. In this case you can see that there are two PVCs stuck at Terminating status. So I  just edited those two PVCs yaml and set finalizer to null.
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1



❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d


❯ kd machine sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw -n karvea-vc17ns11

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DetectedUnhealthy 13m (x2 over 17m) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has unhealthy node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
Normal SuccessfulDrainNode 13m (x2 over 19m) machine-controller success draining Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal NodeVolumesDetached 12m (x2 over 19m) machine-controller success waiting for node volumes detach Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal MachineMarkedUnhealthy 106s (x4 over 9m58s) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has been marked as unhealthy

❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-755m6-containerd Terminating pvc-da2e4866-bb41-4f74-a4b7-0f74bc7061a1 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-dgmjs-containerd Terminating pvc-64eac528-f160-444c-9a0f-0ed9f6393e06 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d

As soon as the PVCs are removed, you can see the worker machine that was stuck at Deleting got removed, and the TKC chaged its status to running.
❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d

❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1

❯ kgtkca | grep karvea
karvea-vc17ns11 sc201vc17pace running 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4

Note: The above case is a sample scenario and the reasons why the TKC is stuck at updating may vary based on several conditions. This is a generic method one can follow while approaching these kind of issues. 
 
Hope it was useful. Cheers!

Sunday, January 30, 2022

vSphere with Tanzu using NSX-T - Part14 - Testing TKC storage using kubestr

In the previous posts we discussed the following:

This article is about using kubestr to test storage options of Tanzu Kubernetes Cluster (TKC). Following are the steps to install kubestr on MAC:

  • wget https://github.com/kastenhq/kubestr/releases/download/v0.4.31/kubestr_0.4.31_MacOS_amd64.tar.gz
  • tar -xvf kubestr_0.4.31_MacOS_amd64.tar.gz 
  • chmod +x kubestr
  • mv kubestr /usr/local/bin

 

Now, lets do kubestr help.

% kubestr help
kubestr is a tool that will scan your k8s cluster
        and validate that the storage systems in place as well as run
        performance tests.

Usage:
  kubestr [flags]
  kubestr [command]

Available Commands:
  browse      Browse the contents of a CSI PVC via file browser
  csicheck    Runs the CSI snapshot restore check
  fio         Runs an fio test
  help        Help about any command

Flags:
  -h, --help             help for kubestr
  -e, --outfile string   The file where test results will be written
  -o, --output string    Options(json)

Use "kubestr [command] --help" for more information about a command.

 

I am going to use the following TKC for testing.

% KUBECONFIG=gc.kubeconfig kubectl get nodes                                            
NAME                               STATUS   ROLES                  AGE    VERSION
gc-control-plane-pwngg             Ready    control-plane,master   103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 103d   v1.20.9+vmware.1

 

Let's run kubestr against the cluster now.

% KUBECONFIG=gc.kubeconfig kubestr                                      

**************************************
  _  ___   _ ___ ___ ___ _____ ___
  | |/ / | | | _ ) __/ __|_   _| _ \
  | ' <| |_| | _ \ _|\__ \ | | |   /
  |_|\_\\___/|___/___|___/ |_| |_|_\

Explore your Kubernetes storage options
**************************************
Kubernetes Version Check:
  Valid kubernetes version (v1.20.9+vmware.1)  -  OK

RBAC Check:
  Kubernetes RBAC is enabled  -  OK

Aggregated Layer Check:
  The Kubernetes Aggregated Layer is enabled  -  OK

W0130 14:17:16.937556   87541 warnings.go:70] storage.k8s.io/v1beta1 CSIDriver is deprecated in v1.19+, unavailable in v1.22+; use storage.k8s.io/v1 CSIDriver
Available Storage Provisioners:

  csi.vsphere.xxxx.com:
    Can't find the CSI snapshot group api version.
    This is a CSI driver!
    (The following info may not be up to date. Please check with the provider for more information.)
    Provider:            vSphere
    Website:             https://github.com/kubernetes-sigs/vsphere-csi-driver
    Description:         A Container Storage Interface (CSI) Driver for VMware vSphere
    Additional Features: Raw Block,<br/><br/>Expansion (Block Volume),<br/><br/>Topology Aware (Block Volume)

    Storage Classes:
      * sc2-01-vc16c01-wcp-mgmt

    To perform a FIO test, run-
      ./kubestr fio -s <storage class>

 

 

You can run storage tests using kubestr and it uses FIO for generating IOs. For example this is how you can run a basic storage test.

% KUBECONFIG=gc.kubeconfig kubestr fio -s sc2-01-vc16c01-wcp-mgmt -z 10G
PVC created kubestr-fio-pvc-zvdhr
Pod created kubestr-fio-pod-kdbs5
Running FIO test (default-fio) on StorageClass (sc2-01-vc16c01-wcp-mgmt) with a PVC of Size (10G)
Elapsed time- 29.290421119s
FIO test results:
 
FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=3987.150391 BW(KiB/s)=15965
  iops: min=3680 max=4274 avg=3992.034424
  bw(KiB/s): min=14720 max=17096 avg=15968.827148

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=3562.628906 BW(KiB/s)=14267
  iops: min=3237 max=3750 avg=3565.896484
  bw(KiB/s): min=12950 max=15000 avg=14264.862305

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=2988.549316 BW(KiB/s)=383071
  iops: min=2756 max=3252 avg=2992.344727
  bw(KiB/s): min=352830 max=416256 avg=383056.187500

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=2754.796143 BW(KiB/s)=353151
  iops: min=2480 max=2992 avg=2759.586182
  bw(KiB/s): min=317440 max=382976 avg=353242.781250

Disk stats (read/write):
  sdd: ios=117160/105647 merge=0/1210 ticks=2100090/2039676 in_queue=4139076, util=99.608589%
  -  OK

As you can see, a PVC of 10G, a FIO pod will be created, and this will be used for the FIO test. Once the test is complete, the PVC and FIO pod will be deleted automatically. 

I hope it was useful. Cheers!


Thursday, July 9, 2020

Tanzu Kubernetes Grid (TKG) on vSphere 6.7 U3 - Part3

In this blog, I will explain how to deploy an FIO application pod with persistent storage on your Tanzu Kubernetes workload cluster.

Step 1: Deploy a K8s workload cluster

tkg create cluster <cluster name> --plan=dev


Now the workload K8s cluster is deployed with a Master, LB, and Worker node.