Saturday, June 17, 2023
Kubernetes 101 - Part10 - Plugins I use for managing K8s clusters
Following are some of the kubectl plugins that I use on a daily basis:
Friday, June 9, 2023
vSphere with Tanzu using NSX-T - Part26 - Jumpbox kubectl plugin to SSH to TKC node
For troubleshooting TKC (Tanzu Kubernetes Cluster) you may need to ssh into the TKC nodes. For doing ssh, you will need to first create a jumpbox pod under the supervisor namespace and from there you can ssh to the TKC nodes.
Here is the manual procedure: https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html
The following kubectl plugin creates a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs.
kubectl-jumpbox
#!/bin/bash

Help()
{
   # Display Help
   echo "Description: This plugin creates a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs."
   echo "Usage: kubectl jumpbox SVNAMESPACE TKCNAME"
   echo "Example: k exec -it jumpbox-tkc1 -n svns1 -- /usr/bin/ssh vmware-system-user@VMIP"
}

# Get the options
while getopts ":h" option; do
   case $option in
      h) # display Help
         Help
         exit;;
     \?) # incorrect option
         echo "Error: Invalid option"
         exit;;
   esac
done

kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox-$2
  namespace: $1 #REPLACE
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
    - name: ssh-key
      secret:
        secretName: $2-ssh #REPLACE YOUR-CLUSTER-NAME-ssh
EOF
Usage
- Place the plugin in the system executable path.
- I placed it in the $HOME/.krew/bin directory on my laptop.
- Once you have copied the plugin to the proper path, make it executable: chmod 755 kubectl-jumpbox
- After that you should be able to run the plugin as: kubectl jumpbox SUPERVISORNAMESPACE TKCNAME (a sketch of the full workflow follows)
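Putting it all together, a typical session looks like this (a sketch; SUPERVISORNAMESPACE, TKCNAME, and VMIP are placeholders):

# install the plugin and make it executable
chmod 755 kubectl-jumpbox
mv kubectl-jumpbox $HOME/.krew/bin/

# create the jumpbox pod in the supervisor namespace
kubectl jumpbox SUPERVISORNAMESPACE TKCNAME

# find the TKC node IPs
kubectl get vm -n SUPERVISORNAMESPACE -o wide

# ssh to a TKC node through the jumpbox pod
kubectl exec -it jumpbox-TKCNAME -n SUPERVISORNAMESPACE -- /usr/bin/ssh vmware-system-user@VMIP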
Example
❯ kg tkc -n vineetha-dns1-test
NAME               CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
tkc                1               3        v1.21.6---vmware.1-tkg.1.b3d708a   213d   True    True             [1.22.9+vmware.1-tkg.1.cc71bc8]
tkc-using-cci-ui   1               1        v1.23.8---vmware.3-tkg.1           37d    True    True
❯
❯ kg po -n vineetha-dns1-test
NAME         READY   STATUS    RESTARTS   AGE
nginx-test   1/1     Running   0          29d
❯
❯ kubectl jumpbox vineetha-dns1-test tkc
pod/jumpbox-tkc created
❯
❯ kg po -n vineetha-dns1-test
NAME          READY   STATUS    RESTARTS   AGE
jumpbox-tkc   0/1     Pending   0          8s
nginx-test    1/1     Running   0          29d
❯
❯ kg po -n vineetha-dns1-test
NAME          READY   STATUS    RESTARTS   AGE
jumpbox-tkc   1/1     Running   0          21s
nginx-test    1/1     Running   0          29d
❯
❯ k jumpbox -h
Description: This plugin creates a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs.
Usage: kubectl jumpbox SVNAMESPACE TKCNAME
Example: k exec -it jumpbox-tkc1 -n svns1 -- /usr/bin/ssh vmware-system-user@VMIP
❯
❯ kg vm -n vineetha-dns1-test -o wide
NAME                                                              POWERSTATE   CLASS               IMAGE                                                       PRIMARY-IP      AGE
tkc-control-plane-8rwpk                                           poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.7      133d
tkc-using-cci-ui-control-plane-z8fkt                              poweredOn    best-effort-small   ob-20953521-tkgs-ova-photon-3-v1.23.8---vmware.3-tkg.1      172.29.13.130   37d
tkc-using-cci-ui-tkg-cluster-nodepool-9nf6-n6nt5-b97c86fb45mvgj   poweredOn    best-effort-small   ob-20953521-tkgs-ova-photon-3-v1.23.8---vmware.3-tkg.1      172.29.13.131   37d
tkc-workers-zbrnv-6c98dd84f9-52gn6                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.6      133d
tkc-workers-zbrnv-6c98dd84f9-d9mm7                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.8      133d
tkc-workers-zbrnv-6c98dd84f9-kk2dg                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.3      133d
❯
❯ k exec -it jumpbox-tkc -n vineetha-dns1-test -- /usr/bin/ssh vmware-system-user@172.29.0.7
The authenticity of host '172.29.0.7 (172.29.0.7)' can't be established.
ECDSA key fingerprint is SHA256:B7ptmYm617lFzLErJm7G5IdT7y4SJYKhX/OenSgguv8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.29.0.7' (ECDSA) to the list of known hosts.
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
 13:06:06 up 133 days,  4:46,  0 users,  load average: 0.23, 0.33, 0.27

36 Security notice(s)
Run 'tdnf updateinfo info' to see the details.

vmware-system-user@tkc-control-plane-8rwpk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]#
Hope it was useful. Cheers!
Saturday, May 20, 2023
vSphere with Tanzu using NSX-T - Part25 - Spherelet
Following are the steps to verify the status of the spherelet service and restart it if required.
Example:
❯ kubectx wdc-01-vcxx
Switched to context "wdc-01-vcxx".
❯ kubectl get node
NAME                               STATUS                        ROLES                  AGE    VERSION
42019f7e751b2818bb0c659028d49fdc   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
4201b0b21aed78d8e72bfb622bb8b98b   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
4201c53dcef2701a8c36463942d762dc   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
wdc-01-rxxesx04.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx05.xxxxxxxxx.com      NotReady,SchedulingDisabled   agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx06.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx32.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx33.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx34.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx35.xxxxxxxxx.com      Ready,SchedulingDisabled      agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx36.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx37.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx38.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx39.xxxxxxxxx.com      NotReady,SchedulingDisabled   agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx40.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
Logs
- ssh into the ESXi worker node.
tail -f /var/log/spherelet.log
Status
- ssh into the ESXi worker node and run the following:
/etc/init.d/spherelet status
- You can check status of spherelet using PowerCLI. Following is an example:
> Connect-VIServer wdc-10-vcxx

> Get-VMHost | Get-VMHostService | where {$_.Key -eq "spherelet"} | select VMHost,Key,Running | ft

VMHost                        Key       Running
------                        ---       -------
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
Restart
- ssh into the ESXi worker node and run the following:
/etc/init.d/spherelet restart
- You can also restart spherelet service using PowerCLI. Following is an example to restart spherelet service on ALL the ESXi worker nodes of a cluster:
> Get-Cluster

Name            HAEnabled  HAFailover DrsEnabled DrsAutomationLevel
                           Level
----            ---------  ---------- ---------- ------------------
wdc-10-vcxxc01  True       1          True       FullyAutomated

> Get-Cluster -Name wdc-10-vcxxc01 | Get-VMHost | foreach { Restart-VMHostService -HostService ($_ | Get-VMHostService | where {$_.Key -eq "spherelet"}) }
Certificates
- /etc/vmware/spherelet/spherelet.crt
- /etc/vmware/spherelet/client.crt
❯ kg no
NAME STATUS ROLES AGE VERSION
420802008ec0d8ccaa6ac84140768375 Ready control-plane,master 70d v1.22.6+vmware.wcp.2
42087a63440b500de6cec759bb5900bf Ready control-plane,master 77d v1.22.6+vmware.wcp.2
4208e08c826dfe283c726bc573109dbb Ready control-plane,master 77d v1.22.6+vmware.wcp.2
wdc-08-rxxesx25.xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx26.xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx23.xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx24.xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
[root@wdc-08-rxxesx25:~] openssl x509 -enddate -noout -in /etc/vmware/spherelet/spherelet.crt
notAfter=Sep 1 08:32:24 2023 GMT
[root@wdc-08-rxxesx25:~] openssl x509 -enddate -noout -in /etc/vmware/spherelet/client.crt
notAfter=Sep 1 08:32:24 2023 GMT
Verify
❯ kubectl get node
NAME                               STATUS   ROLES                  AGE     VERSION
42017dcb669bea2962da27fc2f6c16d2   Ready    control-plane,master   5d20h   v1.23.12+vmware.wcp.1
4201b763c766875b77bcb9f04f8840b3   Ready    control-plane,master   5d21h   v1.23.12+vmware.wcp.1
4201dab068e9b2d3af3b8fde450b3d96   Ready    control-plane,master   5d20h   v1.23.12+vmware.wcp.1
wdc-01-rxxesx04.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx05.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx06.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx32.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx33.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx34.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx35.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx36.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx37.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx38.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx39.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx40.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
Sunday, May 7, 2023
Kubernetes 101 - Part9 - kubeconfig certificate expiration
You can verify the expiration date of the client certificate embedded in the kubeconfig of the current context as follows:
kubectl config view --minify --raw --output 'jsonpath={..user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate
❯ k config current-context
sc2-01-vcxx
❯
❯ kubectl config view --minify --raw --output 'jsonpath={..user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate
notAfter=Sep 6 05:13:47 2023 GMT
❯
❯ date
Thu Sep 7 18:05:52 IST 2023
❯
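To check every context in your kubeconfig at once, a small loop works too (a sketch; assumes each context authenticates with a client certificate):

for ctx in $(kubectl config get-contexts -o name); do
  echo -n "$ctx: "
  kubectl config view --minify --raw --context="$ctx" \
    --output 'jsonpath={..user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate
done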
Hope it was useful. Cheers!
Saturday, April 15, 2023
Kubernetes 101 - Part8 - Filter events of a specific object
You can filter events of a specific object as follows:
k get event --field-selector involvedObject.name=<object name> -n <namespace>
➜ k get pods
NAME                    READY   STATUS             RESTARTS   AGE
new-replica-set-rx7vk   0/1     ImagePullBackOff   0          101s
new-replica-set-gsxxx   0/1     ImagePullBackOff   0          101s
new-replica-set-j6xcp   0/1     ImagePullBackOff   0          101s
new-replica-set-q8jz5   0/1     ErrImagePull       0          101s

➜ k get event --field-selector involvedObject.name=new-replica-set-q8jz5 -n default
LAST SEEN   TYPE      REASON      OBJECT                      MESSAGE
3m53s       Normal    Scheduled   pod/new-replica-set-q8jz5   Successfully assigned default/new-replica-set-q8jz5 to controlplane
2m33s       Normal    Pulling     pod/new-replica-set-q8jz5   Pulling image "busybox777"
2m33s       Warning   Failed      pod/new-replica-set-q8jz5   Failed to pull image "busybox777": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/busybox777:latest": failed to resolve reference "docker.io/library/busybox777:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
2m33s       Warning   Failed      pod/new-replica-set-q8jz5   Error: ErrImagePull
2m3s        Warning   Failed      pod/new-replica-set-q8jz5   Error: ImagePullBackOff
110s        Normal    BackOff     pod/new-replica-set-q8jz5   Back-off pulling image "busybox777"
Hope it was useful. Cheers!
Saturday, April 8, 2023
vSphere with Tanzu using NSX-T - Part24 - Kubernetes component certs in TKC
The Kubernetes component certificates inside a TKC (Tanzu Kubernetes Cluster) have a lifetime of one year. If you manage to upgrade your TKC at least once a year, these certs will get rotated automatically.
IMPORTANT NOTES:
- As per this VMware KB, if TKGS Guest Cluster certificates are expired, you will need to engage VMware support to manually rotate them.
- The following troubleshooting steps and workaround are based on studies conducted on my dev/test/lab setup, and I do NOT recommend following them on a production environment.
Symptom:
❯ KUBECONFIG=tkc.kubeconfig kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid
Troubleshooting:
- Verify the certificate expiry of the tkc kubeconfig file itself.
❯ grep client-certificate-data tkc.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar 8 18:10:15 2022 GMT
notAfter=Mar 7 18:26:10 2024 GMT
- Create a jumpbox pod and ssh to TKC control plane nodes.
- Verify system pods and check logs from apiserver and etcd pods. Sample etcd pod logs are given below:
2023-04-11 07:09:00.268792 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268835 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268841 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.268869 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.310030 I | embed: rejected connection from "172.31.20.27:35362" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.312806 I | embed: rejected connection from "172.31.20.27:35366" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.321449 I | embed: rejected connection from "172.31.20.19:35034" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.322192 I | embed: rejected connection from "172.31.20.19:35036" (error "remote error: tls: bad certificate", ServerName "")
- Verify whether admin.conf inside the control plane node has expired.
root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar 8 18:10:15 2022 GMT
notAfter=Apr 6 06:05:46 2023 GMT
- Verify Kubernetes component certs in all the control plane nodes.
root [ /etc/kubernetes ]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 06, 2023 06:05 UTC   <invalid>                               no
apiserver                  Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 06, 2023 06:05 UTC   <invalid>                               no
etcd-healthcheck-client    Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 06, 2023 06:05 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 06, 2023 06:05 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 05, 2032 18:15 UTC   8y              no
etcd-ca                 Mar 05, 2032 18:15 UTC   8y              no
front-proxy-ca          Mar 05, 2032 18:15 UTC   8y              no
Workaround:
- Renew the Kubernetes component certs on the control plane nodes, if expired, using kubeadm certs renew all.
root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
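As the output notes, the control plane components must be restarted to pick up the new certificates. One common approach on kubeadm-based nodes (a sketch; it assumes the default static pod manifest path, so verify it on your setup and do one control plane node at a time) is to move the static pod manifests away briefly so the kubelet stops the pods, then move them back:

# on each control plane node, as root
cd /etc/kubernetes
mkdir -p /tmp/manifests-backup
mv manifests/kube-apiserver.yaml manifests/kube-controller-manager.yaml manifests/kube-scheduler.yaml manifests/etcd.yaml /tmp/manifests-backup/
sleep 30   # give the kubelet time to stop the static pods
mv /tmp/manifests-backup/*.yaml manifests/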
Verify:
- Verify using the following steps on all the TKC control plane nodes.
root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
root [ /etc/kubernetes ]# kubeadm certs check-expiration
- Try connecting to the TKC using tkc.kubeconfig.
KUBECONFIG=tkc.kubeconfig kubectl get node
Saturday, March 18, 2023
Kubernetes 101 - Part7 - Restart all deployments and daemonsets in a namespace
Restart all deployments in a namespace
❯ kubectl rollout restart deployments -n <namespace>
Restart all daemonsets in a namespace
❯ kubectl rollout restart daemonsets -n <namespace>
Hope it was useful. Cheers!
Friday, March 10, 2023
Kubernetes 101 - Part6 - Get static pods
Static pods are directly managed by the kubelet on a specific node. More about static pods can be found here: https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/
In this post we will take a look at how to find all static pods in a Kubernetes cluster. For a static pod the owner reference kind will be Node.
custom-columns:
❯ kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,CONTROLLER:'.metadata.ownerReferences[].kind',NAMESPACE:.metadata.namespace | grep Node
❯
❯ kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,CONTROLLER:'.metadata.ownerReferences[].kind',NAMESPACE:.metadata.namespace | grep Node | wc -l

jsonpath:
❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}'
❯
❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}' | jq
❯
❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}' | jq | grep Node | wc -l

Hope it was useful. Cheers!
Sunday, September 11, 2022
vSphere with Tanzu using NSX-T - Part19 - Troubleshooting TKC stuck at creating phase
This article provides basic troubleshooting steps for TKCs (Tanzu Kubernetes Cluster) stuck at creating phase.
Verify status of the TKC
- Use the following commands to verify the TKC status; an events check follows the list.
kubectl get tkc -n <supervisor_namespace>
kubectl get tkc -n <supervisor_namespace> -o json
kubectl describe tkc <tkc_name> -n <supervisor_namespace>
kubectl get cluster-api -n <supervisor_namespace>
kubectl get vm,machine,wcpmachine -n <supervisor_namespace>
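While the TKC is creating, it can also help to watch recent events in the supervisor namespace (standard kubectl; sorted oldest to newest):

kubectl get events -n <supervisor_namespace> --sort-by=.metadata.creationTimestamp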
Cluster health
- Verify health of the supervisor cluster.
❯ kubectl get node
NAME STATUS ROLES AGE VERSION
4201a7b2667b0f3b021efcf7c9d1726b Ready control-plane,master 86d v1.22.6+vmware.wcp.2
4201bead67e21a8813415642267cd54a Ready control-plane,master 86d v1.22.6+vmware.wcp.2
4201e0e8e29b0ddb4b59d3165dd40941 Ready control-plane,master 86d v1.22.6+vmware.wcp.2
wxx-08-r02esx13.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx14.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx15.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx16.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx17.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx18.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx19.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx20.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx21.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx22.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx23.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx24.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
❯
❯ kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
Terminating namespaces
- Check for namespaces stuck at terminating phase. If there are any, properly clean them up by removing all child objects.
- You can use this kubectl get-all plugin to see all resources under a namespace, and then clean them up properly. Mostly you need to set the finalizers of the remaining child resources to null. Following is a sample case where 2 PVCs were stuck at terminating and were cleaned up by setting their finalizers to null.
❯ kg ns | grep Terminating
rgettam-gettam Terminating 226d
❯
❯ k get-all -n rgettam-gettam
NAME NAMESPACE AGE
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d rgettam-gettam 86d
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f rgettam-gettam 86d
❯
❯ kg pvc -n rgettam-gettam
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d Terminating pvc-bd4252fb-bfed-4ef3-ab5a-43718f9cbed5 8Gi RWO sxx-01-vcxx-wcp-mgmt 86d
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f Terminating pvc-8bc9daa1-21cf-4af2-973e-af28d66a7f5e 30Gi RWO sxx-01-vcxx-wcp-mgmt 86d
❯
❯ kg pvc -n rgettam-gettam --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n rgettam-gettam pvc {} -p '{"metadata":{"finalizers": null}}'
- You can also do kubectl get namespace <namespace> -oyaml and the status section will show if there are resources/ content to be deleted or any finalizers remaining.
- Verify vmop-controller pod logs, and restart them if required.
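A minimal sketch for that check (the deployment and container names below match the vmop controller manager seen later in this post, but may vary between releases):

kubectl logs deployment/vmware-system-vmop-controller-manager -c manager -n vmware-system-vmop
kubectl delete pod --all -n vmware-system-vmop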
IP_BLOCK_EXHAUSTED
- Check CIDR usage of the supervisor cluster.
❯ kg clusternetworkinfos
NAME                                                AGE
domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4   160d
❯
❯ kg clusternetworkinfos domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4 -o json | jq .usage
{
  "egressCIDRUsage": {
    "allocated": 33,
    "total": 1024
  },
  "ingressCIDRUsage": {
    "allocated": 42,
    "total": 1024
  },
  "subnetCIDRUsage": {
    "allocated": 832,
    "total": 1024
  }
}
- When the IP blocks of supervisor cluster are exhausted, you will find the following warning when you describe the TKC.
Conditions:
  Last Transition Time:  2022-10-05T18:34:35Z
  Message:               Cannot realize subnet
  Reason:                ClusterNetworkProvisionFailed
  Severity:              Warning
  Status:                False
  Type:                  Ready
- Also when you check the namespace, you can see the following ncp error IP_BLOCK_EXHAUSTED.
❯ kg ns tsql-integration-test -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    calaxxxx.xxxyy.com/xxxrole-created: "1"
    ncp/error: IP_BLOCK_EXHAUSTED
    ncp/router_id: t1_d0a2af0f-8430-4250-9fcf-807a4afe51aa_rtr
    vmware-system-resource-pool: resgroup-307480
    vmware-system-vm-folder: group-v307481
  creationTimestamp: "2022-10-05T17:35:18Z"
Notes:
- If the subnetCIDRUsage IP block is exhausted, you may need to remove some old/unused namespaces to release IPs. If that is not possible, you may need to consider adding a new subnet.
- After removing the old/unused namespaces, even if IPs are available, sometimes the TKCs remain stuck at the creating phase! In that case, check the ncp, vmop, and capw controller pods and restart them if required (see the sketch below). What I observed is that usually after restarting the ncp pod, the vmop-controller pods, and all pods under the vmware-system-capw namespace, the VMs start getting deployed and the TKC creation progresses and completes successfully.
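A sketch of that restart sequence (the deleted pods are recreated automatically by their owning controllers):

kubectl delete pod --all --namespace=vmware-system-nsx
kubectl delete pod --all --namespace=vmware-system-vmop
kubectl delete pod --all --namespace=vmware-system-capw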
Resource availability
- Check whether there are enough resources available in the cluster.
LAST SEEN   TYPE      REASON             OBJECT                                        MESSAGE
3m23s       Warning   UpdateFailure      virtualmachine/magna3-control-plane-9rhl4     The host does not have sufficient CPU resources to satisfy the reservation.
80s         Warning   ReconcileFailure   wcpmachine/magna3-control-plane-s5s9t-p2cxj   vm is not yet powered on: vmware-system-capw-controller-manager/WCPMachine//chakravartha-magna3/magna3/magna3-control-plane-s5s9t-p2cxj
- Check for resource limits applied to the namespace.
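You can review those limits with standard kubectl (a sketch):

kubectl describe resourcequota -n <supervisor_namespace>
kubectl describe limitrange -n <supervisor_namespace>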
Check whether storage policy is assigned to the namespace
27m Warning ReconcileFailure wcpmachine/gc-pool-0-cv8vz-5snbc admission webhook "default.validating.virtualmachine.vmoperator.xxxyy.com" denied the request: StorageClass wdc-10-vc21c01-wcp-pod is not assigned to any ResourceQuotas in namespace mpereiramaia-demo2
- In this case, the storage policy wasn't assigned to the namespace. I assigned the storage policy wdc-10-vc21c01-wcp-pod to the respective namespace, and the TKC deployment was successful.
Check whether the Content Library can sync properly
- Sometimes issues related to CL can cause TKCs to get stuck at creating phase! Check this blog post for more details.
KCP can't remediate
Message: KCP can't remediate if current replicas are less or equal then 1
Reason: WaitingForRemediation @ Machine/gc-control-plane-zpssc
Severity: Warning
- In this case, you can just edit the TKC spec, change the control plane vmclass to a different class, and save. Once the deployment is complete and the TKC is running, edit the TKC spec again and revert the vmclass you modified earlier to its original class. This process will re-provision the control plane (a sketch follows).
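A sketch of that edit (note: the field name depends on the TKC API version, class under spec.topology.controlPlane in v1alpha1 and vmClass in v1alpha2, and best-effort-medium below is just an example class):

kubectl edit tkc <tkc_name> -n <supervisor_namespace>
# in the editor, change the control plane VM class, for example:
#   topology:
#     controlPlane:
#       class: best-effort-medium
# save, wait for the control plane to re-provision and the TKC to reach running,
# then edit again and revert the class to its original value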
TKC VMs waiting for IP
- In this case, take a look at NSXT and check whether all Edge nodes are healthy. If there are mismatch errors, resolve them.
- You may also check ncp pod logs and restart ncp pod if required.
VirtualMachineClassBindingNotFound
Conditions:
  Last Transition Time:  2021-05-05T18:19:10Z
  Message:               1 of 2 completed
  Reason:                VirtualMachineClassBindingNotFound @ Machine/tkc-dev-control-plane-wxd57
  Severity:              Error
  Status:                False

Message: 0/1 Control Plane Node(s) healthy. 0/2 Worker Node(s) healthy

Events:
  Normal  PhaseChanged  7m22s  vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller  cluster changes from creating phase to failed phase
- This happens when the virtualmachineclassbindings are missing; it can be resolved by adding all required VM Classes to the namespace using the vSphere Client. Following are the steps to add VM Classes to a namespace (a kubectl check to confirm the bindings follows the list):
- Log into vCenter web UI
- From Hosts and Clusters > Select the namespace > Summary tab > VM Service tile > Click Manage VM Classes
- Select all required VM Classes and click OK
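You can also confirm the bindings from kubectl (a sketch; the resource lives in the supervisor namespace):

kubectl get virtualmachineclassbindings -n <supervisor_namespace>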
Verify NSX-T objects
- Issues at the NSX-T side can also cause the TKC to be stuck at creating phase. Following is a sample case and you can see these logs when you describe the TKC:
Message: 2 errors occurred:
* failed to configure DNS for /, Kind= namespace-test-01/gc: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found
* failed to configure kube-proxy for /, Kind= namespace-test-01/gc: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found
- In this case, there were some issues with the virtual servers in the load balancer. Some stale entries of virtual servers were still present, their IPs didn't get removed properly, and this was causing intermittent connectivity issues to some of the other services of type LoadBalancer. New TKC deployments within that affected namespace also got stuck due to this. In our case we deleted the affected namespace and recreated it; that cleaned up all the stale virtual server entries in the load balancer, and new TKC deployments were successful. So it is worth checking the health and status of NSX-T objects in case you have TKC deployment issues.
Check for broken TKCs in the cluster
- Sometimes TKC deployments are very slow and take more than 30 minutes. In such cases, you may notice that the first control plane VM gets deployed only 30-45 minutes after the TKC creation has started. Look at the vmop-controller logs. Following is a sample log:
❯ kail -n vmware-system-vmop
vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:44.725620 1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.9.212:6443: connect: connection refused" "vmName"="ciroscosta-cartographer/kontinue-control-plane-svlk4" "result"=-1
vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:49.888653 1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.2.66:6443: connect: connection refused" "vmName"="whaozhe-platform/gc-control-plane-mf4p5" "result"=-1
- In the above case, two of the TKCs were broken/stuck at the updating phase and we were unable to connect to their control planes.

ciroscosta-cartographer   kontinue   updating   2021-10-29T18:47:46Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1   2
whaozhe-platform          gc         updating   2022-01-27T03:59:31Z   v1.20.12+vmware.1-tkg.1.b9a42f3   1   10
- After removing the namespaces with broken TKCs, new deployments were completing successfully.
Restart system pods
- Sometimes restarting some of the system controller pods resolves the issue. I usually delete all the pods in the following namespaces, and they get restarted in a few seconds.
k delete pod --all --namespace=vmware-system-vmop
k delete pod --all --namespace=vmware-system-capw
k delete pod --all --namespace=vmware-system-tkg
k delete pod --all --namespace=vmware-system-csi
k delete pod --all --namespace=vmware-system-nsx
Saturday, July 30, 2022
vSphere with Tanzu using NSX-T - Part17 - Troubleshooting TKCs stuck at updating phase
Ideally, if everything goes well, the TKCs (Tanzu Kubernetes Cluster, aka Guest Cluster) should be in the running phase. But sometimes, due to several reasons, they may get stuck at the updating phase. In this article, we will take a sample case and look at troubleshooting/fixing it.
Following is an example:
NAMESPACE NAME PHASE CREATIONTIME VERSION CP WORKER
karvea-vc17ns11 sc201vc17pace updating 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4
❯ k gckc karvea-vc17ns11 sc201vc17pace
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw Ready,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz poweredOn 189d
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw poweredOn 189d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz vsphere://42010982-8b25-ad7b-2a1d-bb949def4834 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz", aborting command...
There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
cannot delete Pods with local storage (use --delete-emptydir-data to override): nsxi-platform/kafka-2
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod nsxi-platform/kafka-2
error when evicting pods/"kafka-2" -n "nsxi-platform" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
❯ gcc kg pdb
No resources found in default namespace.
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform kafka N/A 1 0 188d
nsxi-platform zookeeper N/A 1 1 188d
❯ gcc kg pdb -n nsxi-platform kafka -oyaml > pdb-nsxi-platform-kafka.yaml
❯ code pdb-nsxi-platform-kafka.yaml
❯ gcc kg pdb -n nsxi-platform zookeeper -oyaml > pdb-nsxi-platform-zookeeper.yaml
❯ code pdb-nsxi-platform-zookeeper.yaml
❯ gcc k delete pdb kafka -n nsxi-platform
poddisruptionbudget.policy "kafka" deleted
❯ gcc kg pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nsxi-platform zookeeper N/A 1 1 188d
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
evicting pod nsxi-platform/kafka-2
pod/kafka-2 evicted
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz evicted
❯
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz --ignore-daemonsets --delete-emptydir-data
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-wqlmq, kube-system/kube-proxy-78z5k, nsxi-platform/nsxi-platform-fluent-bit-pdzjx, projectcontour/projectcontour-envoy-r9pg7, vmware-system-csi/vsphere-csi-node-p2gtd
node/sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz drained
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw", aborting command...
There are pending nodes to be drained:
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
❯ gcc k drain sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw --ignore-daemonsets
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4tz4x, kube-system/kube-proxy-q726d, nsxi-platform/nsxi-platform-fluent-bit-b24nn, projectcontour/projectcontour-envoy-rppkx, vmware-system-csi/vsphere-csi-node-mpbsh
node/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw drained
The worker nodes are now drained. As soon as they were drained, one of them got successfully removed/deleted, but the other worker node was still present. Looking at the machine resources, you can still see that one of the worker machines is stuck at the Deleting phase. In this case I manually deleted the worker node, but the corresponding worker machine remained stuck at the Deleting phase.

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-pn6vz NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw NotReady,SchedulingDisabled <none> 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1
❯ gcc k delete node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw" deleted
❯
❯ gcc kg no
NAME STATUS ROLES AGE VERSION
sc201vc17pace-control-plane-zt99l Ready control-plane,master 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 Ready <none> 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv Ready <none> 139d v1.20.9+vmware.1
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw sc201vc17pace sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw vsphere://4201a640-2b39-3d66-5a26-db95a612f6e5 Deleting 189d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1
❯ kg vm -n karvea-vc17ns11
NAME POWERSTATE AGE
sc201vc17pace-control-plane-zt99l poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 poweredOn 139d
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv poweredOn 139d
❯ kd machine sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw -n karvea-vc17ns11
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DetectedUnhealthy 13m (x2 over 17m) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has unhealthy node sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw
Normal SuccessfulDrainNode 13m (x2 over 19m) machine-controller success draining Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal NodeVolumesDetached 12m (x2 over 19m) machine-controller success waiting for node volumes detach Machine's node "sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw"
Normal MachineMarkedUnhealthy 106s (x4 over 9m58s) machinehealthcheck-controller Machine karvea-vc17ns11/sc201vc17pace-workers-jrcb6/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw/sc201vc17pace-workers-jrcb6-5c7d9548f-w64lw has been marked as unhealthy
❯
❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-755m6-containerd Terminating pvc-da2e4866-bb41-4f74-a4b7-0f74bc7061a1 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-dgmjs-containerd Terminating pvc-64eac528-f160-444c-9a0f-0ed9f6393e06 42Gi RWO sc2-01-vc17c01-wcp-mgmt 189d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
❯
❯ kg pvc -n karvea-vc17ns11
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
a366a76b-2000-4d33-a817-a9c1b9e60b1b-1f4b5ee8-f378-445e-97d3-f4c4656863bb Bound pvc-1dc35d76-86c6-4a70-82e7-99609480a0b3 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-3509d39d-e632-492b-a0c4-b5b3874b01a6 Bound pvc-97e6e063-9a9e-4837-9999-284523379453 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-42a0f98e-0f9c-4fc1-bc9f-862e94086624 Bound pvc-be6bd318-140c-4cb8-9c22-daf9ec8dac65 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-48b9ddc4-41bc-4228-a6b5-0aea3a470811 Bound pvc-faa7798e-c045-420f-9d09-44674d9d2326 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-8c880e33-681a-4eae-a57d-3aaf0fb9c950 Bound pvc-cf1a6c2e-0e9e-425c-ae46-b010b086c325 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-aa196378-d10f-45ed-a528-b0d691ec6447 Bound pvc-49fca2f0-3402-429f-884f-7db9012934d6 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bbe074ee-9ba3-4839-b519-af82214a9ad0 Bound pvc-3887e89c-0a5b-4d08-938b-c9cb0a1efaca 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-bfb23073-29e8-4f0d-b2c0-934ff808ad2c Bound pvc-f966f803-ca92-45b6-9395-8d1d24c67f8e 10Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-d39e8f9b-692e-46ac-a52c-2d977f0a95fa Bound pvc-25d7c8c2-7994-4ee8-9ef8-725ae1c8c8a1 8Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-ef1e2362-83bc-4af4-b748-a496aa911009 Bound pvc-7aefd3fe-3279-4e20-8a00-5ca60cc61e40 128Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
a366a76b-2000-4d33-a817-a9c1b9e60b1b-f072ee1b-034a-4ac8-965c-f66a2d8bd61c Bound pvc-276acbee-ba6c-4cc9-8bc5-e18525abd256 20Gi RWO sc2-01-vc17c01-wcp-mgmt 188d
sc201vc17pace-workers-wswdh-2hz8w-containerd Bound pvc-e67e3a6f-99d6-4e21-813d-e9c9994b25d6 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-5pjrc-containerd Bound pvc-fb162388-4347-4f48-825e-c2c2d62ceb90 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-djp2m-containerd Bound pvc-a7542552-de13-4670-ac45-84ed39c3c916 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
sc201vc17pace-workers-wswdh-flwtt-containerd Bound pvc-1b8ee843-709a-4e2a-955d-a9a9a6a83c73 42Gi RWO sc2-01-vc17c01-wcp-mgmt 139d
❯ kg machine -n karvea-vc17ns11
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sc201vc17pace-control-plane-zt99l sc201vc17pace sc201vc17pace-control-plane-zt99l vsphere://4201e660-3124-9aa5-4ec2-6fbc2ff3ecea Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-gxmtt vsphere://42013a9b-dffb-4609-89d6-4ca123c4dc1e Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-j4wvp vsphere://4201160b-21c9-ccc2-6826-e3545e34b490 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-l2dq5 vsphere://420125a8-e45c-04b7-5612-ce3149e86d74 Running 139d v1.20.9+vmware.1
sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv sc201vc17pace sc201vc17pace-workers-jrcb6-85c4844f6c-xqlkv vsphere://4201238f-c9a3-a9b2-9c31-4ed99318bd30 Running 139d v1.20.9+vmware.1
❯ kgtkca | grep karvea
karvea-vc17ns11 sc201vc17pace running 2021-11-19T12:17:24Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4
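One follow-up note: since the kafka PDB manifest was saved before deleting it, it can be restored once the maintenance is complete. You may need to strip the server-generated fields (resourceVersion, uid, creationTimestamp, status) from the saved yaml first (a sketch, using the same gcc alias for the guest cluster context):

# edit pdb-nsxi-platform-kafka.yaml to remove resourceVersion, uid, creationTimestamp, and status
gcc k apply -f pdb-nsxi-platform-kafka.yaml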