Showing posts with label TKC. Show all posts
Showing posts with label TKC. Show all posts

Saturday, June 22, 2024

vSphere with Tanzu using NSX-T - Part31 - Troubleshooting inaccessible TKC with expired control plane certs

In the course of managing multiple Tanzu Kubernetes Clusters (TKC), I encountered an unexpected issue: the control plane certificates had expired, preventing us from accessing the cluster using the kubeconfig file. To make matters worse, we were unable to SSH into the TKC control plane Virtual Machines (VMs) due to the vmware-system-user password expiring in accordance with STIG Hardening.

The recommended workaround for updating the vmware-system-user password expiry involves applying a specific daemonset on Guest Clusters. However, this approach requires access to the TKC using its admin kubeconfig file, which was unavailable due to the expired certificates.

Warning: In case of critical production issues that affect the accessibility of your Tanzu Kubernetes Cluster (TKC), it is strongly advised to submit a product support request to our team for assistance. This will ensure that you receive expert guidance and a timely resolution to help minimize the impact on your environment.

To resolve this issue, I followed an alternative workaround: I reset the root password of the TKC control plane VMs through the vCenter VM console, as outlined in this knowledge base article. Once the root password was reset, I was able to log directly into the TKC control plane VM using the VM console.




After gaining access to the TKC control plane VM, I proceeded to renew the control plane certificates using kubeadm, as detailed in this blog post. It's essential to apply this process to all control plane nodes in your cluster to ensure proper functionality.

root [ /etc/kubernetes ]# kubeadm certs check-expiration

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Although this workaround required some additional steps, it ultimately allowed us to regain access to our Tanzu Kubernetes Cluster and maintain its security and functionality.

Hope it was useful. Cheers!

Saturday, May 25, 2024

vSphere with Tanzu using NSX-T - Part30 - Troubleshooting inaccessible TKC with server pool members missing in the LB VS

Encountering issues with connectivity to your TKC apiserver/ control plane can be frustrating. One common problem we've seen is the kubeconfig failing to connect, often due to missing server pool members in the load balancer's virtual server (LB VS).

The Issue

The LB VS, which operates on port 6443, should have the control plane VMs listed as its member servers. When these members are missing, connectivity problems arise, disrupting your access to the TKC apiserver.

Troubleshooting steps

  1. Access the TKC: Use the kubeconfig to access the TKC.
    ❯ KUBECONFIG=tkc.kubeconfig kubectl get node
    Unable to connect to the server: dial tcp 10.191.88.4:6443: i/o timeout
    
    
  2. Check the Load Balancer: In NSX-T, verify the status of the corresponding load balancer (LB). It may display a green status indicating success.
  3. Inspect Virtual Servers: Check the virtual servers in the LB, particularly on port 6443. They might show as down.
  4. Examine Server Pool Members: Look into the server pool members of the virtual server. You may find it empty.
  5. SSH to Control Plane Nodes: Attempt to SSH into the TKC control plane nodes.
  6. Run Diagnostic Commands: Execute diagnostic commands inside the control plane nodes to verify their status. The issue could be that the control plane VMs are in a hung state, and the container runtime is not running.
    vmware-system-user@tkc-infra-r68zc-jmq4j [ ~ ]$ sudo su
    root [ /home/vmware-system-user ]# crictl ps
    FATA[0002] failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded
    root [ /home/vmware-system-user ]#
    root [ /home/vmware-system-user ]# systemctl is-active containerd
    Failed to retrieve unit state: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
    root [ /home/vmware-system-user ]#
    root [ /home/vmware-system-user ]# systemctl status containerd
    WARNING: terminal is not fully functional
    -  (press RETURN)Failed to get properties: Failed to activate service 'org.freedesktop.systemd1'>
    lines 1-1/1 (END)lines 1-1/1 (END)
    
  7. Check VM Console: From vCenter, check the console of the control plane VMs. You might see specific errors indicating issues.
    EXT4-fs (sda3): Delayed block allocation failed for inode 266704 at logical offset 10515 with max blocks 2 with error 5
    EXT4-fs (sda3): This should not happen!! Data will be lost
    EXT4-fs error (device sda3) in ext4_writepages:2905: IO failure
    EXT4-fs error (device sda3) in ext4_reserve_inode_write:5947: Journal has aborted
    EXT4-fs error (device sda3) xxxxxx-xxx-xxxx: unable to read itable block
    EXT4-fs error (device sda3) in ext4_journal_check_start:61: Detected aborted journal
    systemd[1]: Caught <BUS>, dumped core as pid 24777.
    systemd[1]: Freezing execution.
    
  8. Restart Control Plane VMs: Restart the control plane VMs. Note that sometimes your admin credentials or administrator@vsphere.local credentials may not allow you to restart the TKC VMs. In such cases, decode the username and password from the relevant secret and use these credentials to connect to vCenter and restart the hung TKC VMs.
    ❯ kubectx wdc-01-vc17
    Switched to context "wdc-01-vc17".
    
    ❯ kg secret -A | grep wcp
    kube-system                                 wcp-authproxy-client-secret                                               kubernetes.io/tls                                  3      291d
    kube-system                                 wcp-authproxy-root-ca-secret                                              kubernetes.io/tls                                  3      291d
    kube-system                                 wcp-cluster-credentials                                                   Opaque                                             2      291d
    vmware-system-nsop                          wcp-nsop-sa-vc-auth                                                       Opaque                                             2      291d
    vmware-system-nsx                           wcp-cluster-credentials                                                   Opaque                                             2      291d
    vmware-system-vmop                          wcp-vmop-sa-vc-auth                                                       Opaque                                             2      291d
    
    ❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth
    NAME                  TYPE     DATA   AGE
    wcp-vmop-sa-vc-auth   Opaque   2      291d
    ❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth -oyaml
    apiVersion: v1
    data:
      password: aWAmbHUwPCpKe1Uxxxxxxxxxxxx=
      username: d2NwLXZtb3AtdXNlci1kb21haW4tYzEwMDYtMxxxxxxxxxxxxxxxxxxxxxxxxQHZzcGhlcmUubG9jYWw=
    kind: Secret
    metadata:
      creationTimestamp: "2022-10-24T08:32:26Z"
      name: wcp-vmop-sa-vc-auth
      namespace: vmware-system-vmop
      resourceVersion: "336557268"
      uid: dcbdac1b-18bb-438c-ba11-76ed4d6bef63
    type: Opaque
    
    
    ***Decrypt the username and password from the secret and use it to connect to the vCenter.
    ***Following is an example using PowerCLI:
    
    PS /Users/vineetha> get-vm gc-control-plane-f266h
    
    Name                 PowerState Num CPUs MemoryGB
    ----                 ---------- -------- --------
    gc-control-plane-f2… PoweredOn  2        4.000
    
    PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VMGuest
    Restart-VMGuest: 08/04/2023 22:20:20	Restart-VMGuest		Operation "Restart VM guest" failed for VM "gc-control-plane-f266h" for the following reason: A general system error occurred: Invalid fault
    PS /Users/vineetha>
    PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VM
    
    Confirm
    Are you sure you want to perform this action?
    Performing the operation "Restart-VM" on target "VM 'gc-control-plane-f266h'".
    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y
    
    Name                 PowerState Num CPUs MemoryGB
    ----                 ---------- -------- --------
    gc-control-plane-f2… PoweredOn  2        4.000
    
    PS /Users/vineetha>
    
  9. Verify System Pods and Connectivity: Once the control plane VMs are restarted, the system pods inside them will start, and the apiserver will become accessible using the kubeconfig. You should also see the previously missing server pool members reappear in the corresponding LB virtual server, and the virtual server on port 6443 will be up and show a success status.

Following these steps should help you resolve the connectivity issues with your TKC apiserver/control plane effectively.Ensuring that your load balancer's virtual server is correctly configured with the appropriate member servers is crucial for maintaining seamless access. This runbook aims to guide you through the process, helping you get your TKC apiserver back online swiftly.

Note: If required for critical production issues related to TKC accessibility I strongly recommend to raise a product support request.

Hope it was useful. Cheers!

Saturday, August 5, 2023

vSphere with Tanzu using NSX-T - Part28 - Create a custom VM Class

A VM class is a template that defines CPU, memory, and reservations for VMs. If you want to create a custom vmclass you can use dcli or vSphere UI. 

Following is an example using dcli:

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id best-effort-16xlarge --cpu-count 64 --memory-mb 131072

This will create a vmclass with 64 vCPUs and 128GB memory with no reservations.

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id guaranteed-16xlarge --cpu-count 64 --memory-mb 131072 --cpu-reservation 100 --memory-reservation 100

This will create a vmclass with 64 vCPUs and 128GB memory with 100% reservations.

Note: You will need to attach this newly created vmclass to a supervisor namespace to use it.

Here is the documentation reference: https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-with-tanzu-services-workloads/GUID-18C7B2E3-BCF5-488C-9C50-937E29BB0C48.html

Hope it was useful. Cheers!

Friday, June 9, 2023

vSphere with Tanzu using NSX-T - Part26 - Jumpbox kubectl plugin to SSH to TKC node

For troubleshooting TKC (Tanzu Kubernetes Cluster) you may need to ssh into the TKC nodes. For doing ssh, you will need to first create a jumpbox pod under the supervisor namespace and from there you can ssh to the TKC nodes.

Here is the manual procedure: https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html


Following kubectl plugin creats a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs.

kubectl-jumpbox

#!/bin/bash

Help()
{
   # Display Help
   echo "Description: This plugin creats a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs."
   echo "Usage: kubectl jumpbox SVNAMESPACE TKCNAME"
   echo "Example: k exec -it jumpbox-tkc1 -n svns1 -- /usr/bin/ssh vmware-system-user@VMIP"
}

# Get the options
while getopts ":h" option; do
   case $option in
      h) # display Help
         Help
         exit;;
     \?) # incorrect option
         echo "Error: Invalid option"
         exit;;
   esac
done

kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox-$2
  namespace: $1           #REPLACE
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
  
  volumes:
    - name: ssh-key
      secret:
        secretName: $2-ssh     #REPLACE YOUR-CLUSTER-NAME-ssh 

  
EOF

Usage

  • Place the plugin in the system executable path.
  • I placed it in $HOME/.krew/bin directory in my laptop.
  • Once you copied the plugin to the proper path, you can make it executable by: chmod 755 kubectl-jumpbox
  • After that you should be able to run the plugin as: kubectl jumpbox SUPERVISORNAMESPACE TKCNAME


 

Example

❯ kg tkc -n vineetha-dns1-test
NAME               CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
tkc                1               3        v1.21.6---vmware.1-tkg.1.b3d708a   213d   True    True             [1.22.9+vmware.1-tkg.1.cc71bc8]
tkc-using-cci-ui   1               1        v1.23.8---vmware.3-tkg.1           37d    True    True

❯ kg po -n vineetha-dns1-test
NAME         READY   STATUS    RESTARTS   AGE
nginx-test   1/1     Running   0          29d


❯ kubectl jumpbox vineetha-dns1-test tkc
pod/jumpbox-tkc created

❯ kg po -n vineetha-dns1-test
NAME          READY   STATUS    RESTARTS   AGE
jumpbox-tkc   0/1     Pending   0          8s
nginx-test    1/1     Running   0          29d

❯ kg po -n vineetha-dns1-test
NAME          READY   STATUS    RESTARTS   AGE
jumpbox-tkc   1/1     Running   0          21s
nginx-test    1/1     Running   0          29d

❯ k jumpbox -h
Description: This plugin creats a jumpbox pod under a supervisor namespace. You can exec into this jumpbox pod to ssh into the TKC VMs.
Usage: kubectl jumpbox SVNAMESPACE TKCNAME
Example: k exec -it jumpbox-tkc1 -n svns1 -- /usr/bin/ssh vmware-system-user@VMIP

❯ kg vm -n vineetha-dns1-test -o wide
NAME                                                              POWERSTATE   CLASS               IMAGE                                                       PRIMARY-IP      AGE
tkc-control-plane-8rwpk                                           poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.7      133d
tkc-using-cci-ui-control-plane-z8fkt                              poweredOn    best-effort-small   ob-20953521-tkgs-ova-photon-3-v1.23.8---vmware.3-tkg.1      172.29.13.130   37d
tkc-using-cci-ui-tkg-cluster-nodepool-9nf6-n6nt5-b97c86fb45mvgj   poweredOn    best-effort-small   ob-20953521-tkgs-ova-photon-3-v1.23.8---vmware.3-tkg.1      172.29.13.131   37d
tkc-workers-zbrnv-6c98dd84f9-52gn6                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.6      133d
tkc-workers-zbrnv-6c98dd84f9-d9mm7                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.8      133d
tkc-workers-zbrnv-6c98dd84f9-kk2dg                                poweredOn    best-effort-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   172.29.0.3      133d

❯ k exec -it jumpbox-tkc -n vineetha-dns1-test -- /usr/bin/ssh vmware-system-user@172.29.0.7
The authenticity of host '172.29.0.7 (172.29.0.7)' can't be established.
ECDSA key fingerprint is SHA256:B7ptmYm617lFzLErJm7G5IdT7y4SJYKhX/OenSgguv8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.29.0.7' (ECDSA) to the list of known hosts.
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
 13:06:06 up 133 days,  4:46,  0 users,  load average: 0.23, 0.33, 0.27

36 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@tkc-control-plane-8rwpk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]#


Hope it was useful. Cheers!

Saturday, May 20, 2023

vSphere with Tanzu using NSX-T - Part25 - Spherelet

The Spherelet is based on the Kubernetes “Kubelet” and enables an ESXi hypervisor to act as a Kubernetes worker node. Sometimes you may notice that the worker nodes of your supervisor cluster are having NotReady,SchedulingDisabled status, and it maybe becuase spherelet is not running on those ESXi nodes.

Following are the steps to verify the status of spherelet service, and restart them if required.

Example:
❯ kubectx wdc-01-vcxx
Switched to context "wdc-01-vcxx".
❯ kubectl get node
NAME                               STATUS                        ROLES                  AGE    VERSION
42019f7e751b2818bb0c659028d49fdc   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
4201b0b21aed78d8e72bfb622bb8b98b   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
4201c53dcef2701a8c36463942d762dc   Ready                         control-plane,master   317d   v1.22.6+vmware.wcp.2
wdc-01-rxxesx04.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx05.xxxxxxxxx.com      NotReady,SchedulingDisabled   agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx06.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx32.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx33.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx34.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx35.xxxxxxxxx.com      Ready,SchedulingDisabled      agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx36.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx37.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx38.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx39.xxxxxxxxx.com      NotReady,SchedulingDisabled   agent                  317d   v1.22.6-sph-db56d46
wdc-01-rxxesx40.xxxxxxxxx.com      Ready                         agent                  317d   v1.22.6-sph-db56d46

Logs

  • ssh into the ESXi worker node.
tail -f /var/log/spherelet.log 


Status

  • ssh into the ESXi worker node and run the following:
etc/init.d/spherelet status
  •  You can check status of spherelet using PowerCLI. Following is an example:
> Connect-VIServer wdc-10-vcxx

> Get-VMHost | Get-VMHostService | where {$_.Key -eq "spherelet"}  | select VMHost,Key,Running | ft

VMHost                        Key       Running
------                        ---       -------
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet    True

Restart

  • ssh into the ESXi worker node and run the following:
/etc/init.d/spherelet restart
  • You can also restart spherelet service using PowerCLI. Following is an example to restart spherelet service on ALL the ESXi worker nodes of a cluster:
> Get-Cluster

Name                           HAEnabled  HAFailover DrsEnabled DrsAutomationLevel
                                          Level
----                           ---------  ---------- ---------- ------------------
wdc-10-vcxxc01                 True       1          True       FullyAutomated

> Get-Cluster -Name wdc-10-vcxxc01 | Get-VMHost | foreach { Restart-VMHostService -HostService ($_ | Get-VMHostService | where {$_.Key -eq "spherelet"}) }

Certificates

You may notice the ESXi worker nodes in NotReady state when the following spherelet certs expire.
  • /etc/vmware/spherelet/spherelet.crt
  • /etc/vmware/spherelet/client.crt
 
An example is given below:
❯ kg no
NAME STATUS ROLES AGE VERSION
420802008ec0d8ccaa6ac84140768375 Ready control-plane,master 70d v1.22.6+vmware.wcp.2
42087a63440b500de6cec759bb5900bf Ready control-plane,master 77d v1.22.6+vmware.wcp.2
4208e08c826dfe283c726bc573109dbb Ready control-plane,master 77d v1.22.6+vmware.wcp.2
wdc-08-rxxesx25.xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx26.
xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx23.
xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx24.
xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx25.
xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46
wdc-08-rxxesx26.
xxxxxxxxx.com NotReady agent 370d v1.22.6-sph-db56d46

You can ssh into the ESXi worker nodes and verify the validity of the above mentioned certs. They have a life time of one year.
 
Example:
[root@wdc-08-rxxesx25:~] openssl x509 -enddate -noout -in /etc/vmware/spherelet/spherelet.crt
notAfter=Sep 1 08:32:24 2023 GMT
[root@wdc-08-rxxesx25:~] openssl x509 -enddate -noout -in /etc/vmware/spherelet/client.crt
notAfter=Sep 1 08:32:24 2023 GMT
Depending on your support contract, if its a production environment you may need to open a case with VMware GSS for resolving this issue. 
 
Ref KBs: 


Verify

❯ kubectl get node
NAME                               STATUS   ROLES                  AGE     VERSION
42017dcb669bea2962da27fc2f6c16d2   Ready    control-plane,master   5d20h   v1.23.12+vmware.wcp.1
4201b763c766875b77bcb9f04f8840b3   Ready    control-plane,master   5d21h   v1.23.12+vmware.wcp.1
4201dab068e9b2d3af3b8fde450b3d96   Ready    control-plane,master   5d20h   v1.23.12+vmware.wcp.1
wdc-01-rxxesx04.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx05.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx06.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx32.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx33.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx34.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx35.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx36.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx37.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx38.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx39.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
wdc-01-rxxesx40.xxxxxxxxx.com      Ready    agent                  5d19h   v1.23.5-sph-81ef5d1
   

Hope it was useful. Cheers!

Saturday, April 8, 2023

vSphere with Tanzu using NSX-T - Part24 - Kubernetes component certs in TKC

The Kubernetes component certificates inside a TKC (Tanzu Kubernetes Cluster) has lifetime of 1 year. If you manage to upgrade your TKC atleast once a year, these certs will get rotated automatically. 

 

IMPORTANT NOTES: 

  • As per this VMware KB, if TKGS Guest Cluster certificates are expired, you will need to engage VMware support to manually rotate them.  
  • Following troubleshooting steps and workaround are based on studies conducted on my dev/ test/ lab setup, and I will NOT recommend anyone to follow these on your production environment.

 

Symptom:

KUBECONFIG=tkc.kubeconfig kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid

 

Troubleshooting:

  • Verify the certificate expiry of the tkc kubeconfig file itself.
❯ grep client-certificate-data tkc.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Mar  7 18:26:10 2024 GMT
  • Create a jumpbox pod and ssh to TKC control plane nodes.
  • Verify system pods and check logs from apiserver and etcd pods. Sample etcd pod logs are given below:
2023-04-11 07:09:00.268792 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268835 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268841 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.268869 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.310030 I | embed: rejected connection from "172.31.20.27:35362" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.312806 I | embed: rejected connection from "172.31.20.27:35366" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.321449 I | embed: rejected connection from "172.31.20.19:35034" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.322192 I | embed: rejected connection from "172.31.20.19:35036" (error "remote error: tls: bad certificate", ServerName "")
  • Verify whether admin.conf inside the control plane node has expired.
root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Apr  6 06:05:46 2023 GMT
  • Verify Kubernetes component certs in all the control plane nodes.
root [ /etc/kubernetes ]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 06, 2023 06:05 UTC   <invalid>                               no
apiserver                  Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 06, 2023 06:05 UTC   <invalid>                               no
etcd-healthcheck-client    Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 06, 2023 06:05 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 06, 2023 06:05 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 05, 2032 18:15 UTC   8y              no
etcd-ca                 Mar 05, 2032 18:15 UTC   8y              no
front-proxy-ca          Mar 05, 2032 18:15 UTC   8y              no

 

Workaround:

  • Renew Kubernetes component certs on control plane nodes if expired using kubeadm certs renew all.
root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

 

Verify:

  • Verify using the following steps on all the TKC control plane nodes.
root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

root [ /etc/kubernetes ]# kubeadm certs check-expiration

  • Try connect to the TKC using tkc.kubeconfig.
KUBECONFIG=tkc.kubeconfig kubectl get node

Hope it was useful. Cheers!

References: 

Friday, January 6, 2023

vSphere with Tanzu using NSX-T - Part22 - Working with NGINX Ingress Controller

In this article we will go though the steps to deploy a nginx ingress controller on a Tanzu Kubernetes cluster (TKC) and create a simple ingress resource to test its basic functionality.

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
tkc-control-plane-5m9hd Ready control-plane,master 36d v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-76f2t Ready <none> 36d v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-mtqh7 Ready <none> 36d v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-xh2gz Ready <none> 36d v1.23.8+vmware.3

❯ gcc k apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.7.0/deploy/static/provider/cloud/deploy.yaml --namespace=ingress-nginx
namespace/ingress-nginx created
serviceaccount/ingress-nginx created
serviceaccount/ingress-nginx-admission created
role.rbac.authorization.k8s.io/ingress-nginx created
role.rbac.authorization.k8s.io/ingress-nginx-admission created
clusterrole.rbac.authorization.k8s.io/ingress-nginx created
clusterrole.rbac.authorization.k8s.io/ingress-nginx-admission created
rolebinding.rbac.authorization.k8s.io/ingress-nginx created
rolebinding.rbac.authorization.k8s.io/ingress-nginx-admission created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx-admission created
configmap/ingress-nginx-controller created
service/ingress-nginx-controller created
service/ingress-nginx-controller-admission created
deployment.apps/ingress-nginx-controller created
job.batch/ingress-nginx-admission-create created
job.batch/ingress-nginx-admission-patch created
ingressclass.networking.k8s.io/nginx created
validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission created
 
❯ gcc kg ns
NAME STATUS AGE
default Active 57d
external-dns Active 57d
ingress-nginx Active 17s
kube-node-lease Active 57d
kube-public Active 57d
kube-system Active 57d
vmware-system-auth Active 57d
vmware-system-cloud-provider Active 57d
vmware-system-csi Active 57d

❯ gcc kg deployment,po,svc,ep -n ingress-nginx
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ingress-nginx-controller 1/1 1 1 21h

NAME READY STATUS RESTARTS AGE
pod/ingress-nginx-admission-create-h4sbz 0/1 Completed 0 21h
pod/ingress-nginx-admission-patch-bw2fr 0/1 Completed 0 21h
pod/ingress-nginx-controller-5795977b8-nfrb8 1/1 Running 0 21h

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ingress-nginx-controller LoadBalancer 10.96.114.127 10.186.124.41 80:30061/TCP,443:31417/TCP 21h
service/ingress-nginx-controller-admission ClusterIP 10.98.183.189 <none> 443/TCP 21h

NAME ENDPOINTS AGE
endpoints/ingress-nginx-controller 192.168.7.8:443,192.168.7.8:80 21h
endpoints/ingress-nginx-controller-admission 192.168.7.8:8443 21h

Now the nginx ingress controller is deployed. You can also see the service/ingress-nginx-controller has already got an external IP from NSX-T.

Note: gcc is an alias which points to my TKC kubeconfig file.

❯ alias gcc
gcc='KUBECONFIG=gckubeconfig'

Lets create a sample deployment and expose it as a service under namespace ingress-nginx.

❯ gcc kubectl create deployment web --image=gcr.io/google-samples/hello-app:1.0 -n ingress-nginx
deployment.apps/web created
❯ gcc kubectl expose deployment web --type=NodePort --port=8080 -n ingress-nginx
service/web exposed

❯ gcc k get deployments.apps web -n ingress-nginx
NAME READY UP-TO-DATE AVAILABLE AGE
web 1/1 1 1 28s
❯ gcc k get svc web -n ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web NodePort 10.105.243.33 <none> 8080:30750/TCP 28s
❯ gcc k get ep web -n ingress-nginx
NAME ENDPOINTS AGE
web 192.168.1.9:8080 39s

Create a pod on the TKC and try to access the svc web from inside the pod. I've already deployed a nginx pod.

❯ gcc k get po nginx
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 96m

❯ gcc k exec -it nginx -- curl 10.105.243.33:8080
Hello, world!
Version: 1.0.0
Hostname: web-746c8679d4-ptmgh

Lets create a second deployment under namespace ingress-nginx.

❯ gcc kubectl create deployment web2 --image=gcr.io/google-samples/hello-app:2.0 -n ingress-nginx
deployment.apps/web2 created

❯ gcc kubectl expose deployment web2 --port=8080 --type=NodePort -n ingress-nginx
service/web2 exposed


❯ gcc k get deployment web2 -n ingress-nginx
NAME READY UP-TO-DATE AVAILABLE AGE
web2 1/1 1 1 56s
❯ gcc k get svc web2 -n ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web2 NodePort 10.99.79.19 <none> 8080:31695/TCP 65s
❯ gcc k get ep web2 -n ingress-nginx
NAME ENDPOINTS AGE
web2 192.168.2.13:8080 73s

Verify svc web2.

❯ gcc k exec -it nginx -- curl 10.99.79.19:8080
Hello, world!
Version: 2.0.0
Hostname: web2-5858b4c7c5-tmn8x

Service web and web2 are accessible within the TKC. We've already verified it from the nginx pod that runs within the same TKC.

Now, we will create an ingress resource under namespace ingress-nginx.

❯ cat ing-01.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: hello-world-ing
annotations:
kubernetes.io/ingress.class: "nginx"
spec:
rules:
- host: hello-world.info
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: web
port:
number: 8080
- path: /v2
pathType: Prefix
backend:
service:
name: web2
port:
number: 8080
❯ gcc k create -f ing-01.yaml -n ingress-nginx
ingress.networking.k8s.io/hello-world-ing created

❯ gcc k get ing -n ingress-nginx
NAME CLASS HOSTS ADDRESS PORTS AGE
hello-world-ing <none> hello-world.info 80 55s
❯ gcc k get ing -n ingress-nginx
NAME CLASS HOSTS ADDRESS PORTS AGE
hello-world-ing <none> hello-world.info 10.186.124.41 80 56s

I've created a entry in /etc/hosts file in my laptop so that hello-world.info resolves to 10.186.124.41 which is the external IP of service/ingress-nginx-controller.

❯ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
# Added by Docker Desktop
# To allow the same kube context to work on the host and the container:
127.0.0.1 kubernetes.docker.internal
10.186.124.41 hello-world.info
# End of section

Now from my laptop when I curl to hello-world.info, the request will be served by web svc, and when I curl to hello-world.info/v2, it will be served by web2 svc.


❯ curl hello-world.info
Hello, world!
Version: 1.0.0
Hostname: web-746c8679d4-ptmgh

❯ curl hello-world.info/v2
Hello, world!
Version: 2.0.0
Hostname: web2-5858b4c7c5-tmn8x

Hope it was useful. Cheers! 

References:

https://kubernetes.io/docs/tasks/access-application-cluster/ingress-minikube/
https://kubernetes.github.io/ingress-nginx/user-guide/basic-usage/

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article we will look at a TKC that is stuck at updating phase which has multiple Kubernetes nodes in NotReady state. 

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
5

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-tnhv9 NotReady                    control-plane,master 63d v1.20.9+vmware.1
gc-control-plane-tqvnk NotReady                   control-plane,master 50d v1.20.9+vmware.1
gc-control-plane-wsclb NotReady                   <none> 8d v1.20.9+vmware.1
gc-control-plane-wt6sx NotReady                   <none> 30d v1.20.9+vmware.1
gc-control-plane-zthnq NotReady                   control-plane,master 49d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

Note: gcc is alias that I am using for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Lets verify where etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

You can see etcd pods are running on nodes that are in Ready status. So now we can go ahead and safely drain and delete the nodes that are NotReady.

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc updating 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3

Now, I waited for few minutes to see whether the reconciliation process will proceed and change the status of the TKC from updating to running. But it was still stuck at updating phase. So I described the TKC.

Conditions:
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: Ready
Last Transition Time: 2023-01-01T19:19:45Z
Status: True
Type: AddonsReady
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: ControlPlaneReady
Last Transition Time: 2022-07-24T15:53:06Z
Status: True
Type: NodePoolsReady
Last Transition Time: 2022-09-01T09:02:26Z
Message: 3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
Status: True
Type: NodesHealthy

Checked vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

It says something is wrong with CP node gc-control-plane-2rbsb.
❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

You can see etcd pod is not running on first control plane node and is getting continuously restarted. So lets try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready,SchedulingDisabled control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

Now lets delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


After few minutes you can see a new machine and the corresponding node got provisioned and the TKC changed from updating to running phase.

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-control-plane-dnr66 gc Provisioning 13s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 124d v1.20.9+vmware.1
gc-control-plane-dnr66 gc gc-control-plane-dnr66 vsphere://42011228-b156-3338-752a-e7233c9258dd Running 2m2s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 NotReady control-plane,master 35s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 Ready control-plane,master 53s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc running 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3
Hope it was useful. Cheers!