vineethac.blogspot.com: Control Plane Nodes

Wednesday, June 26, 2024

vSphere with Tanzu using NSX-T - Part33 - Troubleshooting intermittent connection timeouts to apiserver and workloads

In the realm of managing Tanzu Kubernetes clusters (TKCs), we have encountered several challenges that hindered the smooth functioning of our applications. In this blog post, we will discuss three such cases and the workarounds we employed to resolve them.

Case 1: TKC Control Plane Node Connectivity Issues

Symptoms:

TKC apiserver connection timeouts when attempting to connect using the kubeconfig.
Traffic was not flowing to two of the control plane nodes.
NSX-T web UI LB VS stats indicated this issue.

Case 2: TKC Worker Node Connectivity Issues

Symptoms:

Workload (example: PostgreSQL cluster) connection timeouts.
Traffic was not flowing to two of the worker nodes in the TKC.
NSX-T web UI LB VS stats indicated this issue.

Case 3: Load Balancer Connectivity Issues

Symptoms:

Connection timeouts when attempting to connect to a PostgreSQL workload through the load balancer VS IP.
This issue was observed only when creating new services of type LoadBalancer in the TKC.
We noticed datapath mempool usage for the edge nodes was above the threshold value.

Resolution/workaround

Find the T1 router attached to the TKC that has connectivity issues.
In an Active-Standby HA configuration, one Edge node will be Active and the other will be in Standby status.
First, place the Standby Edge node in NSX maintenance mode (MM), reboot it, and then exit maintenance mode.
Next, place the Active Edge node in NSX maintenance mode. Expect a brief network disruption during this failover. Once the node is in maintenance mode, reboot it, and then exit maintenance mode.
This should resolve the issue.

In conclusion, these cases illustrate the importance of verifying NSX-T components when managing Tanzu Kubernetes clusters. By identifying the root cause of the issues and employing effective workarounds, we were able to restore functionality and maintain the health of our applications. Stay tuned for more insights and best practices in managing Kubernetes clusters.

Hope it was useful. Cheers!

Saturday, June 22, 2024

vSphere with Tanzu using NSX-T - Part31 - Troubleshooting inaccessible TKC with expired control plane certs

In the course of managing multiple Tanzu Kubernetes Clusters (TKC), I encountered an unexpected issue: the control plane certificates had expired, preventing us from accessing the cluster using the kubeconfig file. To make matters worse, we were unable to SSH into the TKC control plane Virtual Machines (VMs) due to the vmware-system-user password expiring in accordance with STIG Hardening.

The recommended workaround for updating the vmware-system-user password expiry involves applying a specific daemonset on Guest Clusters. However, this approach requires access to the TKC using its admin kubeconfig file, which was unavailable due to the expired certificates.

Warning: In case of critical production issues that affect the accessibility of your Tanzu Kubernetes Cluster (TKC), it is strongly advised to submit a product support request to our team for assistance. This will ensure that you receive expert guidance and a timely resolution to help minimize the impact on your environment.

To resolve this issue, I followed an alternative workaround: I reset the root password of the TKC control plane VMs through the vCenter VM console, as outlined in this knowledge base article. Once the root password was reset, I was able to log directly into the TKC control plane VM using the VM console.

After gaining access to the TKC control plane VM, I proceeded to renew the control plane certificates using kubeadm, as detailed in this blog post. It's essential to apply this process to all control plane nodes in your cluster to ensure proper functionality.

root [ /etc/kubernetes ]# kubeadm certs check-expiration

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Although this workaround required some additional steps, it ultimately allowed us to regain access to our Tanzu Kubernetes Cluster and maintain its security and functionality.

Hope it was useful. Cheers!

Saturday, May 25, 2024

vSphere with Tanzu using NSX-T - Part30 - Troubleshooting inaccessible TKC with server pool members missing in the LB VS

Encountering issues with connectivity to your TKC apiserver/ control plane can be frustrating. One common problem we've seen is the kubeconfig failing to connect, often due to missing server pool members in the load balancer's virtual server (LB VS).

The Issue

The LB VS, which operates on port 6443, should have the control plane VMs listed as its member servers. When these members are missing, connectivity problems arise, disrupting your access to the TKC apiserver.

Troubleshooting steps

Access the TKC: Use the kubeconfig to access the TKC.

❯ KUBECONFIG=tkc.kubeconfig kubectl get node
Unable to connect to the server: dial tcp 10.191.88.4:6443: i/o timeout
❯

Check the Load Balancer: In NSX-T, verify the status of the corresponding load balancer (LB). It may display a green status indicating success.
Inspect Virtual Servers: Check the virtual servers in the LB, particularly on port 6443. They might show as down.
Examine Server Pool Members: Look into the server pool members of the virtual server. You may find it empty.
SSH to Control Plane Nodes: Attempt to SSH into the TKC control plane nodes.

Run Diagnostic Commands: Execute diagnostic commands inside the control plane nodes to verify their status. The issue could be that the control plane VMs are in a hung state, and the container runtime is not running.

vmware-system-user@tkc-infra-r68zc-jmq4j [ ~ ]$ sudo su
root [ /home/vmware-system-user ]# crictl ps
FATA[0002] failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl is-active containerd
Failed to retrieve unit state: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl status containerd
WARNING: terminal is not fully functional
-  (press RETURN)Failed to get properties: Failed to activate service 'org.freedesktop.systemd1'>
lines 1-1/1 (END)lines 1-1/1 (END)

Check VM Console: From vCenter, check the console of the control plane VMs. You might see specific errors indicating issues.

EXT4-fs (sda3): Delayed block allocation failed for inode 266704 at logical offset 10515 with max blocks 2 with error 5
EXT4-fs (sda3): This should not happen!! Data will be lost
EXT4-fs error (device sda3) in ext4_writepages:2905: IO failure
EXT4-fs error (device sda3) in ext4_reserve_inode_write:5947: Journal has aborted
EXT4-fs error (device sda3) xxxxxx-xxx-xxxx: unable to read itable block
EXT4-fs error (device sda3) in ext4_journal_check_start:61: Detected aborted journal
systemd[1]: Caught <BUS>, dumped core as pid 24777.
systemd[1]: Freezing execution.

Restart Control Plane VMs: Restart the control plane VMs. Note that sometimes your admin credentials or administrator@vsphere.local credentials may not allow you to restart the TKC VMs. In such cases, decode the username and password from the relevant secret and use these credentials to connect to vCenter and restart the hung TKC VMs.

❯ kubectx wdc-01-vc17
Switched to context "wdc-01-vc17".
❯
❯ kg secret -A | grep wcp
kube-system                                 wcp-authproxy-client-secret                                               kubernetes.io/tls                                  3      291d
kube-system                                 wcp-authproxy-root-ca-secret                                              kubernetes.io/tls                                  3      291d
kube-system                                 wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-nsop                          wcp-nsop-sa-vc-auth                                                       Opaque                                             2      291d
vmware-system-nsx                           wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-vmop                          wcp-vmop-sa-vc-auth                                                       Opaque                                             2      291d
❯
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth
NAME                  TYPE     DATA   AGE
wcp-vmop-sa-vc-auth   Opaque   2      291d
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth -oyaml
apiVersion: v1
data:
  password: aWAmbHUwPCpKe1Uxxxxxxxxxxxx=
  username: d2NwLXZtb3AtdXNlci1kb21haW4tYzEwMDYtMxxxxxxxxxxxxxxxxxxxxxxxxQHZzcGhlcmUubG9jYWw=
kind: Secret
metadata:
  creationTimestamp: "2022-10-24T08:32:26Z"
  name: wcp-vmop-sa-vc-auth
  namespace: vmware-system-vmop
  resourceVersion: "336557268"
  uid: dcbdac1b-18bb-438c-ba11-76ed4d6bef63
type: Opaque
❯
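The username and password values under data in the secret above are base64-encoded, not encrypted. A minimal sketch of the decode step follows; decode_secret_field is a hypothetical helper name used only for illustration, and it assumes the Secret YAML is piped in on stdin.

```shell
#!/bin/sh
# Sketch only: decode_secret_field is a hypothetical helper, not part of kubectl.
# It reads a Secret's YAML on stdin and prints the decoded value of one key
# found under "data:".
decode_secret_field() {
  # $1 = key name, for example "username" or "password"
  awk -v key="$1" '$1 == (key ":") { print $2; exit }' | base64 -d
}
```

For example, the YAML printed by kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth -oyaml could be piped into decode_secret_field username. kubectl can also select a single field directly with -o jsonpath='{.data.username}', whose output can then be piped to base64 -d.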

***Decode the base64-encoded username and password from the secret and use them to connect to vCenter.
***The following is an example using PowerCLI:

PS /Users/vineetha> get-vm gc-control-plane-f266h

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VMGuest
Restart-VMGuest: 08/04/2023 22:20:20	Restart-VMGuest		Operation "Restart VM guest" failed for VM "gc-control-plane-f266h" for the following reason: A general system error occurred: Invalid fault
PS /Users/vineetha>
PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VM

Confirm
Are you sure you want to perform this action?
Performing the operation "Restart-VM" on target "VM 'gc-control-plane-f266h'".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha>

Verify System Pods and Connectivity: Once the control plane VMs are restarted, the system pods inside them will start, and the apiserver will become accessible using the kubeconfig. You should also see the previously missing server pool members reappear in the corresponding LB virtual server, and the virtual server on port 6443 will be up and show a success status.

Following these steps should help you resolve connectivity issues with your TKC apiserver/control plane. Ensuring that your load balancer's virtual server is correctly configured with the appropriate member servers is crucial for maintaining seamless access. This runbook aims to guide you through the process, helping you get your TKC apiserver back online swiftly.

Note: For critical production issues related to TKC accessibility, I strongly recommend raising a product support request.

Hope it was useful. Cheers!

Saturday, April 8, 2023

vSphere with Tanzu using NSX-T - Part24 - Kubernetes component certs in TKC

The Kubernetes component certificates inside a TKC (Tanzu Kubernetes Cluster) have a lifetime of 1 year. If you upgrade your TKC at least once a year, these certs will get rotated automatically.

IMPORTANT NOTES:

As per this VMware KB, if TKGS Guest Cluster certificates are expired, you will need to engage VMware support to manually rotate them.
The following troubleshooting steps and workaround are based on studies conducted on my dev/test/lab setup, and I do NOT recommend following them in your production environment.

Symptom:

❯ KUBECONFIG=tkc.kubeconfig kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid

Troubleshooting:

Verify the certificate expiry of the tkc kubeconfig file itself.

❯ grep client-certificate-data tkc.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Mar  7 18:26:10 2024 GMT
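The one-liner above prints the validity dates for a human to read. For scripting, a pass/fail check can be more convenient. The sketch below is an assumption-laden illustration: kubeconfig_client_cert and cert_is_valid are hypothetical helper names, and the second one relies on the -checkend option of openssl x509.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Print the PEM client certificate embedded in a kubeconfig file.
kubeconfig_client_cert() {
  # $1 = path to the kubeconfig file
  awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" | base64 -d
}

# Return 0 if the PEM certificate on stdin is still valid for at least
# $1 more seconds (default 0), non-zero otherwise.
cert_is_valid() {
  openssl x509 -noout -checkend "${1:-0}" >/dev/null
}
```

With these in place, kubeconfig_client_cert tkc.kubeconfig | cert_is_valid 604800 would succeed only if the client certificate remains valid for at least another week.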

Create a jumpbox pod and ssh to TKC control plane nodes.
Verify system pods and check logs from apiserver and etcd pods. Sample etcd pod logs are given below:

2023-04-11 07:09:00.268792 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268835 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268841 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.268869 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.310030 I | embed: rejected connection from "172.31.20.27:35362" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.312806 I | embed: rejected connection from "172.31.20.27:35366" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.321449 I | embed: rejected connection from "172.31.20.19:35034" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.322192 I | embed: rejected connection from "172.31.20.19:35036" (error "remote error: tls: bad certificate", ServerName "")

Verify whether admin.conf inside the control plane node has expired.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Apr  6 06:05:46 2023 GMT

Verify Kubernetes component certs in all the control plane nodes.

root [ /etc/kubernetes ]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 06, 2023 06:05 UTC   <invalid>                               no
apiserver                  Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 06, 2023 06:05 UTC   <invalid>                               no
etcd-healthcheck-client    Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 06, 2023 06:05 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 06, 2023 06:05 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 05, 2032 18:15 UTC   8y              no
etcd-ca                 Mar 05, 2032 18:15 UTC   8y              no
front-proxy-ca          Mar 05, 2032 18:15 UTC   8y              no
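When this check has to be repeated on every control plane node, it can help to reduce the table to just the expired entries. A minimal sketch, assuming the output format shown above; expired_certs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: expired_certs is a hypothetical helper.
# It reads the output of `kubeadm certs check-expiration` on stdin and prints
# the name of every entry whose RESIDUAL TIME column shows <invalid>.
expired_certs() {
  awk '/<invalid>/ { print $1 }'
}
```

It could be used as kubeadm certs check-expiration | expired_certs; an empty result means no entry was reported as invalid.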

Workaround:

Renew Kubernetes component certs on control plane nodes if expired using kubeadm certs renew all.

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
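The kubeadm output above asks for the control plane components to be restarted, but this post does not show how that was done. One commonly documented approach for kubeadm-managed static pods is to move their manifests out of the manifest directory for a short while and then move them back. The sketch below assumes the components run as static pods from /etc/kubernetes/manifests and that a 20 second pause is long enough for the kubelet to notice; restart_static_pods is a hypothetical helper name, and both defaults are assumptions to adjust for your environment.

```shell
#!/bin/sh
# Sketch only: restart_static_pods is a hypothetical helper.
# It moves every manifest out of the static pod directory, waits so the
# kubelet can stop the pods, then moves the manifests back so the kubelet
# recreates them.
restart_static_pods() {
  manifest_dir=${1:-/etc/kubernetes/manifests}   # assumed kubeadm default
  wait_seconds=${2:-20}                          # assumed pause, tune as needed
  backup_dir=$(mktemp -d) || return 1

  for f in "$manifest_dir"/*.yaml; do
    [ -e "$f" ] || continue
    mv "$f" "$backup_dir"/ || return 1
  done

  sleep "$wait_seconds"

  for f in "$backup_dir"/*.yaml; do
    [ -e "$f" ] || continue
    mv "$f" "$manifest_dir"/ || return 1
  done
  rmdir "$backup_dir"
}
```

Running this on one control plane node at a time keeps the other nodes serving while the pods on the current node restart.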

Verify:

Verify using the following steps on all the TKC control plane nodes.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

root [ /etc/kubernetes ]# kubeadm certs check-expiration

Try connecting to the TKC using tkc.kubeconfig.

KUBECONFIG=tkc.kubeconfig kubectl get node

Hope it was useful. Cheers!

References:

https://kb.vmware.com/s/article/86251

https://kb.vmware.com/s/article/89324

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article, we will look at a TKC that is stuck in the updating phase and has multiple Kubernetes nodes in NotReady state.

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
       5

❯ gcc kg no
NAME                                STATUS                        ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready                         control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-tnhv9              NotReady                      control-plane,master   63d    v1.20.9+vmware.1
gc-control-plane-tqvnk              NotReady                      control-plane,master   50d    v1.20.9+vmware.1
gc-control-plane-wsclb              NotReady                      <none>                 8d     v1.20.9+vmware.1
gc-control-plane-wt6sx              NotReady                      <none>                 30d    v1.20.9+vmware.1
gc-control-plane-zthnq              NotReady                      control-plane,master   49d    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                         <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                         <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                         <none>                 458d   v1.20.9+vmware.1

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

Note: gcc is an alias that I use for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Let's verify where the etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see etcd pods are running on nodes that are in Ready status. So now we can go ahead and safely drain and delete the nodes that are NotReady.

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----
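The grep NotReady filter used above matches the string anywhere on a line. A slightly stricter variant checks only the STATUS column, and printing the drain and delete commands before running them gives a chance to review the node list. This is a sketch under those assumptions; not_ready_nodes and print_cleanup_commands are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column starts with NotReady.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Read node names on stdin and print (not run) the drain and delete
# commands for each of them, so they can be reviewed first.
print_cleanup_commands() {
  while IFS= read -r node; do
    [ -n "$node" ] || continue
    printf 'kubectl drain %s --ignore-daemonsets\n' "$node"
    printf 'kubectl delete node %s\n' "$node"
  done
}
```

For example, gcc kubectl get nodes --no-headers | not_ready_nodes | print_cleanup_commands lists the commands that the loop above would run.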

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready    control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯
❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01    gc       updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

I waited a few minutes to see whether the reconciliation process would proceed and change the status of the TKC from updating to running, but it was still stuck in the updating phase. So I described the TKC.

Conditions:
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-01-01T19:19:45Z
    Status:                True
    Type:                  AddonsReady
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-07-24T15:53:06Z
    Status:                True
    Type:                  NodePoolsReady
    Last Transition Time:  2022-09-01T09:02:26Z
    Message:               3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
    Status:                True
    Type:                  NodesHealthy
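In a long describe output it is easy to miss which conditions are failing. A minimal sketch that pulls out only the condition types whose Status is False, assuming each condition lists Status before Type as in the output above; failing_conditions is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: failing_conditions is a hypothetical helper.
# It reads a Conditions block on stdin (Status listed before Type for each
# condition) and prints the Type of every condition whose Status is False.
failing_conditions() {
  awk '
    $1 == "Status:" { status = $2 }
    $1 == "Type:"   { if (status == "False") print $2; status = "" }
  '
}
```

Applied to the conditions shown above, this would report Ready and ControlPlaneReady, the two conditions carrying the RollingUpdateInProgress reason.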

Next, I checked the vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

The readiness probe to 172.31.14.6:6443 is being refused, which indicates that something is wrong with CP node gc-control-plane-2rbsb.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see that the etcd pod on the first control plane node is not ready (0/1) and keeps restarting (811 restarts). So let's try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
 gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME                                STATUS                     ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready,SchedulingDisabled   control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                      <none>                 458d   v1.20.9+vmware.1

Now let's delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted
❯
❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯

After a few minutes, you can see that a new machine and its corresponding node were provisioned, and the TKC changed from the updating phase to the running phase.

❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE          AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running        124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running        123d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc                                                                                             Provisioning   13s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running        458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running        456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running        458d   v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   124d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc        gc-control-plane-dnr66              vsphere://42011228-b156-3338-752a-e7233c9258dd   Running   2m2s   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS     ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready      control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              NotReady   control-plane,master   35s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready      <none>                 458d   v1.20.9+vmware.1


❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              Ready    control-plane,master   53s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01     gc     running      2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3
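
The same verification can be scripted. The sketch below is an illustration and not part of the original procedure. It assumes the machine listing has been captured as text in the fixed-width column layout shown above, and the function names `parse_table` and `machines_not_running` are hypothetical. Columns are sliced by the header offsets instead of splitting on whitespace, because a machine that is still in the Provisioning phase has empty NODENAME and PROVIDERID cells.

```python
import re


def parse_table(text):
    """Parse fixed-width tabular CLI output into a list of row dicts.

    Column boundaries are taken from the start offset of each header word,
    so empty cells (for example a missing NODENAME) do not shift the columns.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0]
    # (column name, start offset) for every header word
    columns = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    rows = []
    for line in lines[1:]:
        row = {}
        for index, (name, start) in enumerate(columns):
            # A column ends where the next one starts; the last runs to end of line
            end = columns[index + 1][1] if index + 1 < len(columns) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


def machines_not_running(text):
    """Return the names of machines whose PHASE column is not 'Running'."""
    return [row["NAME"] for row in parse_table(text) if row.get("PHASE") != "Running"]
```

An empty list from `machines_not_running` corresponds to the final listing above, where every machine is in the Running phase.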

Hope it was useful. Cheers!

Friday, April 8, 2022

Working with Kubernetes using Python - Part 03 - Get nodes

The following code snippet uses the kubeconfig Python module to switch context and the Python client for the Kubernetes API to get cluster node details. It reads the default kubeconfig file, switches to the required context, and gets the node info of the respective cluster.

kubectl commands:

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>
kubectl get nodes -o json

Code:
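
A minimal sketch of this flow, not the original snippet. It assumes the official `kubernetes` Python client is installed and that the default kubeconfig file is in use. Instead of the kubeconfig module, it selects the context through the client's own `config.load_kube_config(context=...)`, which applies only to the running process and does not change the current context stored in the kubeconfig file. The function names `summarize_nodes` and `get_nodes` are hypothetical.

```python
def summarize_nodes(node_list):
    """Summarize a node list shaped like `kubectl get nodes -o json` output.

    Returns one dict per node with its name, Ready state and kubelet version.
    """
    summary = []
    for item in node_list.get("items", []):
        status = item.get("status", {})
        # A node is Ready when its "Ready" condition reports status "True"
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        summary.append({
            "name": item["metadata"]["name"],
            "ready": ready,
            "version": status.get("nodeInfo", {}).get("kubeletVersion", ""),
        })
    return summary


def get_nodes(context_name):
    """Load the given kubeconfig context and return a summary of its nodes."""
    # Imported here so summarize_nodes() stays usable without the client installed
    from kubernetes import client, config

    # Assumption: the default kubeconfig location; the context is chosen per process
    config.load_kube_config(context=context_name)
    v1 = client.CoreV1Api()
    nodes = v1.list_node()
    # Convert the client's model objects into plain dicts with JSON-style keys
    return summarize_nodes(v1.api_client.sanitize_for_serialization(nodes))
```

`summarize_nodes` can also be fed the parsed output of `kubectl get nodes -o json` directly, which is convenient when the Python client is not available.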

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/
https://github.com/kubernetes-client/python

Hope it was useful. Cheers!
