vineethac.blogspot.com

Saturday, April 8, 2023

vSphere with Tanzu using NSX-T - Part24 - Kubernetes component certs in TKC

The Kubernetes component certificates inside a TKC (Tanzu Kubernetes Cluster) have a lifetime of 1 year. If you upgrade your TKC at least once a year, these certs get rotated automatically.

IMPORTANT NOTES:

As per this VMware KB, if TKGS Guest Cluster certificates are expired, you will need to engage VMware support to manually rotate them.
The following troubleshooting steps and workaround are based on tests in my dev/test/lab setup, and I do NOT recommend following them in a production environment.

Symptom:

❯ KUBECONFIG=tkc.kubeconfig kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid

Troubleshooting:

Verify the certificate expiry of the tkc kubeconfig file itself.

❯ grep client-certificate-data tkc.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Mar  7 18:26:10 2024 GMT
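
The grep/awk pipeline above can be wrapped in a small helper so the same check can be reused for tkc.kubeconfig and for admin.conf. This is only a sketch: the function name kubeconfig_client_cert is made up for illustration, and it assumes the kubeconfig embeds the certificate inline as client-certificate-data and that only the first such entry matters.

```shell
# Hypothetical helper: print the client certificate embedded in a kubeconfig.
# Assumes inline client-certificate-data (not a client-certificate file path).
kubeconfig_client_cert() {
  # Take the first client-certificate-data entry and decode it from base64.
  awk '/client-certificate-data:/ {print $2; exit}' "$1" | base64 -d
}

# Example (requires openssl):
# kubeconfig_client_cert tkc.kubeconfig | openssl x509 -noout -dates
```

Stopping at the first match keeps the output a single certificate even if the file defines more than one user.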

Create a jumpbox pod and ssh to TKC control plane nodes.
Verify system pods and check logs from apiserver and etcd pods. Sample etcd pod logs are given below:

2023-04-11 07:09:00.268792 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268835 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268841 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.268869 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.310030 I | embed: rejected connection from "172.31.20.27:35362" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.312806 I | embed: rejected connection from "172.31.20.27:35366" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.321449 I | embed: rejected connection from "172.31.20.19:35034" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.322192 I | embed: rejected connection from "172.31.20.19:35036" (error "remote error: tls: bad certificate", ServerName "")

Verify whether admin.conf inside the control plane node has expired.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Apr  6 06:05:46 2023 GMT

Verify Kubernetes component certs in all the control plane nodes.

root [ /etc/kubernetes ]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 06, 2023 06:05 UTC   <invalid>                               no
apiserver                  Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 06, 2023 06:05 UTC   <invalid>                               no
etcd-healthcheck-client    Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 06, 2023 06:05 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 06, 2023 06:05 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 05, 2032 18:15 UTC   8y              no
etcd-ca                 Mar 05, 2032 18:15 UTC   8y              no
front-proxy-ca          Mar 05, 2032 18:15 UTC   8y              no

Workaround:

If the Kubernetes component certs have expired, renew them on each control plane node using kubeadm certs renew all.

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
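
The kubeadm message asks for a restart of the control plane components but does not say how. On kubeadm-based control planes these components usually run as static pods, and a common way to restart them is to move their manifests out of the kubelet manifest directory for a short while and then move them back. The sketch below assumes the kubeadm default path /etc/kubernetes/manifests and the default manifest file names; I have not confirmed that this is the supported method on a TKC, the function name restart_static_pods is made up, and the lab-only warning above applies here too.

```shell
# Hypothetical helper: restart kubeadm static pods by moving their manifests
# out of the kubelet's manifest directory and back again.
# $1: manifest directory (kubeadm default shown), $2: seconds to wait.
restart_static_pods() {
  manifest_dir="${1:-/etc/kubernetes/manifests}"
  wait_seconds="${2:-20}"
  backup_dir="$(mktemp -d)" || return 1

  # Move only the four components named in the kubeadm message.
  for name in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    if [ -f "$manifest_dir/$name.yaml" ]; then
      mv "$manifest_dir/$name.yaml" "$backup_dir/"
    fi
  done

  # Give the kubelet time to notice the missing manifests and stop the pods.
  sleep "$wait_seconds"

  # Put the manifests back so the kubelet recreates the pods.
  for manifest in "$backup_dir"/*.yaml; do
    if [ -f "$manifest" ]; then
      mv "$manifest" "$manifest_dir/"
    fi
  done
  rmdir "$backup_dir"
}

# Example, run as root on one control plane node at a time:
# restart_static_pods /etc/kubernetes/manifests 20
```

Any other manifests in the directory are left untouched, so only the four components that consume the renewed certs get restarted.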

Verify:

Verify using the following steps on all the TKC control plane nodes.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

root [ /etc/kubernetes ]# kubeadm certs check-expiration

Try connecting to the TKC using tkc.kubeconfig.

KUBECONFIG=tkc.kubeconfig kubectl get node

Hope it was useful. Cheers!

References:

https://kb.vmware.com/s/article/86251

https://kb.vmware.com/s/article/89324

Saturday, March 18, 2023

Kubernetes 101 - Part7 - Restart all deployments and daemonsets in a namespace

Restart all deployments in a namespace

❯ kubectl rollout restart deployments -n <namespace>

Restart all daemonsets in a namespace

❯ kubectl rollout restart daemonsets -n <namespace>

Hope it was useful. Cheers!

Saturday, March 11, 2023

Kubernetes 101 - Part6 - Get static pods

Static pods are directly managed by the kubelet on a specific node. More about static pods can be found here: https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/

In this post we will take a look at how to find all static pods in a Kubernetes cluster. For a static pod the owner reference kind will be Node.

custom-columns:

❯ kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,CONTROLLER:'.metadata.ownerReferences[].kind',NAMESPACE:.metadata.namespace | grep Node
❯
❯ kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,CONTROLLER:'.metadata.ownerReferences[].kind',NAMESPACE:.metadata.namespace | grep Node | wc -l

jsonpath:

❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}'
❯
❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}' | jq
❯ 
❯ kubectl get pods -A -o=jsonpath='{.items[*].metadata.ownerReferences[?(@.kind=="Node")]}' | jq | grep Node | wc -l
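
As a variation on the grep in the custom-columns approach, the owner kind column can be matched exactly with awk, which avoids counting any owner kind that merely contains the word Node. This is a rough sketch: the helper name filter_static_pods is hypothetical, and it assumes the three-column layout (NAME, CONTROLLER, NAMESPACE) used above.

```shell
# Hypothetical helper: keep only rows whose CONTROLLER column is exactly "Node".
# Reads the custom-columns output (NAME CONTROLLER NAMESPACE) on stdin and
# prints one namespace/name pair per static pod.
filter_static_pods() {
  awk 'NR > 1 && $2 == "Node" {print $3 "/" $1}'
}

# Example:
# kubectl get pods --all-namespaces \
#   -o custom-columns=NAME:.metadata.name,CONTROLLER:'.metadata.ownerReferences[].kind',NAMESPACE:.metadata.namespace \
#   | filter_static_pods
```

Piping the result to wc -l gives the same count as before.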

Hope it was useful. Cheers!

Saturday, February 4, 2023

vSphere with Tanzu using NSX-T - Part23 - Supervisor cluster certificates expiry

Note that the supervisor control plane component certificates will expire after one year.

Here is the VMware KB: https://kb.vmware.com/s/article/89324

NOTE: If certificates expire on the Supervisor or Guest Clusters, access to and management of the clusters will fail, and you will need to raise a case with the VMware support team for assistance.

Keep a note of this cert expiry date. If you update the supervisor cluster at least once a year, these certs will get updated.

Here is a quick way to check the expiry of the supervisor control plane certs.

❯ k config current-context
sc2-06-d5165f-vc01
❯
❯ k cluster-info
Kubernetes control plane is running at https://10.43.69.117:6443
KubeDNS is running at https://10.43.69.117:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
❯
❯ echo | openssl s_client -servername 10.43.69.117 -connect  10.43.69.117:6443 | openssl x509 -noout -dates
depth=0 CN = kube-apiserver
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = kube-apiserver
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
notBefore=Jun  2 09:36:17 2023 GMT
notAfter=Jun  1 09:36:18 2024 GMT
❯

Thanks to my friend Ravikrithik Udainath for the above openssl tip!

I am using the admin kubeconfig of the supervisor cluster; I covered exporting the WCP admin kubeconfig file in a previous article. In this case, 10.43.69.117 is the floating IP of the supervisor control plane, and it is assigned to one of the supervisor control plane VMs.

This vSphere with Tanzu cluster was deployed on June 02, 2023, and as you can see above, the certificate expiry will be after one year, which in this case is June 01, 2024.

You can set up monitoring/alerting for all your supervisor clusters to get notified before these expiry dates.
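
One possible building block for such a check is to turn the notAfter value into a number of days remaining. The sketch below is an illustration, not a finished monitoring setup: the function name days_until_expiry and the 30-day threshold are made up, and it assumes GNU date (date -d), so it will not work as-is with the BSD date that ships on macOS.

```shell
# Hypothetical helper: whole days between "now" and an openssl notAfter value.
# $1: date string as printed by openssl, e.g. "Jun  1 09:36:18 2024 GMT"
# $2: optional current time as epoch seconds (defaults to the real clock)
# Assumes GNU date (date -d); BSD/macOS date needs different flags.
days_until_expiry() {
  expiry_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example against a live endpoint (IP taken from the output above):
# not_after="$(echo | openssl s_client -connect 10.43.69.117:6443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)"
# [ "$(days_until_expiry "$not_after")" -lt 30 ] && echo "renew soon"
```

If date parsing is a concern, openssl x509 -checkend with a number of seconds gives a yes/no answer through its exit status without any date arithmetic.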

Hope it was useful. Cheers!

Saturday, January 7, 2023

vSphere with Tanzu using NSX-T - Part22 - Working with NGINX Ingress Controller

In this article we will go through the steps to deploy an NGINX ingress controller on a Tanzu Kubernetes cluster (TKC) and create a simple ingress resource to test its basic functionality.

❯ gcc kg no
NAME                                 STATUS   ROLES                  AGE   VERSION
tkc-control-plane-5m9hd              Ready    control-plane,master   36d   v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-76f2t   Ready    <none>                 36d   v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-mtqh7   Ready    <none>                 36d   v1.23.8+vmware.3
tkc-workers-6d8wc-5669d8bc79-xh2gz   Ready    <none>                 36d   v1.23.8+vmware.3

❯ gcc k apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.7.0/deploy/static/provider/cloud/deploy.yaml --namespace=ingress-nginx
namespace/ingress-nginx created
serviceaccount/ingress-nginx created
serviceaccount/ingress-nginx-admission created
role.rbac.authorization.k8s.io/ingress-nginx created
role.rbac.authorization.k8s.io/ingress-nginx-admission created
clusterrole.rbac.authorization.k8s.io/ingress-nginx created
clusterrole.rbac.authorization.k8s.io/ingress-nginx-admission created
rolebinding.rbac.authorization.k8s.io/ingress-nginx created
rolebinding.rbac.authorization.k8s.io/ingress-nginx-admission created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx-admission created
configmap/ingress-nginx-controller created
service/ingress-nginx-controller created
service/ingress-nginx-controller-admission created
deployment.apps/ingress-nginx-controller created
job.batch/ingress-nginx-admission-create created
job.batch/ingress-nginx-admission-patch created
ingressclass.networking.k8s.io/nginx created
validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission created

❯ gcc kg ns
NAME                           STATUS   AGE
default                        Active   57d
external-dns                   Active   57d
ingress-nginx                  Active   17s
kube-node-lease                Active   57d
kube-public                    Active   57d
kube-system                    Active   57d
vmware-system-auth             Active   57d
vmware-system-cloud-provider   Active   57d
vmware-system-csi              Active   57d
❯ 
❯ gcc kg deployment,po,svc,ep -n ingress-nginx
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ingress-nginx-controller   1/1     1            1           21h

NAME                                           READY   STATUS      RESTARTS   AGE
pod/ingress-nginx-admission-create-h4sbz       0/1     Completed   0          21h
pod/ingress-nginx-admission-patch-bw2fr        0/1     Completed   0          21h
pod/ingress-nginx-controller-5795977b8-nfrb8   1/1     Running     0          21h

NAME                                         TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
service/ingress-nginx-controller             LoadBalancer   10.96.114.127   10.186.124.41   80:30061/TCP,443:31417/TCP   21h
service/ingress-nginx-controller-admission   ClusterIP      10.98.183.189   <none>          443/TCP                      21h

NAME                                           ENDPOINTS                        AGE
endpoints/ingress-nginx-controller             192.168.7.8:443,192.168.7.8:80   21h
endpoints/ingress-nginx-controller-admission   192.168.7.8:8443                 21h

Now the nginx ingress controller is deployed. You can also see the service/ingress-nginx-controller has already got an external IP from NSX-T.

Note: gcc is an alias which points to my TKC kubeconfig file.

❯ alias gcc
gcc='KUBECONFIG=gckubeconfig'
❯

Let's create a sample deployment and expose it as a service under namespace ingress-nginx.

❯ gcc kubectl create deployment web --image=gcr.io/google-samples/hello-app:1.0 -n ingress-nginx
deployment.apps/web created
❯ gcc kubectl expose deployment web --type=NodePort --port=8080 -n ingress-nginx
service/web exposed
❯
❯ gcc k get deployments.apps web -n ingress-nginx
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web    1/1     1            1           28s
❯ gcc k get svc web -n ingress-nginx
NAME   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
web    NodePort   10.105.243.33   <none>        8080:30750/TCP   28s
❯ gcc k get ep web -n ingress-nginx
NAME   ENDPOINTS          AGE
web    192.168.1.9:8080   39s
❯

Create a pod on the TKC and try to access the svc web from inside the pod. I've already deployed an nginx pod.

❯ gcc k get po nginx
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          96m
❯
❯ gcc k exec -it nginx -- curl 10.105.243.33:8080
Hello, world!
Version: 1.0.0
Hostname: web-746c8679d4-ptmgh
❯

Let's create a second deployment under namespace ingress-nginx.

❯ gcc kubectl create deployment web2 --image=gcr.io/google-samples/hello-app:2.0 -n ingress-nginx
deployment.apps/web2 created
❯
❯ gcc kubectl expose deployment web2 --port=8080 --type=NodePort -n ingress-nginx
service/web2 exposed
❯
❯
❯ gcc k get deployment web2 -n ingress-nginx
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web2   1/1     1            1           56s
❯ gcc k get svc  web2 -n ingress-nginx
NAME   TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
web2   NodePort   10.99.79.19   <none>        8080:31695/TCP   65s
❯ gcc k get ep  web2 -n ingress-nginx
NAME   ENDPOINTS           AGE
web2   192.168.2.13:8080   73s

Verify svc web2.

❯ gcc k exec -it nginx -- curl 10.99.79.19:8080
Hello, world!
Version: 2.0.0
Hostname: web2-5858b4c7c5-tmn8x

Services web and web2 are accessible within the TKC. We've already verified this from the nginx pod that runs within the same TKC.

Now, we will create an ingress resource under namespace ingress-nginx.

❯ cat ing-01.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world-ing
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: hello-world.info
    http:
      paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: web
              port:
                number: 8080
        - path: /v2
          pathType: Prefix
          backend:
            service:
              name: web2
              port:
                number: 8080

❯ gcc k create -f ing-01.yaml -n ingress-nginx
ingress.networking.k8s.io/hello-world-ing created
❯
❯ gcc k get ing -n ingress-nginx
NAME              CLASS    HOSTS              ADDRESS   PORTS   AGE
hello-world-ing   <none>   hello-world.info             80      55s
❯ gcc k get ing -n ingress-nginx
NAME              CLASS    HOSTS              ADDRESS         PORTS   AGE
hello-world-ing   <none>   hello-world.info   10.186.124.41   80      56s

I've created an entry in the /etc/hosts file on my laptop so that hello-world.info resolves to 10.186.124.41, which is the external IP of service/ingress-nginx-controller.

❯ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1	localhost
255.255.255.255	broadcasthost
::1             localhost
# Added by Docker Desktop
# To allow the same kube context to work on the host and the container:
127.0.0.1 kubernetes.docker.internal
10.186.124.41 hello-world.info
# End of section

Now from my laptop when I curl to hello-world.info, the request will be served by web svc, and when I curl to hello-world.info/v2, it will be served by web2 svc.

❯
❯ curl hello-world.info
Hello, world!
Version: 1.0.0
Hostname: web-746c8679d4-ptmgh
❯
❯ curl hello-world.info/v2
Hello, world!
Version: 2.0.0
Hostname: web2-5858b4c7c5-tmn8x
❯
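
If you would rather not edit /etc/hosts, curl can pin a hostname to an IP for a single request through its --resolve option, which takes a host:port:address value. The helper below is only a convenience sketch with a made-up name (resolve_arg); the host and IP are the ones used in this post.

```shell
# Hypothetical helper: build the host:port:address value that curl --resolve expects.
# $1: hostname, $2: IP address, $3: port (defaults to 80)
resolve_arg() {
  printf '%s:%s:%s\n' "$1" "${3:-80}" "$2"
}

# Example, using the host and external IP from this post:
# curl --resolve "$(resolve_arg hello-world.info 10.186.124.41)" http://hello-world.info/v2
```

The request still carries hello-world.info as the Host header, so the ingress rule matches in the same way as with the /etc/hosts entry.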

Hope it was useful. Cheers!

References:

https://kubernetes.io/docs/tasks/access-application-cluster/ingress-minikube/
https://kubernetes.github.io/ingress-nginx/user-guide/basic-usage/

Saturday, December 10, 2022

vSphere with Tanzu using NSX-T - Part21 - Pointers while upgrading the stack

Here are some pointers.

The main workflow while upgrading the entire stack is:

upgrade NSX-T
upgrade vCenter server
upgrade ESXi nodes
upgrade vSAN and vDS (if required)
upgrade WCP

Note: Make sure you follow the compatibility matrix while performing upgrades in a production environment.

NSX-T

While NSX-T is getting upgraded, the host transport nodes will also need to be updated. NSX components will be installed/ updated on the ESXi nodes and may need a reboot also.

In NSX-T: System > Lifecycle Management > Upgrade > Upgrade NSX > Hosts

Plan

-Upgrade order across groups - Serial

-Pause upgrade condition - When an upgrade unit fails to upgrade

Host Groups

-Upgrade order within group - Serial

-Upgrade mode - Maintenance

Notes:

There is an in-place upgrade mode for the host groups. In this mode the NSX component on the ESXi node gets updated first, and then the node is placed in maintenance mode (MM) and rebooted. But we have had cases where the NSX component update on the ESXi node fails, and the workload/TKC VMs running on that ESXi node lose network connectivity. To fix this the ESXi node has to be rebooted, but while trying to put the node in MM, the TKC VMs running on it fail to migrate, so the final option is to force reboot the ESXi node, which affects the workload running on it. To avoid this, the safest option is to choose Maintenance as the upgrade mode, so that the TKC VMs get migrated off the ESXi node before the NSX component is updated. Even if the component update fails, it is safe to reboot the node as it is already in MM.

We have also noticed that in some cases the ESXi nodes in a WCP cluster fail to enter MM. Following are some common causes:

Orphaned or inaccessible VMs are one of the main causes. You can follow this article to find those and clean them up.

I have noticed cases where some VMs (vSphere pods) were present on the ESXi host in a powered-off state. In this case the ESXi host was stuck at 100% in maintenance mode. You will need to check on those VMs and then delete them if required to proceed further.

There are cases where some TKC VMs were present on the ESXi host and they fail to get migrated. Check whether these VMs are still present at the Kubernetes layer. If not, delete those VMs!

Check if there are any TKC nodes still present on the ESXi node. There can be cases where one or more TKC nodes are unable to migrate to other available nodes in the cluster due to resource constraints. Check whether the TKC nodes that are unable to migrate are using a guaranteed vmclass. The solution is to find unused resources/TKCs in the cluster, check with the owner/user, and delete them if they are not being used. Another way is to check with the user and change the guaranteed vmclass to besteffort if possible, or temporarily reduce the number of worker nodes if the user agrees.
Verify whether any pods are running on the respective ESXi worker node: kubectl get pods -A -o wide | grep ESXi-FQDN
Safely drain the node.

Hope it was useful. Cheers!

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article we will look at a TKC that is stuck in the updating phase and has multiple Kubernetes nodes in NotReady state.

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
       5

❯ gcc kg no
NAME                                STATUS                        ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready                         control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-tnhv9              NotReady                      control-plane,master   63d    v1.20.9+vmware.1
gc-control-plane-tqvnk              NotReady                      control-plane,master   50d    v1.20.9+vmware.1
gc-control-plane-wsclb              NotReady                      <none>                 8d     v1.20.9+vmware.1
gc-control-plane-wt6sx              NotReady                      <none>                 30d    v1.20.9+vmware.1
gc-control-plane-zthnq              NotReady                      control-plane,master   49d    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                         <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                         <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                         <none>                 458d   v1.20.9+vmware.1

Note: gcc is an alias that I am using for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Let's verify where etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see etcd pods are running on nodes that are in Ready status. So now we can go ahead and safely drain and delete the nodes that are NotReady.

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----
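
The node selection in the loop above can also be done on the STATUS column directly instead of grepping the whole line. This is a sketch under a few assumptions: the helper name not_ready_nodes is made up, it expects the default kubectl get nodes column layout, and the example spells out KUBECONFIG instead of using my gcc alias.

```shell
# Hypothetical helper: print names of nodes whose STATUS starts with NotReady.
# Reads `kubectl get nodes` output on stdin and skips the header row.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ {print $1}'
}

# Example, mirroring the loop above but stopping if a drain fails:
# KUBECONFIG=gckubeconfig kubectl get nodes | not_ready_nodes | while IFS= read -r node; do
#   KUBECONFIG=gckubeconfig kubectl drain "$node" --ignore-daemonsets || break
#   KUBECONFIG=gckubeconfig kubectl delete node "$node"
# done
```

Breaking out on a failed drain means a node is never deleted while pods on it could not be evicted.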

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready    control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯
❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01    gc       updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

I waited for a few minutes to see whether the reconciliation process would proceed and change the status of the TKC from updating to running. But it was still stuck in the updating phase. So I described the TKC.

Conditions:
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-01-01T19:19:45Z
    Status:                True
    Type:                  AddonsReady
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-07-24T15:53:06Z
    Status:                True
    Type:                  NodePoolsReady
    Last Transition Time:  2022-09-01T09:02:26Z
    Message:               3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
    Status:                True
    Type:                  NodesHealthy

Checked vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

It says something is wrong with CP node gc-control-plane-2rbsb.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see the etcd pod on the first control plane node is not ready and is getting continuously restarted. So let's try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
 gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME                                STATUS                     ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready,SchedulingDisabled   control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                      <none>                 458d   v1.20.9+vmware.1

Now let's delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted
❯
❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯
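
The drain and Machine deletion steps above can be wrapped into one helper with the commands spelled out in full instead of using aliases. This is only a minimal sketch for a lab setup: the function name, the `GC_KUBECONFIG` and `DRY_RUN` variables, and the `gc.kubeconfig` default path are hypothetical, and it assumes the current kubectl context points at the supervisor cluster while the guest cluster is reached through its own kubeconfig file.

```shell
#!/usr/bin/env bash
# Sketch: drain a control plane node in the guest cluster, then delete its
# Machine object in the supervisor namespace so a replacement is provisioned.
# ASSUMPTIONS: GC_KUBECONFIG, DRY_RUN and the gc.kubeconfig default are
# hypothetical names; the current kubectl context targets the supervisor cluster.

replace_control_plane_node() {
  local node="$1" namespace="$2"
  local run=""
  # With DRY_RUN=1 the commands are only printed, not executed.
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"

  # Evict workloads from the node (guest cluster).
  $run kubectl --kubeconfig "${GC_KUBECONFIG:-gc.kubeconfig}" drain "$node" \
    --ignore-daemonsets --delete-emptydir-data || return 1

  # Delete the matching Machine object (supervisor cluster namespace).
  $run kubectl delete machine.cluster.x-k8s.io/"$node" -n "$namespace"
}

# Preview the two commands without touching any cluster.
DRY_RUN=1 replace_control_plane_node gc-control-plane-2rbsb jtimothy-napp01
```

Running it with `DRY_RUN=1` first makes it easy to confirm the node name and namespace before anything is deleted.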

After a few minutes, a new machine and its corresponding node are provisioned, and the TKC changes from the updating phase to the running phase.

❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE          AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running        124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running        123d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc                                                                                             Provisioning   13s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running        458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running        456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running        458d   v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   124d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc        gc-control-plane-dnr66              vsphere://42011228-b156-3338-752a-e7233c9258dd   Running   2m2s   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS     ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready      control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              NotReady   control-plane,master   35s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready      <none>                 458d   v1.20.9+vmware.1


❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              Ready    control-plane,master   53s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01     gc     running      2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3
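
Instead of re-running the node listing by hand, the readiness check could be scripted. The sketch below is an illustration, not part of the original procedure: `all_nodes_ready` is a hypothetical helper that reads node rows (as printed by `kubectl get nodes --no-headers`) on stdin and succeeds only when every node reports exactly `Ready`. The sample rows are copied from the output above.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if every node row on stdin has STATUS exactly "Ready".
# ASSUMPTION: all_nodes_ready is a hypothetical helper name; input is expected
# in the column layout of `kubectl get nodes --no-headers`.

all_nodes_ready() {
  awk '
    { total++ }
    $2 != "Ready" { bad++ }   # catches NotReady and Ready,SchedulingDisabled
    END {
      if (total > 0 && bad == 0) exit 0
      exit 1
    }
  '
}

# Sample rows taken from the listing above, while the new node was still joining.
if printf '%s\n' \
  'gc-control-plane-5zjn4   Ready      control-plane,master   124d   v1.20.9+vmware.1' \
  'gc-control-plane-dnr66   NotReady   control-plane,master   35s    v1.20.9+vmware.1' \
  | all_nodes_ready; then
  echo "all nodes Ready"
else
  echo "some nodes are not Ready yet"
fi
```

Against a live guest cluster, the same function could be fed from `kubectl get nodes --no-headers` inside a polling loop with a sleep between attempts.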

Hope it was useful. Cheers!
