In the previous posts we discussed the following:
- Part1 - Prerequisites
- Part2 - Configure NSX
- Part3 - Edge Cluster
- Part4 - Tier-0 Gateway and BGP peering
- Part5 - Tier-1 Gateway and Segments
- Part6 - Create tags, storage policy, and content library
- Part7 - Enable workload management
- Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
- Part9 - Monitoring
- Part10 - Upgrade Tanzu Kubernetes Cluster
In this article, we will go through some basic kubectl commands that can help you troubleshoot Tanzu Kubernetes clusters (TKCs). I have noticed cases where guest TKCs get stuck in the creating or updating phase.
List all TKCs that are stuck in the creating/updating phase:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating
On newer versions of WCP, the kubectl output may not show the TKC phase (creating/updating/running), so I use the following custom alias:
alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'
You can add it to your ~/.zshrc file and relaunch the terminal. Example usage:
% kgtkc | grep updating
c1nsxtest1-sla gc updating 2021-01-21T08:23:37Z v1.19.7+vmware.1-tkg.2.f52f85a 3 3
w2cei-sep20 gc updating 2021-09-16T17:48:07Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4
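Because grep matches anywhere on the line, a namespace or cluster whose name happens to contain "updating" would also show up. Here is a minimal sketch of a stricter filter, assuming the column order produced by the kgtkc alias above (PHASE is the third column). filter_phase is a hypothetical helper name, not part of kubectl.

```shell
# Keep the header row (if present) and rows whose PHASE column matches $1.
# Assumes kgtkc column order: NAMESPACE NAME PHASE CREATIONTIME VERSION CP WORKER
filter_phase() {
  awk -v phase="$1" '
    NR == 1 && $1 == "NAMESPACE" { print; next }  # pass the header through
    $3 == phase                                   # exact match on the PHASE column
  '
}
```

Example usage: kgtkc | filter_phase updating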
For TKCs stuck in the creating phase, the most common causes are insufficient resources to provision the nodes or nodes waiting for IP allocation, among others. For TKCs stuck in the updating phase, possible causes include reconciliation issues, newly provisioned nodes waiting for an IP address, old nodes stuck draining, nodes in the NotReady state, or the required OVA version not being available in the content library. You can try the following kubectl commands to get more insight into what's happening:
See events in a namespace:
kubectl get events -n <namespace>
See all events:
kubectl get events -A
Watch events in a namespace:
kubectl get events -n <namespace> -w
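On a busy namespace the event list gets long. A small sketch that narrows it down to Warning events in chronological order follows. warning_events is a hypothetical helper name; --field-selector and --sort-by are standard kubectl flags.

```shell
# Show only Warning events in the namespace given as $1, oldest first.
# warning_events is a hypothetical helper, not a kubectl subcommand.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by=.lastTimestamp
}
```

Example usage: warning_events vineetha-test05-deploy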
List the Cluster API resources supporting the clusters in a namespace:
kubectl get cluster-api -n <namespace>
Describe TKC:
kubectl describe tkc <tkc_name> -n <namespace>
List TKC virtual machines in a namespace:
kubectl get vm -n <namespace>
List TKC virtual machines in a namespace with their IPs:
kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
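Since nodes waiting for an IP address are one of the common causes mentioned earlier, it can help to list only the virtual machines that have no IP yet. This is a sketch under the assumption that the IP is reported in .status.vmIp, as in the jq command above. vms_without_ip is a hypothetical helper name.

```shell
# Print the names of VMs in namespace $1 whose .status.vmIp is empty.
# vms_without_ip is a hypothetical helper; it needs only kubectl and awk.
vms_without_ip() {
  kubectl get vm -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.vmIp}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}
```

Example usage: vms_without_ip vineetha-test05-deploy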
List all nodes of a cluster:
kubectl get nodes -o wide
List all pods that are not running:
kubectl get pods -A | grep -vi running
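Note that grep -vi running still lists Completed pods and hides pods that are Running but not fully ready (for example READY 0/1). Here is a sketch of a stricter filter, assuming the default column order of kubectl get pods -A (NAMESPACE NAME READY STATUS ...). unhealthy_pods is a hypothetical helper name.

```shell
# Read "kubectl get pods -A" output on stdin and print the header plus any
# pod that is neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    $4 == "Completed" { next }         # finished jobs are not a problem
    { split($3, ready, "/") }          # READY column looks like "1/1"
    $4 != "Running" || ready[1] != ready[2]
  '
}
```

Example usage: kubectl get pods -A | unhealthy_pods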
List health status of different cluster components:
kubectl get --raw '/healthz?verbose'
% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
List all CRDs installed in your cluster and their API versions:
kubectl api-resources -o wide --sort-by="name"
List available Tanzu Kubernetes releases:
kubectl get tanzukubernetesreleases
List available virtual machine images:
kubectl get virtualmachineimages
List terminating namespaces:
kubectl get ns --field-selector status.phase=Terminating
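A namespace often stays in Terminating because something inside it still exists. The following sketch lists whatever is left by walking every namespaced resource type that supports list. remaining_in_namespace is a hypothetical helper name, and the loop can take a while on clusters with many CRDs.

```shell
# List every remaining object in namespace $1, one resource type at a time.
# remaining_in_namespace is a hypothetical helper, not a kubectl subcommand.
remaining_in_namespace() {
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      # --ignore-not-found keeps the output quiet for empty resource types
      kubectl get "$kind" -n "$1" --show-kind --ignore-not-found
    done
}
```

Example usage: remaining_in_namespace <namespace>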
You can SSH to the Tanzu Kubernetes cluster nodes as the system user by following this guide:
https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html
Here is an example where I have a TKC in the namespace vineetha-test05-deploy:
% kubectl get tkc -n vineetha-test05-deploy
NAME   CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.20.9---vmware.1-tkg.1.a4cee5b   4d5h   True    True             [1.21.2+vmware.1-tkg.1.ee25d55]
% kubectl get vm -n vineetha-test05-deploy -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
[
{
"namespace": "vineetha-test05-deploy",
"name": "gc-control-plane-ttkmt",
"internalIP": "172.29.4.194"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-d286z",
"internalIP": "172.29.4.195"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-hwr8b",
"internalIP": "172.29.4.197"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-r99x7",
"internalIP": "172.29.4.196"
}
]
Next, create a jumpbox pod in the same namespace using the following YAML. Replace the namespace and the SSH secret name (here gc-ssh, for the cluster named gc) with your own values:
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: vineetha-test05-deploy #REPLACE
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
    - mountPath: "/root/ssh"
      name: ssh-key
      readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
  - name: ssh-key
    secret:
      secretName: gc-ssh #REPLACE
Once you apply the above YAML, you can see the jumpbox pod:
% kubectl get pod -n vineetha-test05-deploy
NAME READY STATUS RESTARTS AGE
jumpbox 1/1 Running 0 22m
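Instead of polling kubectl get pod by hand, you can block until the pod reports Ready. This is a sketch; wait_for_jumpbox is a hypothetical helper name and the 300s timeout is an arbitrary assumption. Keep in mind that the pod has no readiness probe, so Ready only means the container has started; the yum install in its args may still need a little time before ssh is usable.

```shell
# Block until the jumpbox pod in namespace $1 is Ready (or the timeout hits).
# wait_for_jumpbox is a hypothetical helper; the timeout value is an assumption.
wait_for_jumpbox() {
  kubectl wait --for=condition=Ready pod/jumpbox -n "$1" --timeout=300s
}
```

Example usage: wait_for_jumpbox vineetha-test05-deploy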
Now, you can connect to the TKC node with its internal IP.
% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
16:50:34 up 4 days, 5:49, 0 users, load average: 2.14, 0.97, 0.65
26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt
You can check the status of the control plane containers using crictl ps:
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
bde228417c55a 9000c334d9197 4 days ago Running guest-cluster-auth-service 0 d7abf3db8670d
bc4b8c1bf0e33 a294c1cf07bd6 4 days ago Running metrics-server 0 2665876cf939e
46a94dcf02f3e 92cb72974660c 4 days ago Running coredns 0 7497cdf3269ab
f7d32016d6fb7 f48f23686df21 4 days ago Running csi-resizer 0 b887d394d4f80
ef80f62f3ed65 2cba51b244f27 4 days ago Running csi-provisioner 0 b887d394d4f80
64b570add2859 4d2e937854849 4 days ago Running liveness-probe 0 b887d394d4f80
c0c1db3aac161 d032188289eb5 4 days ago Running vsphere-syncer 0 b887d394d4f80
e4df023ada129 e75228f70c0d6 4 days ago Running vsphere-csi-controller 0 b887d394d4f80
e79b3cfdb4143 8a857a48ee57f 4 days ago Running csi-attacher 0 b887d394d4f80
96e4af8792cd0 b8bffc9e5af52 4 days ago Running calico-kube-controllers 0 b5e467a43b34a
23791d5648ebb 92cb72974660c 4 days ago Running coredns 0 9bde50bbfb914
0f47d11dc211b ab1e2f4eb3589 4 days ago Running guest-cluster-cloud-provider 0 fde68175c5d95
5ddfd46647e80 4d2e937854849 4 days ago Running liveness-probe 0 1a88f26173762
578ddeeef5bdd e75228f70c0d6 4 days ago Running vsphere-csi-node 0 1a88f26173762
3fcb8a287ea48 9a3d9174ac1e7 4 days ago Running node-driver-registrar 0 1a88f26173762
91b490c14d085 dc02a60cdbe40 4 days ago Running calico-node 0 35cf458eb80f8
68dbbdb779484 f7ad2965f3ac0 4 days ago Running kube-proxy 0 79f129c96e6e1
ef423f4aeb128 75bfe47a404bb 4 days ago Running docker-registry 0 752724fbbcd6a
26dd8e1f521f5 9358496e81774 4 days ago Running kube-apiserver 0 814e5d2be5eab
62745db4234e2 ab8fb8e444396 4 days ago Running kube-controller-manager 0 94543f93f7563
f2fc30c2854bd 9aa6da547b7eb 4 days ago Running etcd 0 f0a756a4cdc09
b8038e9f90e15 212d4c357a28e 4 days ago Running kube-scheduler 0 533a44c70e86c
You can check the status of kubelet and containerd services:
sudo systemctl status kubelet.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
Docs: http://kubernetes.io/docs/
Main PID: 2234 (kubelet)
Tasks: 16 (limit: 4728)
Memory: 88.6M
CGroup: /system.slice/kubelet.service
└─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785 >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045 >
sudo systemctl status containerd.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
Docs: https://containerd.io
Main PID: 1783 (containerd)
Tasks: 386 (limit: 4728)
Memory: 639.3M
CGroup: /system.slice/containerd.service
├─ 1783 /usr/local/bin/containerd
├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
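If you want to run the same check on every node without logging in to each one, the jumpbox can be used non-interactively. The sketch below assumes the jumpbox pod from earlier is running in the namespace and that node IPs are reported in .status.vmIp. check_node_services is a hypothetical helper name, and disabling StrictHostKeyChecking is a convenience assumption for a lab setup, not a recommendation.

```shell
# For each TKC VM in namespace $1, ssh via the jumpbox pod and report
# whether kubelet and containerd are active.
# check_node_services is a hypothetical helper built on the commands above.
check_node_services() {
  for ip in $(kubectl get vm -n "$1" \
      -o jsonpath='{range .items[*]}{.status.vmIp}{"\n"}{end}'); do
    echo "== $ip"
    # systemctl is-active prints one state line per unit
    kubectl -n "$1" exec jumpbox -- /usr/bin/ssh -o StrictHostKeyChecking=no \
      "vmware-system-user@$ip" sudo systemctl is-active kubelet containerd
  done
}
```

Example usage: check_node_services vineetha-test05-deploy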
If you have issues related to the provisioning/deployment of a TKC, you can check the logs under /var/log on the control plane (CP) node:
vmware-system-user@gc-control-plane-ttkmt [ /var/log ]$ ls
audit devicelist sa vmware-vgauthsvc.log.0
auth.log journal sgidlist vmware-vmsvc-root.log
btmp kubernetes stigreport.log vmware-vmtoolsd-root.log
cloud-init.log lastlog suidlist wtmp
cloud-init-output.log pods tallylog
containers private vmware-imc
cron rpmcheck vmware-network.log
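For provisioning problems, the cloud-init logs in this listing are usually a reasonable first place to look. Here is a small sketch for pulling the most recent error-like lines out of a log file. recent_errors is a hypothetical helper name and the 'error|fail' pattern is only a rough assumption about what is worth reading; some of these files may need a root shell to open.

```shell
# Print the last N (default 20) lines of file $1 that mention an error or
# failure, prefixed with their line numbers.
# recent_errors is a hypothetical helper; the pattern is a rough heuristic.
recent_errors() {
  grep -inE 'error|fail' "$1" | tail -n "${2:-20}"
}
```

Example usage: recent_errors /var/log/cloud-init-output.log 10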
The following VMware blog series (with videos) is a great resource. It covers the different resources involved in the deployment process and how to troubleshoot TKCs provisioned by the TKG service running on the supervisor cluster:
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-1
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-2
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-3
Hope it was useful. Cheers!