Each node is of size best-effort-2xlarge which has 8 vCPU and 64Gi of memory.
❯ KUBECONFIG=gckubeconfig k get node
NAME STATUS ROLES AGE VERSION
tkc01-control-plane-49jx4 Ready control-plane,master 97d v1.23.8+vmware.3
tkc01-control-plane-m8wmt Ready control-plane,master 105d v1.23.8+vmware.3
tkc01-control-plane-z6gxx Ready control-plane,master 97d v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-8gjn8 Ready <none> 21d v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-c9nfq Ready <none> 21d v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-cngff Ready <none> 21d v1.23.8+vmware.3
❯
I've attached 256Gi storage volumes to the worker nodes, mounted at /var/lib/containerd. The worker nodes on which these LLM pods run should have enough storage space; otherwise you may notice the pods getting stuck, restarting, or reporting an unknown status. If the worker nodes run out of disk space, you will see pods getting evicted with warnings like "The node was low on resource: ephemeral-storage". The TKC spec is available in the above-mentioned Git repo.
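For reference, the worker node class and the containerd volume are declared in the TKC spec itself. A minimal sketch of the relevant nodePools section is shown below (v1alpha2-style excerpt; the storage class name here is illustrative — refer to the actual spec in the Git repo):
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkc01
spec:
  topology:
    nodePools:
    - name: worker-nodepool-a1
      replicas: 3
      vmClass: best-effort-2xlarge               # 8 vCPU, 64Gi memory
      storageClass: vsan-default-storage-policy  # illustrative; use your storage policy
      volumes:
      - name: containerd                         # dedicated disk for container images/layers
        mountPath: /var/lib/containerd
        capacity:
          storage: 256Gi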
This works on a CPU-powered Kubernetes cluster. Additional configuration might be required if you want to run this on a GPU-powered cluster.
We have already instrumented the Readiness and Liveness functionality in the LLM app itself.
The readiness probe invokes the /healthz endpoint exposed by the FastAPI app. This makes sure the FastAPI app itself is healthy and responding to API calls.
The liveness probe invokes the liveness.py script within the app. The script invokes the /ask endpoint, which interacts with the LLM and returns a response. This makes sure the LLM is responding to user queries. If for some reason the LLM is not responding or hangs, the liveness probe will fail and the container will eventually be restarted.
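For reference, the probe section of the container spec might look roughly like the sketch below. The port, timings, and the exact liveness command are assumptions here; the actual values are in fastapi-llm-app-deploy-cpu.yaml in the Git repo.
readinessProbe:
  httpGet:
    path: /healthz                 # FastAPI health endpoint
    port: 5000                     # assumed container port
  initialDelaySeconds: 60
  periodSeconds: 30
livenessProbe:
  exec:
    command: ["python", "liveness.py"]   # script that calls the /ask endpoint
  initialDelaySeconds: 300         # model load/response can take a while
  periodSeconds: 300
  timeoutSeconds: 120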
You can apply the deployment yaml spec as follows:
❯ KUBECONFIG=gckubeconfig k apply -f fastapi-llm-app-deploy-cpu.yaml
Validation
❯ KUBECONFIG=gckubeconfig k get deploy fastapi-llm-app
NAME READY UP-TO-DATE AVAILABLE AGE
fastapi-llm-app 2/2 2 2 21d
❯
❯ KUBECONFIG=gckubeconfig k get pods | grep fastapi-llm-app
fastapi-llm-app-758c7c58f7-79gmq 1/1 Running 1 (71m ago) 13d
fastapi-llm-app-758c7c58f7-gqdc6 1/1 Running 1 (99m ago) 13d
❯
❯ KUBECONFIG=gckubeconfig k get svc fastapi-llm-app
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fastapi-llm-app LoadBalancer 10.110.228.33 10.216.24.104 5000:30590/TCP 5h24m
❯
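For reference, the Service portion of the manifest that produces the above output would look roughly like this sketch (the selector label is an assumption and must match the pod labels in the deployment):
apiVersion: v1
kind: Service
metadata:
  name: fastapi-llm-app
spec:
  type: LoadBalancer              # gets an EXTERNAL-IP from the load balancer
  selector:
    app: fastapi-llm-app          # assumed label; must match the deployment's pod template
  ports:
  - protocol: TCP
    port: 5000
    targetPort: 5000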
Now you can simply curl against the EXTERNAL-IP of the above-mentioned fastapi-llm-app service.
❯ curl http://10.216.24.104:5000/ask -X POST -H "Content-Type: application/json" -d '{"text":"list comprehension examples in python"}'
In our next blog post, we'll try enhancing our FastAPI application with robust instrumentation. Specifically, we'll explore the process of integrating FastAPI metrics into our application, allowing us to gain valuable insights into its performance and usage metrics. Furthermore, we'll take a look at incorporating traces using OpenTelemetry, a powerful tool for distributed tracing and observability in modern applications. By leveraging OpenTelemetry, we'll be able to gain comprehensive visibility into the behavior of our application across distributed systems, enabling us to identify performance bottlenecks and optimize resource utilization.
Stay tuned for an insightful exploration of FastAPI metrics instrumentation and OpenTelemetry integration in our upcoming blog post!
def get_all_pods(v1):
    # v1 is a kubernetes.client.CoreV1Api() instance; list pods across all namespaces with their phase.
    print("Listing all pods:")
    ret = v1.list_pod_for_all_namespaces(watch=False)
    for i in ret.items:
        print(i.metadata.namespace, i.metadata.name, i.status.phase)

def get_namespaced_pods(v1, ns):
    # List pods in the given namespace with their phase.
    print(f"Listing all pods under namespace {ns}:")
    ret = v1.list_namespaced_pod(ns)
    for i in ret.items:
        print(i.metadata.namespace, i.metadata.name, i.status.phase)
When describing the pod, you can see the message "Unable to find backing for logical switch".
❯ kd po gatekeeper-controller-manager-5ccbc7fd79-5gn2n -n svc-opa-gatekeeper-domain-c61
Name:                 gatekeeper-controller-manager-5ccbc7fd79-5gn2n
Namespace:            svc-opa-gatekeeper-domain-c61
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 esx-1.sddc-35-82-xxxxx.xxxxxxx.com/
Labels:               control-plane=controller-manager
                      gatekeeper.sh/operation=webhook
                      gatekeeper.sh/system=yes
                      pod-template-hash=5ccbc7fd79
Annotations:          attachment_id: 668b681b-fef6-43e5-8009-5ac8deb6da11
                      kubernetes.io/psp: wcp-default-psp
                      mac: 04:50:56:00:08:1e
                      vlan: None
                      vmware-system-ephemeral-disk-uuid: 6000C297-d1ba-ce8c-97ba-683a3c8f5321
                      vmware-system-image-references: {"manager":"gatekeeper-111fd0f684141bdad12c811b4f954ae3d60a6c27-v52049"}
                      vmware-system-vm-moid: vm-89777:750f38c6-3b0e-41b7-a94f-4d4aef08e19b
                      vmware-system-vm-uuid: 500c9c37-7055-1708-92d4-8ffdf932c8f9
Status:               Failed
Reason:               ProviderFailed
Message:              Unable to find backing for logical switch 03f0dcd4-a5d9-431e-ae9e-d796ddca0131: timed out waiting for the condition
                      Unable to find backing for logical switch: 03f0dcd4-a5d9-431e-ae9e-d796ddca0131
IP:
IPs:                  <none>
A workaround for this is to restart the spherelet service on the ESXi host where you see this issue. If multiple ESXi nodes have the same issue, you could consider restarting the spherelet service on all ESXi worker nodes. In a production setup you may want to place the ESXi host in maintenance mode before restarting the spherelet service. In our case, we usually restart the spherelet service directly without placing the ESXi host in maintenance mode. Following is the PowerCLI way to check the spherelet service on the ESXi worker nodes and restart it (stopping and then starting the service, which triggers the confirmation prompt shown below):
> Connect-VIServer wdc-10-vc21
> Get-VMHost | Get-VMHostService | where {$_.Key -eq "spherelet"} | select VMHost,Key,Running | ft
Perform operation?
Perform operation Stop host service. on spherelet?
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y
After restarting the spherelet service, new pods will come up fine and reach Running status. But you may need to clean up the pods left in ProviderFailed status using kubectl:
kubectl get pods -A | grep ProviderFailed | awk '{print $2 " --namespace=" $1}' | xargs kubectl delete pod
In this article, we will take a look at troubleshooting some of the content library related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.
Case 1:
TKC (guest K8s cluster) deployments were failing because the VMs were not getting deployed. You can see a "Failed to deploy OVF package" error in the VC UI. This was due to the error "A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com" while syncing the content library.
Following is a sample log for this issue from the vmop-controller-manager:
Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error
This can be resolved by just editing the content library settings and accepting the new certificate thumbprint.
Case 2:
Missing TKRs. Even though the content library (CL) is present in the VC and has all the required OVF templates, the TKR resources are missing/not found on the supervisor cluster.
❯ kubectl get tkr
No resources found
This could happen if there are duplicate content libraries present in the VC with the same subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronizing the CL. If this doesn't resolve the issue, try deleting and recreating the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.
You may also check the vmware-system-vmop-controller-manager and capw-controller-manager pod logs, and verify whether those pods are running or getting continuously restarted. If required, you may restart those pods.
Case 3:
TKC deployments failing as VMs were not getting deployed. Sample vmop-controller-manager logs are given below:
E0803 18:51:30.638787 1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.638821 1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.638851 1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.639301 1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"
This could be resolved by restarting the cm-inventory service on all NSX-T Manager nodes. Following are the commands to check and restart the cm-inventory service on an NSX-T Manager node:
get service cm-inventory
restart service cm-inventory
Case 4:
Sometimes in the WCP K8s layer you will notice stale contentsources object entries. Contentsources are the K8s-layer counterparts of content libraries. For various reasons you might have created multiple content libraries and deleted some of them from the vCenter later, but they may not get removed properly from the WCP K8s layer, and that is how these stale contentsources objects end up there. You can use PowerCLI to list the content libraries currently present in the VC, compare them with the contentsources, and remove the stale entries.
> Get-ContentLibrary | select Name,Id | fl
Name : wdc-01-vc18c01-wcp
Id : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d
> kg contentsources
NAME AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db 173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d 321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662 242d
75e0668c-0cdc-421e-965d-fd736187cc57 173d
818c8700-efa4-416b-b78f-5f22e9555952 173d
9abbd108-aeb3-4b50-b074-9e6c00473b02 173d
a6cd1685-49bf-455f-a316-65bcdefac7cf 173d
acff9a91-0966-4793-9c3a-eb5272b802bd 242d
fcc08a43-1555-4794-a1ae-551753af9c03 173d
In the above sample case you can see multiple contentsource objects, but there is only one content library. So you can delete all the contentsource objects except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d, for example: kubectl delete contentsources 0f00d3fa-de54-4630-bc99-aa13ccbe93db
In this article, we will go through some basic kubectl commands that may help you in troubleshooting Tanzu Kubernetes clusters. I have noticed cases where guest TKCs get stuck in the creating or updating phase.
List all TKCs that are stuck at creating/updating:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating
On the newer versions of WCP, you may not see the TKC phase (creating/ updating/ running) in the kubectl output. I am using the following custom alias for it.
alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,CREATIONTIME:.metadata.creationTimestamp,VERSION:.spec.distribution.fullVersion,CP:.spec.topology.controlPlane.replicas,WORKER:.status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'
You can add it to your ~/.zshrc file and relaunch the terminal. Then running kgtkc lists all TKCs across namespaces with their phase, version, and replica counts, sorted by creation time.
For TKCs that are stuck in the creating phase, some of the most common reasons are a lack of sufficient resources to provision the nodes, waiting for IP allocation, etc. For TKCs that are stuck in the updating phase, it may be due to reconciliation issues: newly provisioned nodes might be waiting for an IP address, old nodes may be stuck at the drain phase, nodes might be in NotReady state, the specific OVA version may not be available in the content library, etc. You can try the following kubectl commands to get more insight into what is happening:
See events in a namespace: kubectl get events -n <namespace>
See all events: kubectl get events -A
Watch events in a namespace: kubectl get events -n <namespace> -w
List the Cluster API resources supporting the clusters in the current namespace: kubectl get cluster-api -n <namespace>
List TKC virtual machines in a namespace: kubectl get vm -n <namespace>
List TKC virtual machines in a namespace with their IPs: kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
List all nodes of a cluster: kubectl get nodes -o wide
List all pods that are not running: kubectl get pods -A | grep -vi running
List health status of different cluster components: kubectl get --raw '/healthz?verbose'
% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
List all CRDs installed in your cluster and their API versions: kubectl api-resources -o wide --sort-by="name"
List available Tanzu Kubernetes releases: kubectl get tanzukubernetesreleases
List available virtual machine images: kubectl get virtualmachineimages
List terminating namespaces: kubectl get ns --field-selector status.phase=Terminating
Here is an example where I have a TKC under the namespace vineetha-test05-deploy.
% kubectl get tkc -n vineetha-test05-deploy
NAME   CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.20.9---vmware.1-tkg.1.a4cee5b   4d5h   True    True             [1.21.2+vmware.1-tkg.1.ee25d55]
Given below is a yaml spec that deploys a pod named jumpbox under the supervisor namespace vineetha-test05-deploy; from there you can ssh to the TKC nodes.
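This is a representative spec following the approach documented by VMware. The TKC here is named gc, so its SSH private key lives in the gc-ssh secret in the supervisor namespace; adjust the namespace and secret name for your environment.
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: vineetha-test05-deploy          # supervisor namespace of the TKC
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
  volumes:
    - name: ssh-key
      secret:
        secretName: gc-ssh                   # <cluster-name>-ssh secret created for the TKC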
Once you apply the above yaml, you can see the jumpbox pod.
% kubectl get pod -n vineetha-test05-deploy
NAME      READY   STATUS    RESTARTS   AGE
jumpbox   1/1     Running   0          22m
Now, you can connect to the TKC node with its internal IP.
% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
 16:50:34 up 4 days,  5:49,  0 users,  load average: 2.14, 0.97, 0.65
26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt
You can check the status of control plane pods using crictl ps.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER       IMAGE           CREATED      STATE     NAME                           ATTEMPT   POD ID
bde228417c55a   9000c334d9197   4 days ago   Running   guest-cluster-auth-service     0         d7abf3db8670d
bc4b8c1bf0e33   a294c1cf07bd6   4 days ago   Running   metrics-server                 0         2665876cf939e
46a94dcf02f3e   92cb72974660c   4 days ago   Running   coredns                        0         7497cdf3269ab
f7d32016d6fb7   f48f23686df21   4 days ago   Running   csi-resizer                    0         b887d394d4f80
ef80f62f3ed65   2cba51b244f27   4 days ago   Running   csi-provisioner                0         b887d394d4f80
64b570add2859   4d2e937854849   4 days ago   Running   liveness-probe                 0         b887d394d4f80
c0c1db3aac161   d032188289eb5   4 days ago   Running   vsphere-syncer                 0         b887d394d4f80
e4df023ada129   e75228f70c0d6   4 days ago   Running   vsphere-csi-controller         0         b887d394d4f80
e79b3cfdb4143   8a857a48ee57f   4 days ago   Running   csi-attacher                   0         b887d394d4f80
96e4af8792cd0   b8bffc9e5af52   4 days ago   Running   calico-kube-controllers        0         b5e467a43b34a
23791d5648ebb   92cb72974660c   4 days ago   Running   coredns                        0         9bde50bbfb914
0f47d11dc211b   ab1e2f4eb3589   4 days ago   Running   guest-cluster-cloud-provider   0         fde68175c5d95
5ddfd46647e80   4d2e937854849   4 days ago   Running   liveness-probe                 0         1a88f26173762
578ddeeef5bdd   e75228f70c0d6   4 days ago   Running   vsphere-csi-node               0         1a88f26173762
3fcb8a287ea48   9a3d9174ac1e7   4 days ago   Running   node-driver-registrar          0         1a88f26173762
91b490c14d085   dc02a60cdbe40   4 days ago   Running   calico-node                    0         35cf458eb80f8
68dbbdb779484   f7ad2965f3ac0   4 days ago   Running   kube-proxy                     0         79f129c96e6e1
ef423f4aeb128   75bfe47a404bb   4 days ago   Running   docker-registry                0         752724fbbcd6a
26dd8e1f521f5   9358496e81774   4 days ago   Running   kube-apiserver                 0         814e5d2be5eab
62745db4234e2   ab8fb8e444396   4 days ago   Running   kube-controller-manager        0         94543f93f7563
f2fc30c2854bd   9aa6da547b7eb   4 days ago   Running   etcd                           0         f0a756a4cdc09
b8038e9f90e15   212d4c357a28e   4 days ago   Running   kube-scheduler                 0         533a44c70e86c
You can check the status of the kubelet and containerd services:
sudo systemctl status kubelet.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
       Docs: http://kubernetes.io/docs/
   Main PID: 2234 (kubelet)
      Tasks: 16 (limit: 4728)
     Memory: 88.6M
     CGroup: /system.slice/kubelet.service
             └─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785 >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045 >
sudo systemctl status containerd.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status containerd.service
● containerd.service - containerd container runtime
     Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
     Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
       Docs: https://containerd.io
   Main PID: 1783 (containerd)
      Tasks: 386 (limit: 4728)
     Memory: 639.3M
     CGroup: /system.slice/containerd.service
             ├─ 1783 /usr/local/bin/containerd
             ├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
             ├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
If you have issues related to the provisioning/deployment of the TKC, you can check the logs present on the control plane node.
Following is a great VMware blog series (with videos) covering the different resources involved in the deployment process and the troubleshooting aspects of TKCs provisioned using the TKG Service running on the supervisor cluster.