Saturday, September 25, 2021

vSphere with Tanzu using NSX-T - Part11 - Troubleshooting Tanzu Kubernetes Clusters

In the previous posts we discussed the following:

In this article, we will go through some basic kubectl commands that may help you troubleshoot Tanzu Kubernetes clusters. I have noticed cases where guest TKCs get stuck in the creating or updating phase.

List all TKCs that are stuck at creating/ updating:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating

On newer versions of WCP, the TKC phase (creating/ updating/ running) may not appear in the default kubectl output. I am using the following custom alias for it.

alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'

You can add it to your ~/.zshrc file and relaunch the terminal. Example usage:

% kgtkc | grep updating
c1nsxtest1-sla                     gc                            updating   2021-01-21T08:23:37Z   v1.19.7+vmware.1-tkg.2.f52f85a    3     3
w2cei-sep20                       gc                            updating   2021-09-16T17:48:07Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     4

For TKCs stuck in the creating phase, some of the most common reasons are a lack of sufficient resources to provision the nodes, or the cluster may be waiting for IP allocation, etc. For TKCs stuck in the updating phase, it may be due to reconciliation issues: newly provisioned nodes might be waiting for an IP address, old nodes may be stuck at the drain phase, nodes might be in NotReady state, the specific OVA version may not be available in the content library, etc. You can try the following kubectl commands to get more insight into what's happening:

See events in a namespace:
kubectl get events -n <namespace>

See all events:
kubectl get events -A

Watch events in a namespace:
kubectl get events -n <namespace> -w

List the Cluster API resources supporting the clusters in the current namespace:
kubectl get cluster-api -n <namespace>

Describe TKC:
kubectl describe tkc <tkc_name> -n <namespace>

List TKC virtual machines in a namespace:
kubectl get vm -n <namespace>

List TKC virtual machines in a namespace along with their IPs:

kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'

List all nodes of a cluster:
kubectl get nodes -o wide

List all pods that are not running:
kubectl get pods -A | grep -vi running

List health status of different cluster components:
kubectl get --raw '/healthz?verbose'

% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

List all API resources (including CRDs) available in your cluster and their API versions:
kubectl api-resources -o wide --sort-by="name"

List available Tanzu Kubernetes releases:
kubectl get tanzukubernetesreleases

List available virtual machine images:
kubectl get virtualmachineimages

List terminating namespaces:

kubectl get ns --field-selector status.phase=Terminating
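
If a namespace is stuck at Terminating, its status conditions usually indicate which resources or finalizers are blocking the deletion. A quick way to check (namespace name is a placeholder):

kubectl get ns <namespace> -o json | jq '.status.conditions'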

You can SSH to the Tanzu Kubernetes cluster nodes as the vmware-system-user by following this documentation:
https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html
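
The SSH private key is stored as a secret in the supervisor namespace, typically named <tkc_name>-ssh (this is also the secret referenced in the jumpbox manifest further below). You can confirm it exists with (namespace name is a placeholder):

kubectl get secrets -n <namespace> | grep ssh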

Here is an example where I have a TKC under namespace: vineetha-test05-deploy

% kubectl get tkc -n vineetha-test05-deploy
NAME   CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.20.9---vmware.1-tkg.1.a4cee5b   4d5h   True    True             [1.21.2+vmware.1-tkg.1.ee25d55]

% kubectl get vm -n vineetha-test05-deploy -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
[
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-control-plane-ttkmt",
    "internalIP": "172.29.4.194"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-d286z",
    "internalIP": "172.29.4.195"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-hwr8b",
    "internalIP": "172.29.4.197"
  },
  {
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-r99x7",
    "internalIP": "172.29.4.196"
  }
]

 
Given below is the YAML manifest that deploys a pod named jumpbox under the supervisor namespace vineetha-test05-deploy; from there you can SSH to the TKC nodes.

apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: vineetha-test05-deploy           #REPLACE
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
    - name: ssh-key
      secret:
        secretName: gc-ssh     #REPLACE


Once you apply the above YAML, you can see the jumpbox pod.
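
For example, assuming you saved the manifest as jumpbox.yaml:

kubectl apply -f jumpbox.yaml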

% kubectl get pod -n vineetha-test05-deploy                                                                                                              
NAME      READY   STATUS    RESTARTS   AGE
jumpbox   1/1     Running   0          22m

Now, you can connect to the TKC node with its internal IP.

% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194                                             
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
 16:50:34 up 4 days,  5:49,  0 users,  load average: 2.14, 0.97, 0.65

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt

You can check the status of control plane pods using crictl ps.

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
bde228417c55a       9000c334d9197       4 days ago          Running             guest-cluster-auth-service     0                   d7abf3db8670d
bc4b8c1bf0e33       a294c1cf07bd6       4 days ago          Running             metrics-server                 0                   2665876cf939e
46a94dcf02f3e       92cb72974660c       4 days ago          Running             coredns                        0                   7497cdf3269ab
f7d32016d6fb7       f48f23686df21       4 days ago          Running             csi-resizer                    0                   b887d394d4f80
ef80f62f3ed65       2cba51b244f27       4 days ago          Running             csi-provisioner                0                   b887d394d4f80
64b570add2859       4d2e937854849       4 days ago          Running             liveness-probe                 0                   b887d394d4f80
c0c1db3aac161       d032188289eb5       4 days ago          Running             vsphere-syncer                 0                   b887d394d4f80
e4df023ada129       e75228f70c0d6       4 days ago          Running             vsphere-csi-controller         0                   b887d394d4f80
e79b3cfdb4143       8a857a48ee57f       4 days ago          Running             csi-attacher                   0                   b887d394d4f80
96e4af8792cd0       b8bffc9e5af52       4 days ago          Running             calico-kube-controllers        0                   b5e467a43b34a
23791d5648ebb       92cb72974660c       4 days ago          Running             coredns                        0                   9bde50bbfb914
0f47d11dc211b       ab1e2f4eb3589       4 days ago          Running             guest-cluster-cloud-provider   0                   fde68175c5d95
5ddfd46647e80       4d2e937854849       4 days ago          Running             liveness-probe                 0                   1a88f26173762
578ddeeef5bdd       e75228f70c0d6       4 days ago          Running             vsphere-csi-node               0                   1a88f26173762
3fcb8a287ea48       9a3d9174ac1e7       4 days ago          Running             node-driver-registrar          0                   1a88f26173762
91b490c14d085       dc02a60cdbe40       4 days ago          Running             calico-node                    0                   35cf458eb80f8
68dbbdb779484       f7ad2965f3ac0       4 days ago          Running             kube-proxy                     0                   79f129c96e6e1
ef423f4aeb128       75bfe47a404bb       4 days ago          Running             docker-registry                0                   752724fbbcd6a
26dd8e1f521f5       9358496e81774       4 days ago          Running             kube-apiserver                 0                   814e5d2be5eab
62745db4234e2       ab8fb8e444396       4 days ago          Running             kube-controller-manager        0                   94543f93f7563
f2fc30c2854bd       9aa6da547b7eb       4 days ago          Running             etcd                           0                   f0a756a4cdc09
b8038e9f90e15       212d4c357a28e       4 days ago          Running             kube-scheduler                 0                   533a44c70e86c

You can check the status of kubelet and containerd services:
sudo systemctl status kubelet.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
     Docs: http://kubernetes.io/docs/
 Main PID: 2234 (kubelet)
    Tasks: 16 (limit: 4728)
   Memory: 88.6M
   CGroup: /system.slice/kubelet.service
           └─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>

Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785    >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045    >


sudo systemctl status containerd.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
   Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
     Docs: https://containerd.io
 Main PID: 1783 (containerd)
    Tasks: 386 (limit: 4728)
   Memory: 639.3M
   CGroup: /system.slice/containerd.service
           ├─ 1783 /usr/local/bin/containerd
           ├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
           ├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>


If you have issues related to the provisioning/ deployment of a TKC, you can check the logs present on the control plane (CP) node:

vmware-system-user@gc-control-plane-ttkmt [ /var/log ]$ ls
audit                  devicelist  sa                  vmware-vgauthsvc.log.0
auth.log               journal     sgidlist            vmware-vmsvc-root.log
btmp                   kubernetes  stigreport.log      vmware-vmtoolsd-root.log
cloud-init.log         lastlog     suidlist            wtmp
cloud-init-output.log  pods        tallylog
containers             private     vmware-imc
cron                   rpmcheck    vmware-network.log
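
For example, the kubelet and containerd logs can be pulled from the journal, and the node bootstrap logs are available under /var/log (a minimal sketch; which files are relevant depends on the issue):

sudo journalctl -u kubelet --no-pager | tail -n 50
sudo journalctl -u containerd --no-pager | tail -n 50
sudo cat /var/log/cloud-init-output.log
ls /var/log/pods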


Following is a great VMware blog series (with videos) covering the different resources involved in the deployment process and the troubleshooting aspects of TKCs provisioned using the TKG Service running on the supervisor cluster.

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-1


https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-2


https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-3

 



Hope it was useful. Cheers!

Tuesday, August 3, 2021

vSphere with Tanzu using NSX-T - Part10 - Upgrade K8s version of Tanzu Kubernetes cluster

In the previous posts we discussed the following:

Part1 - Prerequisites

Part2 - Configure NSX

Part3 - Edge Cluster

Part4 - Tier-0 Gateway and BGP peering

Part5 - Tier-1 Gateway and Segments

Part6 - Create tags, storage policy, and content library

Part7 - Enable workload management


In this article, I will explain how to upgrade the K8s version of a Tanzu Kubernetes cluster.

Verify the current K8s version of the Tanzu Kubernetes cluster.
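
This can be done from the supervisor cluster context, for example (namespace name is a placeholder):

kubectl get tanzukubernetescluster -n <namespace>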


Check available Tanzu Kubernetes versions.
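
The available releases can be listed with:

kubectl get tanzukubernetesreleases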


Edit the cluster manifest file.


Here we are updating from 1.18.5 to 1.18.15.
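
A minimal sketch of the edit, assuming the cluster is tkg-cluster-02 and the v1alpha1 TKC API (the target version string must match a release available in your content library):

kubectl edit tanzukubernetescluster tkg-cluster-02 -n <namespace>

# in the manifest, update the distribution version, for example:
#  spec:
#    distribution:
#      fullVersion: null
#      version: v1.18.15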


Save the manifest file.


You can see the corresponding cluster starts updating.


The cluster will get updated to the newer version in a rolling fashion. The control plane node gets updated first, followed by the worker nodes one by one. A new node with the new version will be added to the cluster, and an old node will be removed from the cluster.
 


Verify.

As you can see, tkg-cluster-02 is upgraded from 1.18.5 to 1.18.15.


Hope it was useful. Cheers!

References

Wednesday, July 21, 2021

VMware PowerCLI 101 - part9 - Working with NSX-T

Note: I am using the following versions:

PSVersion: 7.1.3
VMware PowerCLI: 12.3.0.17860403

Connect-NsxtServer -Server 192.168.41.8


Get-Module "VMware.VimAutomation.Nsx*" -ListAvailable
Get-Command -Module "VMware.VimAutomation.Nsxt"


Get-NsxtService | measure
Get-NsxtService | more


Get-NsxtService com.vmware.nsx.cluster
$t1 = Get-NsxtService com.vmware.nsx.cluster
$t1 | Get-Member
$t1.get()



$t1 = Get-NsxtService com.vmware.nsx.cluster.status
$t1.get()
$t1.get().mgmt_cluster_status
$t1.get().control_cluster_status


$t1 = Get-NsxtService com.vmware.nsx.capacity.usage
$t1.get().capacity_usage | select usage_type, display_name, current_usage_count, max_supported_count, current_usage_percentage,severity | ft


$t1 = Get-NsxtService com.vmware.nsx.alarms
$t1.list().results | select feature_name, event_type, summary, severity, status | ft


Hope it was useful. Cheers!

References

Sunday, June 27, 2021

vSphere with Tanzu using NSX-T - Part9 - Monitoring

In the previous posts we discussed the following:

Part1 - Prerequisites

Part2 - Configure NSX

Part3 - Edge Cluster

Part4 - Tier-0 Gateway and BGP peering

Part5 - Tier-1 Gateway and Segments

Part6 - Create tags, storage policy, and content library

Part7 - Enable workload management


In this article, I will go through some of the popular tools used for monitoring Kubernetes clusters that provide insight into the different K8s objects, their status, metrics, logs, and so on.

  • Lens
  • Octant
  • Prometheus and Grafana
  • vROps and Kubernetes Management Pack
  • Kubebox


-Lens-

Download the Lens binary file from: https://k8slens.dev/


I am installing it on a Windows server. Once the installation is complete, the first thing you have to do is to provide the Kube config file details so that Lens can connect to the Kubernetes cluster and start monitoring it.

Add Cluster

Click File - Add Cluster


You can either browse and select the Kube config file or you can paste the content of your Kube config file as text. I am just pasting it as text.

 

Once you have pasted your Kube config file contents, make sure to select the context, and then click Add cluster.


Deploy Prometheus stack

If you aren't seeing CPU and memory metrics, you will need to install the Prometheus stack on your K8s cluster. And Lens has a feature that deploys the Prometheus stack on your K8s cluster with the click of a button!

Select the cluster icon and click Settings.


Scroll all the way to the end, and under Features, you will find an Install button. In my case, I've already installed it, which is why it's showing the Uninstall button.


Once you click the Install button, Lens will go ahead and install the Prometheus stack on the selected K8s cluster. After a few minutes, you should be able to see all the metrics.

You can see a namespace called "lens-metrics" and under that, the Prometheus stack components are deployed.


Following are the service objects that are created as part of the Prometheus stack deployment.


And, here is the PVC that is attached to the Prometheus pod.


Terminal access

Click on Terminal to get access directly to the K8s cluster.


Pod metrics, SSH to the pod, and container logs



Scaling
 
Note: In a production environment, it is always a best practice to apply configuration changes to your K8s cluster objects through a version control system.


You can also see the Service Accounts, Roles, Role Bindings, and PSPs under the Access Control tab. For more details see https://docs.k8slens.dev/main/.


-Octant-

https://vineethac.blogspot.com/2020/08/visualize-your-kubernetes-clusters-and.html


-Prometheus and Grafana-
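
One common way to deploy Prometheus and Grafana on a K8s cluster (outside of the Lens-managed stack covered above) is the kube-prometheus-stack Helm chart. A minimal sketch, assuming Helm 3 is installed and using monitoring as a placeholder release/ namespace name:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
kubectl get pods -n monitoring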



-vROps and Kubernetes Management Pack-

https://blogs.vmware.com/management/2020/12/announcing-the-vrealize-operations-management-pack-for-kubernetes-1-5-1.html

https://rudimartinsen.com/2021/03/07/vrops-kubernetes-mgmt-pack/

https://www.brockpeterson.com/post/vrops-management-pack-for-kubernetes


-Kubebox-


curl -Lo kubebox https://github.com/astefanutti/kubebox/releases/download/v0.9.0/kubebox-linux && chmod +x kubebox
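
Once downloaded, you can launch it from the current directory:

./kubebox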


Select namespace


Select Pod

This will show the selected pod metrics and logs.


Note: Kubebox relies on cAdvisor to retrieve the resource usage metrics. It's recommended to use the provided cadvisor.yaml file, which is tested to work with Kubebox.

kubectl apply -f https://raw.github.com/astefanutti/kubebox/master/cadvisor.yaml

Kubebox: https://github.com/astefanutti/kubebox

Hope it was useful. Cheers!

Monday, June 21, 2021

Validate your Kubernetes cluster using Sonobuoy

Sonobuoy is a diagnostic tool that helps validate the state of a Kubernetes cluster by running a set of tests in an accessible, non-destructive manner. By default, Sonobuoy runs the Kubernetes conformance tests. Conformance testing ensures that a cluster is properly configured and that its behavior conforms to the official Kubernetes specifications. It also helps ensure that the cluster supports the minimal required set of features. The conformance tests are a subset of the end-to-end (e2e) tests that should pass on any Kubernetes cluster.

A conformance-passing cluster gives you the guarantee that your Kubernetes setup is properly configured as per best practices. Around 275 tests need to pass to qualify for Kubernetes conformance.

Install Sonobuoy

wget https://github.com/vmware-tanzu/sonobuoy/releases/download/v0.51.0/sonobuoy_0.51.0_linux_amd64.tar.gz
tar -xvf sonobuoy_0.51.0_linux_amd64.tar.gz

Note: I am installing Sonobuoy on CentOS Linux release 7.9.2009 (Core).


Help
/root/sonobuoy --help

Run Sonobuoy
/root/sonobuoy run --wait

Note: The e2e tests take around 60-90 minutes to complete.


Sonobuoy Objects
kubectl get all -n sonobuoy


kubectl get pods -n sonobuoy -o wide


Sonobuoy Status
/root/sonobuoy status
/root/sonobuoy status --json
/root/sonobuoy status --json | jq

Note: If you get "bash: jq: command not found..." while using jq, follow this blog post to install jq.


Inspect Logs
/root/sonobuoy logs

Sonobuoy Results
results=$(/root/sonobuoy retrieve)


/root/sonobuoy results $results
/root/sonobuoy results <tar ball file>



See passed/ failed tests
/root/sonobuoy results <tar ball file> --mode=detailed | jq 'select(.status=="passed")'
/root/sonobuoy results <tar ball file> --mode=detailed | jq 'select(.status=="failed")'


List the conformance tests
/root/sonobuoy results <tar ball file> --mode=detailed | jq 'select(.name | contains("[Conformance]"))'

Cleanup
/root/sonobuoy delete --wait


References

https://github.com/vmware-tanzu/sonobuoy
https://sonobuoy.io/docs/v0.51.0/


Friday, June 11, 2021

Index

Generative AI and LLMs


Kubernetes



vRealize Operations (vROps)



PowerShell