Showing posts with label VMware. Show all posts

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article we will look at a TKC that is stuck in the updating phase and has multiple Kubernetes nodes in the NotReady state.

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
5

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-tnhv9 NotReady control-plane,master 63d v1.20.9+vmware.1
gc-control-plane-tqvnk NotReady control-plane,master 50d v1.20.9+vmware.1
gc-control-plane-wsclb NotReady <none> 8d v1.20.9+vmware.1
gc-control-plane-wt6sx NotReady <none> 30d v1.20.9+vmware.1
gc-control-plane-zthnq NotReady control-plane,master 49d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

Note: gcc is an alias that I use for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file of the TKC under consideration.

Let's verify where the etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

You can see that the etcd pods are running only on nodes that are in Ready status; none of the NotReady nodes hosts an etcd member. So we can go ahead and safely drain and delete the NotReady nodes.
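The check made above amounts to confirming that no etcd pod is scheduled on a node that is about to be deleted. Here is a minimal sketch of that comparison, assuming the column layout of `kubectl get pods -A -o wide` shown above (pod name in column 2, node name in column 8). The helper name etcd_pods_on_nodes and the file name notready.txt are hypothetical and not part of kubectl.

```shell
# etcd_pods_on_nodes NODES_FILE
#   NODES_FILE: node names to check, one per line
#   stdin:      output of `kubectl get pods -A -o wide`
# Prints every etcd pod that is scheduled on one of the listed nodes.
# Empty output means none of the listed nodes hosts an etcd member.
etcd_pods_on_nodes() {
  awk '
    NR == FNR { bad[$1] = 1; next }   # first input: remember the node names
    $2 ~ /^etcd-/ && ($8 in bad) {    # second input: etcd pod on a listed node
      print $2 " runs on " $8
    }
  ' "$1" -
}

# Example (assumes KUBECONFIG points at the TKC):
#   kubectl get nodes | grep NotReady | awk '{print $1}' > notready.txt
#   kubectl get pods -A -o wide | etcd_pods_on_nodes notready.txt
```

If the function prints anything, it is worth stopping to look at the etcd membership before deleting that node.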

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq
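
The grep above matches NotReady anywhere on the line. A slightly stricter variant, sketched here under the assumption that STATUS is the second column of `kubectl get nodes`, matches only that column and also catches combined states such as NotReady,SchedulingDisabled. The function name not_ready_nodes is hypothetical.

```shell
# Reads `kubectl get nodes` output on stdin and prints the names of nodes
# whose STATUS column (column 2) contains NotReady, for example "NotReady"
# or "NotReady,SchedulingDisabled". The header row is skipped.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /(^|,)NotReady(,|$)/ { print $1 }'
}

# Example (assumes KUBECONFIG points at the TKC):
#   kubectl get nodes | not_ready_nodes
```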

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----
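
The one-liner above deletes each node even if its drain fails. As a sketch only, the same loop can be written so that the delete runs only after a successful drain, with a bounded wait via the drain --timeout flag. The KUBECTL variable and the function name drain_and_delete are assumptions made for this example; setting KUBECTL=echo prints the commands instead of running them.

```shell
# KUBECTL can be overridden; KUBECTL=echo gives a dry run that only prints
# the commands. It is expected to be a single command name.
KUBECTL=${KUBECTL:-kubectl}

# drain_and_delete NODE...
# Drains each node, waiting at most 120s, and deletes the Node object only
# if the drain succeeded.
drain_and_delete() {
  for node in "$@"; do
    echo "== $node"
    if $KUBECTL drain "$node" --ignore-daemonsets --timeout=120s; then
      $KUBECTL delete node "$node"
    else
      echo "drain failed for $node, skipping delete" >&2
    fi
  done
}

# Example dry run:
#   KUBECTL=echo; drain_and_delete gc-control-plane-tnhv9 gc-control-plane-tqvnk
```

Export KUBECONFIG for the TKC before a real run, since the gcc alias used in this post does not apply inside the function.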

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc updating 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3

I waited for a few minutes to see whether the reconciliation process would proceed and change the status of the TKC from updating to running, but it was still stuck in the updating phase. So I described the TKC.

Conditions:
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: Ready
Last Transition Time: 2023-01-01T19:19:45Z
Status: True
Type: AddonsReady
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: ControlPlaneReady
Last Transition Time: 2022-07-24T15:53:06Z
Status: True
Type: NodePoolsReady
Last Transition Time: 2022-09-01T09:02:26Z
Message: 3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
Status: True
Type: NodesHealthy
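
With many conditions it can be easier to print only the ones that are not satisfied. The following is a rough sketch that reads the Conditions text of the describe output, assuming each condition lists its Message and Status before its Type, as in the output above. The function name failing_conditions is hypothetical.

```shell
# Reads the Conditions section of a describe output on stdin and prints
# "<Type>: <Message>" for every condition whose Status is False.
# Assumes Message and Status appear before Type within each condition.
failing_conditions() {
  awk '
    $1 == "Message:" { sub(/^[ \t]*Message:[ \t]*/, ""); msg = $0; next }
    $1 == "Status:"  { status = $2; next }
    $1 == "Type:"    {
      if (status == "False") print $2 ": " msg
      msg = ""; status = ""             # reset for the next condition
    }
  '
}
```

For the conditions shown above, this would list Ready and ControlPlaneReady, both with the RollingUpdateInProgress message.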

I then checked the vmop (VM Operator) logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

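The log shows that the readiness probe against the API server port (6443) of CP node gc-control-plane-2rbsb is failing, so something is wrong with that node. Let's look at the etcd pods again.

❯ gcc kg po -A -o wide | grep etcd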
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>
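
The READY column (0/1) and the RESTARTS count (811) are what give the unhealthy etcd member away. As an illustrative sketch, assuming the `kubectl get pods -A -o wide` column layout shown above, the following prints every pod with fewer ready containers than it has in total. The function name unready_pods is hypothetical.

```shell
# Reads `kubectl get pods -A -o wide` output on stdin and prints pods whose
# READY column (column 3, "ready/total") has fewer ready containers than
# total, together with the RESTARTS count (column 5).
unready_pods() {
  awk 'split($3, r, "/") == 2 && r[1] + 0 < r[2] + 0 {
         print $2 ": ready " $3 ", restarts " $5
       }'
}
```

Note that completed pods also show up as 0/1, so the list may need a filter on the STATUS column when it is run against all namespaces.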

You can see that the etcd pod on the first control plane node is not ready (0/1) and is getting restarted continuously (811 restarts). So let's try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready,SchedulingDisabled control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

Now let's delete its corresponding Machine object on the Supervisor cluster.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


After a few minutes you can see that a new machine and the corresponding node are provisioned, and the TKC changes from the updating to the running phase.

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-control-plane-dnr66 gc Provisioning 13s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 124d v1.20.9+vmware.1
gc-control-plane-dnr66 gc gc-control-plane-dnr66 vsphere://42011228-b156-3338-752a-e7233c9258dd Running 2m2s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 NotReady control-plane,master 35s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 Ready control-plane,master 53s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc running 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3
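
Instead of re-running the commands by hand while waiting, a small polling helper can be used. This is a generic sketch and not something the TKC tooling provides; the function name wait_until is hypothetical.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns 0 on success, 1 otherwise.
# The body runs in a subshell so its variables do not leak to the caller.
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)

# Example (assumes kgtkca stands for `kubectl get tkc -A` on the Supervisor):
#   wait_until 30 10 sh -c 'kubectl get tkc -A | grep jtimothy-napp01 | grep -qw running'
```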
Hope it was useful. Cheers!

Saturday, December 18, 2021

vSphere with Tanzu using NSX-T - Part13 - Export WCP admin kubeconfig

This article shows the steps to export the WCP admin kubeconfig file from the Supervisor control plane VM. This admin kubeconfig file can be used to manage the Supervisor (WCP) K8s cluster.

Step1: SSH as root to the vCenter server.

Step2: Run the script /usr/lib/vmware-wcp/decryptK8Pwd.py and make a note of the IP and PWD.

Step3: SSH as root to the IP that you noted down in step2, and provide the password from step2 when prompted.

Step4: You can now copy the admin kubeconfig file /etc/kubernetes/admin.conf to your local machine. Make sure to change the field server: https://127.0.0.1:6443 in your local admin.conf file to the IP that you got from step2 (server: https://IP_from_step2:6443).
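
The edit in step4 can also be scripted. Here is a minimal sketch using sed, assuming the copied file contains the line server: https://127.0.0.1:6443 exactly as described above. The function name set_kubeconfig_server and the output file name sv.kubeconfig are hypothetical.

```shell
# set_kubeconfig_server FILE IP
# Prints FILE with the loopback API server address replaced by IP.
# The input file itself is left untouched.
set_kubeconfig_server() {
  sed "s#server: https://127\.0\.0\.1:6443#server: https://${2}:6443#" "$1"
}

# Example (replace IP_from_step2 with the IP noted in step2):
#   set_kubeconfig_server admin.conf IP_from_step2 > sv.kubeconfig
```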

Note: If you are managing multiple WCP clusters, you can merge all the kubeconfig files. Refer to this blog post by Jacob Tomlinson for more details.

Hope it was useful. Cheers!

Friday, December 10, 2021

ESXi in a HA cluster fails to Enter Maintenance Mode and gets stuck

Recently we came across a situation where an ESXi host got stuck at a certain point when we tried to put it in Maintenance Mode. These ESXi nodes were part of a vSphere with Tanzu 7 U3 cluster. While troubleshooting, we noticed that some orphaned or inaccessible VMs were running on the host. After we deleted those orphaned and inaccessible VMs, the ESXi node entered Maintenance Mode successfully.

You can use VMware PowerCLI to list those orphaned and inaccessible VMs.

(Get-VMHost <host_fqdn> | Get-VM | Where {$_.ExtensionData.Summary.Runtime.ConnectionState -eq "orphaned"}) | select Name,Id,PowerState

(Get-VMHost <host_fqdn> | Get-VM | Where {$_.ExtensionData.Summary.Runtime.ConnectionState -eq "inaccessible"}) | select Name,Id,PowerState

You can try to delete those orphaned and inaccessible VMs using the Remove-VM cmdlet.

Remove-VM -VM <vm_name> -DeletePermanently 

If that does not work, you can try with dcli.

dcli> com vmware vcenter vm delete --vm <vm-id>

Hope it was useful.

Friday, November 5, 2021

vSphere with Tanzu using NSX-T - Part12 - Deploy application on TKC and access it

This article walks you through the steps to deploy an application on a Tanzu Kubernetes Cluster (TKC) and how to access it. I will try to explain how this all works under the hood.

Here I have a TKC cluster as shown below: 

% KUBECONFIG=gc.kubeconfig kg nodes                    
NAME                               STATUS   ROLES                  AGE   VERSION
gc-control-plane-pwngg             Ready    control-plane,master   49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 49d   v1.20.9+vmware.1

% KUBECONFIG=gc.kubeconfig kg nodes -o wide
NAME                               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION       CONTAINER-RUNTIME
gc-control-plane-pwngg             Ready    control-plane,master   49d   v1.20.9+vmware.1   172.29.21.194   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.195   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.196   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.197   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6

01 Create a namespace

% KUBECONFIG=gc.kubeconfig k create ns webserver
namespace/webserver created

% KUBECONFIG=gc.kubeconfig kg ns                
NAME                           STATUS   AGE
default                        Active   48d
kube-node-lease                Active   48d
kube-public                    Active   48d
kube-system                    Active   48d
vmware-system-auth             Active   48d
vmware-system-cloud-provider   Active   48d
vmware-system-csi              Active   48d
webserver                      Active   10s

02 Deploy nginx application

Following is the nginx-deployment.yaml spec to deploy nginx application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80

You can apply the yaml file as below:

% KUBECONFIG=gc.kubeconfig k apply -f nginx-deployment.yaml -n webserver
deployment.apps/my-nginx created

% KUBECONFIG=gc.kubeconfig kg deploy -n webserver                     
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
my-nginx   0/2     0            0           3m3s

% KUBECONFIG=gc.kubeconfig kg events -n webserver
LAST SEEN   TYPE      REASON              OBJECT                           MESSAGE
26s         Warning   FailedCreate        replicaset/my-nginx-74d7c6cb98   Error creating: pods "my-nginx-74d7c6cb98-" is forbidden: PodSecurityPolicy: unable to admit pod: []
3m10s       Normal    ScalingReplicaSet   deployment/my-nginx              Scaled up replica set my-nginx-74d7c6cb98 to 2

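You can see that the pods failed to get created due to PodSecurityPolicy: no policy admits them. Following is the psp.yaml spec that creates a ClusterRole and a ClusterRoleBinding allowing all service accounts to use the vmware-system-privileged PodSecurityPolicy.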

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp:privileged
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  verbs:     ['use']
  resourceNames:
  - vmware-system-privileged
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: all:psp:privileged
roleRef:
  kind: ClusterRole
  name: psp:privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io

Apply the yaml file as shown below:

% KUBECONFIG=gc.kubeconfig k apply -f psp.yaml
clusterrole.rbac.authorization.k8s.io/psp:privileged created
clusterrolebinding.rbac.authorization.k8s.io/all:psp:privileged created

Now, in a few minutes, you can see that the deployment succeeds and two nginx pods are running in the webserver namespace.

% KUBECONFIG=gc.kubeconfig kg deploy -n webserver
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
my-nginx   2/2     2            2           80m

% KUBECONFIG=gc.kubeconfig kg pods -n webserver -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP                NODE                               NOMINATED NODE   READINESS GATES
my-nginx-74d7c6cb98-lzghr   1/1     Running   0          67m   192.168.213.132   gc-workers-wrknn-f675446b6-rsf6n   <none>           <none>
my-nginx-74d7c6cb98-s59dt   1/1     Running   0          67m   192.168.67.196    gc-workers-wrknn-f675446b6-f6zqs   <none>           <none>
 

03 Access the application

You can access the application in many ways depending on the use case.

---Port-forward---

% KUBECONFIG=gc.kubeconfig kubectl port-forward deployment/my-nginx -n webserver 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080

The deployment is port-forwarded now. If you open another terminal and do curl localhost:8080, you can see the nginx webpage.

% curl localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

You can also open http://localhost:8080/ in a web browser and you will get the same nginx webpage. Port-forwarding is fine in a local dev/test scenario, but you might not want to use it in a production setup. There you will need to create a service that exposes the application, and access the application through that service.

Services

The three commonly used types of services in Kubernetes are:

  1. NodePort: Similar to port forwarding; a port on every node is forwarded to the target port of the pods where the application is running.
  2. ClusterIP: This is useful if you want to access the application from within the cluster.
  3. LoadBalancer: This is used to provide access to external users. In my case, NSX-T will be providing this access.

---Service NodePort---

Following is the yaml spec file for a service of type NodePort:

% cat nginx-service-np.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  type: NodePort
  ports:
  - targetPort: 80
    port: 80
    protocol: TCP
  selector:
    run: my-nginx

Apply the above yaml file.

% KUBECONFIG=gc.kubeconfig k apply -f nginx-service-np.yaml -n webserver
service/my-nginx created 

% KUBECONFIG=gc.kubeconfig kg svc -n webserver               
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.111.182.155   <none>        80:30741/TCP   4s

% KUBECONFIG=gc.kubeconfig kg ep -n webserver               
NAME       ENDPOINTS                              AGE
my-nginx   192.168.213.132:80,192.168.67.196:80   32m

As you can see, a service (my-nginx) of type NodePort is created, and the application should now be accessible on port 30741 of any worker node. To verify it, we first need connectivity to the worker node IPs. For that, we need a jumpbox pod deployed in the supervisor namespace; from the jumpbox pod we can reach (and SSH to) the TKC nodes. You can follow my previous post to see how to create a jumpbox pod, and the VMware documentation describes how to SSH to TKC nodes.

% KUBECONFIG=sv.kubeconfig k exec -it jumpbox -- sh
sh-4.4#     
sh-4.4# curl 172.29.21.197:30741
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
sh-4.4#
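
The port used in the curl above (30741) is the second number in the PORT(S) column, which has the form port:nodePort/protocol. As a small sketch, assuming the `kubectl get svc` column layout shown earlier, the node ports can be pulled out like this. The function name node_ports is hypothetical.

```shell
# Reads `kubectl get svc` output on stdin and prints "<service> <nodePort>"
# for every PORT(S) entry of the form port:nodePort/protocol.
# Entries without a node port (for example 80/TCP) are skipped.
node_ports() {
  awk 'NR > 1 {
         n = split($5, entries, ",")
         for (i = 1; i <= n; i++)
           if (split(entries[i], p, "[:/]") == 3) print $1, p[2]
       }'
}
```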

---Service ClusterIP---

A service of type ClusterIP is accessible only within the TKC. So, I will need a jumpbox/test pod within the TKC and connect from there. First let me edit the svc my-nginx to change its type from NodePort to ClusterIP.

% KUBECONFIG=gc.kubeconfig k edit svc my-nginx -n webserver
service/my-nginx edited

% KUBECONFIG=gc.kubeconfig kg svc -n webserver             
NAME       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
my-nginx   ClusterIP   10.111.182.155   <none>        80/TCP    39m

I have already deployed a pod inside the TKC. As you can see, dnsutils is the pod that is deployed in the default namespace. We will connect to this pod and from there we can curl the Cluster-IP of the my-nginx service.

% KUBECONFIG=gc.kubeconfig kg pods                  
NAME       READY   STATUS    RESTARTS   AGE
dnsutils   1/1     Running   1          105m

% KUBECONFIG=gc.kubeconfig k exec -it dnsutils -- sh
#
# curl 10.111.182.155:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
#

Note: This service of type ClusterIP can be accessed only within the TKC, and not externally!

---Service LoadBalancer---

This is the way to expose your service to external users. In this case NSX-T provides the external IP, and traffic to it is forwarded internally to the nginx pods through the my-nginx service.

I have edited the service my-nginx from type ClusterIP to LoadBalancer.

% KUBECONFIG=gc.kubeconfig k edit svc my-nginx -n webserver
service/my-nginx edited

% KUBECONFIG=gc.kubeconfig kg svc -n webserver             
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   LoadBalancer   10.111.182.155   <pending>     80:32398/TCP   56m

% KUBECONFIG=gc.kubeconfig kg svc -n webserver
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
my-nginx   LoadBalancer   10.111.182.155   10.186.148.170   80:32398/TCP   56m

You can see that the service now has an external IP. The endpoints of the service, shown below, are the nginx pod IPs.

% KUBECONFIG=gc.kubeconfig kg ep -n webserver
NAME       ENDPOINTS                              AGE
my-nginx   192.168.213.132:80,192.168.67.196:80   58m

% curl 10.186.148.170
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

I could also use the external IP 10.186.148.170 in a web browser to access the nginx webpage.

Now let's have a look at what is in the supervisor namespace. This TKC was created under the supervisor namespace "vineetha-test04-deploy".

% kubectl get svc -n vineetha-test04-deploy
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
gc-ba320a1e3e04259514411   LoadBalancer   172.28.5.217    10.186.148.170   80:31143/TCP     40h
gc-control-plane-service   LoadBalancer   172.28.9.37     10.186.149.120   6443:31639/TCP   51d

% kubectl get ep -n vineetha-test04-deploy  
NAME                       ENDPOINTS                                                     AGE
gc-ba320a1e3e04259514411   172.29.21.195:32398,172.29.21.196:32398,172.29.21.197:32398   40h
gc-control-plane-service   172.29.21.194:6443                                            51d

So what you are seeing is that, for a service of type LoadBalancer created inside the TKC, a service of type LoadBalancer (gc-ba320a1e3e04259514411) is automatically created under the supervisor namespace, and its endpoints are the IP addresses of the TKC worker nodes on the node port (32398) of the my-nginx service.
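
To make that mapping concrete, the sketch below builds the endpoint list one would expect on the Supervisor side from the worker node IPs and the node port of the TKC service. It is only an illustration of the relationship seen in the outputs above; the function name expected_endpoints is hypothetical.

```shell
# expected_endpoints NODE_PORT WORKER_IP...
# Prints the comma-separated endpoint list expected for the Supervisor-side
# service: every worker node IP paired with the node port of the TKC service.
expected_endpoints() {
  port=$1
  shift
  list=""
  for ip in "$@"; do
    list="${list:+${list},}${ip}:${port}"
  done
  printf '%s\n' "$list"
}

# Example with the values from this post:
#   expected_endpoints 32398 172.29.21.195 172.29.21.196 172.29.21.197
```

The result can be compared with the ENDPOINTS column of `kubectl get ep` in the supervisor namespace.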


On the NSX-T side you can see the LB for my supervisor namespace, virtual servers in it, and server pool members in the virtual server.

I hope it was useful. Cheers! 

Tuesday, August 3, 2021

vSphere with Tanzu using NSX-T - Part10 - Upgrade K8s version of Tanzu Kubernetes cluster

In the previous posts we discussed the following:

Part1 - Prerequisites

Part2 - Configure NSX

Part3 - Edge Cluster

Part4 - Tier-0 Gateway and BGP peering

Part5 - Tier-1 Gateway and Segments

Part6 - Create tags, storage policy, and content library

Part7 - Enable workload management


In this article, I will explain how to upgrade the K8s version of a Tanzu Kubernetes cluster.

Verify the current K8s version of the Tanzu Kubernetes cluster.


Check available Tanzu Kubernetes versions.


Edit the cluster manifest file.


Here we are updating from 1.18.5 to 1.18.15.
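
Note that 1.18.15 is the newer release even though a plain string comparison would order it before 1.18.5. As a quick sketch, a version-aware comparison can be done with sort -V, assuming a sort that supports version sort (GNU coreutils does). The function name version_gt is hypothetical.

```shell
# version_gt A B
# Succeeds if version A is strictly newer than version B.
# Relies on `sort -V` (version sort).
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example:
#   version_gt 1.18.15 1.18.5 && echo "1.18.15 is an upgrade from 1.18.5"
```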


Save the manifest file.


You can see the corresponding cluster starts updating.


The cluster will get updated to the newer version in a rolling fashion. The control plane node gets updated first, followed by the worker nodes one by one. A new node will be added to the cluster with new version, and an old node will be removed from the cluster.
 


Verify.

As you can see, the tkg-cluster-02 is upgraded from 1.18.5 to 1.18.15.


Hope it was useful. Cheers!

Monday, June 21, 2021

Validate your Kubernetes cluster using Sonobuoy

Sonobuoy is a diagnostic tool that helps to validate the state of a Kubernetes cluster by running a set of tests in an accessible and non-destructive manner. By default, Sonobuoy runs the Kubernetes conformance tests. Conformance testing checks that a cluster is properly configured, that its behavior conforms to the official Kubernetes specifications, and that it provides the minimal set of required features. The conformance tests are a subset of the end-to-end (e2e) tests and should pass on any Kubernetes cluster.

A cluster that passes conformance gives you assurance that your Kubernetes is configured as expected. Around 275 tests need to pass to qualify for Kubernetes conformance.

Install Sonobuoy

wget https://github.com/vmware-tanzu/sonobuoy/releases/download/v0.51.0/sonobuoy_0.51.0_linux_amd64.tar.gz
tar -xvf sonobuoy_0.51.0_linux_amd64.tar.gz

Note: I am installing Sonobuoy on CentOS Linux release 7.9.2009 (Core).


Help
/root/sonobuoy --help

Run Sonobuoy
/root/sonobuoy run --wait

Note: e2e test takes around 60-90 minutes to complete.


Sonobuoy Objects
kubectl get all -n sonobuoy


kubectl get pods -n sonobuoy -o wide


Sonobuoy Status
/root/sonobuoy status
/root/sonobuoy status --json
/root/sonobuoy status --json | jq

Note: If you get "bash: jq: command not found..." while using jq, follow this blog to install jq.


Inspect Logs
/root/sonobuoy logs

Sonobuoy Results
results=$(/root/sonobuoy retrieve)


/root/sonobuoy results $results
/root/sonobuoy results <tar ball file>



See passed/ failed tests
/root/sonobuoy results <tar ball file> --mode=detailed | jq 'select(.status=="passed")'
/root/sonobuoy results <tar ball file> --mode=detailed | jq 'select(.status=="failed")'


List the conformance tests
/root/sonobuoy results <tar ball file> --mode=detailed| jq 'select(.name | contains("[Conformance]"))'

Cleanup
/root/sonobuoy delete --wait


References

https://github.com/vmware-tanzu/sonobuoy
https://sonobuoy.io/docs/v0.51.0/


Sunday, May 30, 2021

vSphere with Tanzu using NSX-T - Part8 - Create namespace and deploy Tanzu Kubernetes Cluster

In the previous posts we discussed the following:

vSphere with Tanzu using NSX-T - Part1 - Prerequisites

vSphere with Tanzu using NSX-T - Part2 - Configure NSX

vSphere with Tanzu using NSX-T - Part3 - Edge Cluster

vSphere with Tanzu using NSX-T - Part4 - Tier-0 Gateway and BGP peering

vSphere with Tanzu using NSX-T - Part5 - Tier-1 Gateway and Segments

vSphere with Tanzu using NSX-T - Part6 - Create tags, storage policy, and content library

vSphere with Tanzu using NSX-T - Part7 - Enable workload management


Now that we have enabled workload management, the next step is to create namespaces on the supervisor cluster and set resource quotas as per requirements. The vSphere administrator can then give developers access to these namespaces, where they can deploy Tanzu Kubernetes clusters, VMs, or vSphere pods.

  • Create namespace.

  • Select the cluster and provide a name for the namespace.

  • Now the namespace is created successfully. Before handing over this namespace to the developer, you can set permissions, assign storage policies, and set resource limits.

Let's have a look at the NSX-T components that were instantiated when we created the new namespace.
  • A new segment is created for the namespace. This segment is connected to the T1 Gateway of the supervisor cluster.

  • A SNAT rule is also in place on the supervisor cluster T1 Gateway. This lets the Kubernetes objects residing in the namespace reach the external network/internet. It uses the egress range 192.168.72.0/24 that we provided during the workload management configuration for address translation.

We can now assign a storage policy to this newly created namespace.

  • Click on Add Storage and select the storage policy. In my case, I am using Tanzu Storage Policy which uses a vsanDatastore.

Let's apply some capacity and usage limits for this namespace. Click edit limits and provide the values.


Let's set user permissions for this newly created namespace. Click add permissions.


Now we are ready to hand over this new namespace to the dev user (John).


The first tile has a copy link option. Share this link with the dev user, who can open it in a web browser to get the CLI tools needed to connect to the newly created namespace.


Download and install the CLI tools. In my case, CLI tools are installed on a CentOS 7.x VM. You can also see the user John has connected to the newly created namespace using the CLI.


The user can now verify the resource limits of the namespace using kubectl.


You can see the following limits:
  • cpu-limit: 21.818
  • memory-limit: 131072Mi
  • storage: 500Gi
Storage is limited to 500 GB and memory to 128 GB, which is straightforward. We (vSphere admin) set the CPU limit to 48 GHz, yet the namespace shows a cpu-limit of 21.818 CPU cores. To give some background on this calculation: each ESXi host that I am using for this study has 20 physical cores and a total CPU capacity of 44 GHz, and there are 4 such hosts in the cluster. The computing power of one physical core is therefore (44/20) = 2.2 GHz. So, to limit the CPU to 48 GHz, the number of CPU cores is limited to (48/2.2) = 21.818.
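
The same arithmetic can be reproduced on the command line. This is a small sketch using awk; cpu_cores_for_limit is a hypothetical helper name, and the inputs are the figures quoted above.

```shell
# Hypothetical helper: convert a namespace CPU limit in GHz into the
# equivalent number of CPU cores.
# Arguments: <limit in GHz> <host capacity in GHz> <physical cores per host>
cpu_cores_for_limit() {
  awk -v limit="$1" -v capacity="$2" -v cores="$3" \
    'BEGIN { printf "%.3f\n", limit / (capacity / cores) }'
}

# Values from this post: 48 GHz limit, 44 GHz per host, 20 cores per host.
cpu_cores_for_limit 48 44 20
```

With these values it prints 21.818, matching the cpu-limit reported for the namespace.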

Apply the following cluster definition yaml file to create a Tanzu Kubernetes cluster under the ns-01-dev-john namespace.

apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
 name: tkg-cluster-01
 namespace: ns-01-dev-john
spec:
 topology:
   controlPlane:
     count: 3
     class: guaranteed-medium
     storageClass: tanzu-storage-policy
   workers:
     count: 3
     class: guaranteed-xlarge
     storageClass: tanzu-storage-policy
 distribution:
   version: v1.18.15
 settings:
  network:
   services:
    cidrBlocks: ["198.32.1.0/12"]
   pods:
    cidrBlocks: ["192.1.1.0/16"]
   cni:
    name: calico
  storage:
   defaultClass: tanzu-storage-policy
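
Applying the definition could look like the sketch below. It assumes the YAML above has been saved to a file and that the current kubectl context points at the supervisor cluster; deploy_tkc and the file name used in the usage line are hypothetical.

```shell
# Hypothetical helper: apply a TanzuKubernetesCluster manifest and then
# list the clusters in the namespace to check the phase and version.
# Arguments: <manifest file> <supervisor namespace>
deploy_tkc() {
  manifest="$1"
  namespace="$2"
  kubectl apply -f "$manifest" &&
    kubectl get tanzukubernetescluster -n "$namespace"
}
```

For example: deploy_tkc tkg-cluster-01.yaml ns-01-dev-john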


Log in to the Tanzu Kubernetes cluster directly using the CLI and verify.
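
A login along these lines should work. This is a sketch rather than the exact commands used here: it assumes the vSphere plugin for kubectl from the CLI tools is installed, login_tkc is a hypothetical helper name, and the supervisor address and username are left as arguments because they are specific to your environment.

```shell
# Hypothetical helper: log in to the Tanzu Kubernetes cluster, switch to
# its context, and list the nodes.
# Arguments: <supervisor cluster address> <vSphere username>
login_tkc() {
  server="$1"
  user="$2"
  kubectl vsphere login --server "$server" \
    --vsphere-username "$user" \
    --tanzu-kubernetes-cluster-namespace ns-01-dev-john \
    --tanzu-kubernetes-cluster-name tkg-cluster-01 &&
    kubectl config use-context tkg-cluster-01 &&
    kubectl get nodes
}
```

In a lab with self-signed certificates you may also need to add --insecure-skip-tls-verify to the login command.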


You can see the corresponding VMs in the vCenter UI.


Now, let's have a look at the NSX-T side.
  • A Tier-1 Gateway is now available with a segment linked to it.


  • You can see a server load balancer with one virtual server that provides access to KubeAPI (6443) of the Tanzu Kubernetes cluster that we just deployed.


  • You can also find a SNAT rule. This lets the Tanzu Kubernetes cluster objects reach the external network/internet. It uses the egress range 192.168.72.0/24 that we provided during the workload management configuration for address translation.

Note: This architecture is explained on the basis of vSphere 7 U1, and newer versions differ. With vSphere 7 U1c the architecture changed from a per-TKG cluster Tier-1 Gateway model to a per-Supervisor namespace Tier-1 Gateway model. For more details, feel free to refer to the blog series published by Harikrishnan T @hari5611.

In the next part we will discuss monitoring aspects of vSphere with Tanzu environment and Tanzu Kubernetes clusters. I hope this was useful. Cheers!