vineethac.blogspot.com: NSX-T

Showing posts with label NSX-T. Show all posts

Saturday, May 21, 2022

vSphere with Tanzu using NSX-T - Part15 - Working with etcd on TKC with one control plane

In this article, we will see how to work with etcd database of a Tanzu Kubernetes Cluster (TKC) with one control plane node and perform some basic operations. Following is a TKC with one control plane node and three worker nodes:

Get K8s cluster nodes

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE     VERSION
gc-control-plane-6g9gk              Ready    control-plane,master   3d7h    v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-n5qp8   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-wds2m   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-z2wvt   Ready    <none>                 7d19h   v1.21.6+vmware.1

Get the etcd pod and describe it

❯ gcc kg pod -A | grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          3d7h
❯
❯ gcc kd pod etcd-gc-control-plane-6g9gk -n kube-system
Name:                 etcd-gc-control-plane-6g9gk
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gc-control-plane-6g9gk/100.68.36.38
Start Time:           Tue, 19 Jul 2022 11:14:25 +0530
Labels:               component=etcd
                      tier=control-plane
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
                      kubernetes.io/config.hash: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.mirror: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.seen: 2022-07-19T05:44:19.416549595Z
                      kubernetes.io/config.source: file
                      kubernetes.io/psp: vmware-system-privileged
Status:               Running
IP:                   100.68.36.38
IPs:
  IP:           100.68.36.38
Controlled By:  Node/gc-control-plane-6g9gk
Containers:
  etcd:
    Container ID:  containerd://253c7b25bd60ea78dfccad52d03534785f0d7b7a1fa7105dbd55d7727f8785c3
    Image:         localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    Image ID:      sha256:78661ebbe1adaee60336a0f8ff031c4537ff309ef51feab6e840e7dbb3cbf47d
    Port:          <none>
    Host Port:     <none>
    Command:
      etcd
      --advertise-client-urls=https://100.68.36.38:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://100.68.36.38:2380
      --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
      --initial-cluster-state=existing
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://100.68.36.38:2380
      --name=gc-control-plane-6g9gk
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Running
      Started:      Tue, 19 Jul 2022 11:14:27 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     100Mi
    Liveness:     http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:      http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment:  <none>
    Mounts:
      /etc/kubernetes/pki/etcd from etcd-certs (rw)
      /var/lib/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate
  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute op=Exists
Events:            <none>

Exec into the etcd pod and run etcdctl commands

You can use etcdctl and you need to provide cacert, cert, and key details. All these info you will get while describing the etcd pod.

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
❯
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint health --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --write-out json"
[{"endpoint":"127.0.0.1:2379","health":true,"took":"9.689387ms"}]
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key -w json"
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":4073335150581888229,"member_id":14250600431682432472,"revision":2153804,"raft_term":11},"version":"3.4.13","dbSize":24719360,"leader":14250600431682432472,"raftIndex":2429139,"raftTerm":11,"raftAppliedIndex":2429139,"dbSizeInUse":2678784}}]

Snapshot etcd

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot save snapshotdb-$(date +%d-%m-%y)"
{"level":"info","ts":1658651541.2698474,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb-24-07-22.part"}
{"level":"info","ts":"2022-07-24T08:32:21.277Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1658651541.2771788,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-07-24T08:32:21.594Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1658651541.621639,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"25 MB","took":0.344746859}
{"level":"info","ts":1658651541.621852,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb-24-07-22"}
Snapshot saved at snapshotdb-24-07-22
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh
# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	proc  root  run  sbin  snapshotdb-24-07-22  srv  sys  tmp  usr	var
#
# exit
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot status snapshotdb-24-07-22 -w table"
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+
❯

Copy snapshot file from etcd pod to local machine

Note: Even though there was an error while copying the snapshot file from the pod to local machine, you can see the file was successfully copied and I also verified the snapshot file status using etcdctl. Every field (hash, total_keys, etc.) matches with that of the source file.

❯ gcc kubectl cp kube-system/etcd-gc-control-plane-6g9gk:/snapshotdb-24-07-22 snapshotdb-24-07-22
tar: Removing leading `/' from member names
error: unexpected EOF
❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 -w table
Deprecated: Use `etcdutl snapshot status` instead.

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+       
❯

Restore etcd

We can restore the etcd snapshot using etcdctl from the TKC control plane node. Inorder to connect to the control plane VM, we need to create a jumpbox pod under the corresponding supervisor namespace.

So, first copy the snapshot file from local machine to jumpbox pod.

❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ kubectl cp snapshotdb-24-07-22 vineetha-test04-deploy/jumpbox01:/
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- sh
sh-4.4# su
root [ / ]# ls
bin   dev  home  lib64  mnt   root  sbin                 srv  tmp  var
boot  etc  lib   media  proc  run   snapshotdb-24-07-22  sys  usr
root [ / ]#

Copy the snapshot file from jumpbox pod to control plane node.

❯ gcc kg po -A -o wide| grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          166m   100.68.36.38   gc-control-plane-6g9gk              <none>           <none>
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- scp /snapshotdb-24-07-22 vmware-system-user@100.68.36.38:/tmp
snapshotdb-24-07-22                           100%   20MB 126.1MB/s   00:00
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- /usr/bin/ssh vmware-system-user@100.68.36.38
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Sun Jul 24 13:02:39 2022 from 100.68.35.210
 13:14:29 up 5 days,  7:38,  0 users,  load average: 0.98, 0.53, 0.31

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# cd /tmp/

Install etcd on the control plane node, so that we get to access etcdctl utility.

root [ /tmp ]# tdnf install etcd
root [ /tmp ]# ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
root [ /tmp ]#
root [ /tmp ]# ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
Deprecated: Use `etcdutl snapshot status` instead.

b0910e83, 2434362, 1580, 25 MB
root [ /tmp ]# hostname
gc-control-plane-6g9gk
root [ /tmp ]# ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot restore /tmp/snapshotdb-24-07-22 --data-dir=/var/lib/etcd-backup --skip-hash-check=true
Deprecated: Use `etcdutl snapshot restore` instead.

2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:251	restoring snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/snapshot/v3_snapshot.go:257\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/command/snapshot_command.go:128\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/main.go:59\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
2022-07-24T13:20:44Z	info	membership/store.go:141	Trimming membership information from the backend...
2022-07-24T13:20:44Z	info	membership/cluster.go:421	added member	{"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]}
2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:272	restored snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap"}
root [ /var/lib ]#

We have restored the database snapshot to a new location: --data-dir=/var/lib/etcd-backup. So we need to modify the etcd-data hostpath to path: /var/lib/etcd-backup in the etcd static pod manifest file (etcd.yaml). Copy the contents of etcd.yaml file.

root [ /var/lib ]# cd /etc/kubernetes/manifests/
root [ /etc/kubernetes/manifests ]# ls
etcd.yaml            kube-controller-manager.yaml  registry.yaml
kube-apiserver.yaml  kube-scheduler.yaml
root [ /etc/kubernetes/manifests ]# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://100.68.36.38:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://100.68.36.38:2380
    - --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://100.68.36.38:2380
    - --name=gc-control-plane-6g9gk
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}

I was having difficulties in the terminal to modify it. So I copied the contents of etcd.yaml file locally, modified the path, removed the existing etcd.yaml file, created new etcd.yaml file, and pasted the modifed content in it.

root [ /etc/kubernetes/manifests ]# rm etcd.yaml
root [ /etc/kubernetes/manifests ]# vi etcd.yaml

<paste the above etcd.yaml file contents, with modified etcd-data hostPath, last part of the yaml will look like below:

 volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd-backup
      type: DirectoryOrCreate
    name: etcd-data
status: {}

>

Once the etcd.yaml is saved, after few seconds you can see that etcd pod will be running.

root [ /etc/kubernetes/manifests ]# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
92dad43a85ebc       78661ebbe1ada       1 second ago        Running             etcd                           0                   8258804cf17bb
5c704092eb4bb       25605c4ab20fe       10 seconds ago      Running             csi-resizer                    10                  1fa9feb732df5
495b5ff250cb4       05cfd9e3c3f22       10 seconds ago      Running             csi-provisioner                22                  1fa9feb732df5
c0da43af7f1d0       a145efcc3afb4       11 seconds ago      Running             vsphere-syncer                 10                  1fa9feb732df5
4e6d67dc16f4a       5cb2119a4d797       11 seconds ago      Running             kube-controller-manager        11                  3d1c266aa5e24
8710a7b6a8563       fa70d7ee973ad       11 seconds ago      Running             guest-cluster-cloud-provider   35                  3f3d5eb0929e5
4e0c1e2e72682       f18cde23836f5       11 seconds ago      Running             csi-attacher                   10                  1fa9feb732df5
bf6771ca4dc4d       a609b91a17410       11 seconds ago      Running             kube-scheduler                 11                  f50eada3f8127
05fa3f2f587e8       fa70d7ee973ad       3 hours ago         Exited              guest-cluster-cloud-provider   34                  3f3d5eb0929e5
888ba7ce34d92       3f6d2884f8105       3 hours ago         Running             kube-apiserver                 28                  c368bd9937f8b
6579ffffa5f53       382a8821c56e0       3 hours ago         Running             metrics-server                 21                  622c835648008
bdd0c760d0bc7       382a8821c56e0       3 hours ago         Exited              metrics-server                 20                  622c835648008
6063485d73e38       78661ebbe1ada       3 hours ago         Exited              etcd                           0                   71e6ae04726ad
259595a330d26       3f6d2884f8105       3 hours ago         Exited              kube-apiserver                 27                  c368bd9937f8b
5034f4ea18f1f       25605c4ab20fe       4 hours ago         Exited              csi-resizer                    9                   1fa9feb732df5
21de3c4850dc3       05cfd9e3c3f22       4 hours ago         Exited              csi-provisioner                21                  1fa9feb732df5
d623946b1d270       a145efcc3afb4       4 hours ago         Exited              vsphere-syncer                 9                   1fa9feb732df5
5f50b9e93d287       a609b91a17410       4 hours ago         Exited              kube-scheduler                 10                  f50eada3f8127
ec4e066f54fd6       5cb2119a4d797       4 hours ago         Exited              kube-controller-manager        10                  3d1c266aa5e24
f05fee251a700       f18cde23836f5       4 hours ago         Exited              csi-attacher                   9                   1fa9feb732df5
d3577bf8477d0       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   1fa9feb732df5
d30ba30b8c203       4251b7012fd43       5 days ago          Running             vsphere-csi-controller         0                   1fa9feb732df5
1816ed6aada5f       02abc4bd595a0       5 days ago          Running             guest-cluster-auth-service     0                   bff70bcd389be
3e014a6745f5a       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   66e5ca3abfe5b
ba82f29cf8939       4251b7012fd43       5 days ago          Running             vsphere-csi-node               0                   66e5ca3abfe5b
9c29960956718       f3fe18dd8cea2       5 days ago          Running             node-driver-registrar          0                   66e5ca3abfe5b
92aa3a904d72e       0515f8357a522       5 days ago          Running             antrea-agent                   3                   4ae63a9f8e4cb
9e0f32fa88663       0515f8357a522       5 days ago          Exited              antrea-agent                   2                   4ae63a9f8e4cb
ac5573a4d93f8       0515f8357a522       5 days ago          Running             antrea-ovs                     0                   4ae63a9f8e4cb
e14fea2c37c21       0515f8357a522       5 days ago          Exited              install-cni                    0                   4ae63a9f8e4cb
166627360a434       7fde82047d4f6       5 days ago          Running             docker-registry                0                   c84c550a9ab90
ecfbdcd23858d       f31127f4a3471       5 days ago          Running             kube-proxy                     0                   07f5b9be02414
root [ /etc/kubernetes/manifests ]# exit
exit
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ exit
logout

Verify

In my case I had a namespace vineethac-testing with two nginx pods running under it while the snapshot was taken. After the snapshot was taken, I deleted the two nginx pods and the namespace vineethac-testing. After restoring the etcd snapshot, I can see that the namespace vineethac-testing is active with two nginx pods under it.

❯ gcc kg ns
NAME                           STATUS   AGE
default                        Active   9d
kube-node-lease                Active   9d
kube-public                    Active   9d
kube-system                    Active   9d
vineethac-testing              Active   4h28m
vmware-system-auth             Active   9d
vmware-system-cloud-provider   Active   9d
vmware-system-csi              Active   9d
❯
❯ gcc kg pods -n vineethac-testing
NAME     READY   STATUS    RESTARTS   AGE
nginx1   1/1     Running   0          4h25m
nginx2   1/1     Running   0          4h23m

Hope it was useful. Cheers!

Note: I've tested this in a lab. This may not be the best practice procedure and may slightly vary in a real world environment.

Sunday, January 30, 2022

vSphere with Tanzu using NSX-T - Part14 - Testing TKC storage using kubestr

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster
Part11 - Troubleshooting TKC
Part12 - Deploy application on TKC and access it
Part13 - Export WCP admin kubeconfig

This article is about using kubestr to test storage options of Tanzu Kubernetes Cluster (TKC). Following are the steps to install kubestr on MAC:

wget https://github.com/kastenhq/kubestr/releases/download/v0.4.31/kubestr_0.4.31_MacOS_amd64.tar.gz
tar -xvf kubestr_0.4.31_MacOS_amd64.tar.gz
chmod +x kubestr
mv kubestr /usr/local/bin

Now, lets do kubestr help.

% kubestr help
kubestr is a tool that will scan your k8s cluster
       and validate that the storage systems in place as well as run
       performance tests.

Usage:
kubestr [flags]
kubestr [command]

Available Commands:
browse      Browse the contents of a CSI PVC via file browser
csicheck    Runs the CSI snapshot restore check
fio         Runs an fio test
help        Help about any command

Flags:
-h, --help             help for kubestr
-e, --outfile string   The file where test results will be written
-o, --output string    Options(json)

Use "kubestr [command] --help" for more information about a command.

I am going to use the following TKC for testing.

% KUBECONFIG=gc.kubeconfig kubectl get nodes
NAME                               STATUS   ROLES                  AGE    VERSION
gc-control-plane-pwngg             Ready    control-plane,master   103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 103d   v1.20.9+vmware.1

Let's run kubestr against the cluster now.

% KUBECONFIG=gc.kubeconfig kubestr

**************************************
_ ___   _ ___ ___ ___ _____ ___
| |/ / | | | _ ) __/ __|_   _| _ \
| ' <| |_| | _ \ _|\__ \ | | |   /
|_|\_\\___/|___/___|___/ |_| |_|_\

Explore your Kubernetes storage options
**************************************
Kubernetes Version Check:
Valid kubernetes version (v1.20.9+vmware.1) - OK

RBAC Check:
Kubernetes RBAC is enabled - OK

Aggregated Layer Check:
The Kubernetes Aggregated Layer is enabled - OK

W0130 14:17:16.937556   87541 warnings.go:70] storage.k8s.io/v1beta1 CSIDriver is deprecated in v1.19+, unavailable in v1.22+; use storage.k8s.io/v1 CSIDriver
Available Storage Provisioners:

csi.vsphere.xxxx.com:
    Can't find the CSI snapshot group api version.
    This is a CSI driver!
    (The following info may not be up to date. Please check with the provider for more information.)
    Provider:            vSphere
    Website:             https://github.com/kubernetes-sigs/vsphere-csi-driver
    Description:         A Container Storage Interface (CSI) Driver for VMware vSphere
    Additional Features: Raw Block,<br/><br/>Expansion (Block Volume),<br/><br/>Topology Aware (Block Volume)

    Storage Classes:
      * sc2-01-vc16c01-wcp-mgmt

    To perform a FIO test, run-
      ./kubestr fio -s <storage class>

You can run storage tests using kubestr and it uses FIO for generating IOs. For example this is how you can run a basic storage test.

% KUBECONFIG=gc.kubeconfig kubestr fio -s sc2-01-vc16c01-wcp-mgmt -z 10G
PVC created kubestr-fio-pvc-zvdhr
Pod created kubestr-fio-pod-kdbs5
Running FIO test (default-fio) on StorageClass (sc2-01-vc16c01-wcp-mgmt) with a PVC of Size (10G)
Elapsed time- 29.290421119s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=3987.150391 BW(KiB/s)=15965
iops: min=3680 max=4274 avg=3992.034424
bw(KiB/s): min=14720 max=17096 avg=15968.827148

JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=3562.628906 BW(KiB/s)=14267
iops: min=3237 max=3750 avg=3565.896484
bw(KiB/s): min=12950 max=15000 avg=14264.862305

JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=2988.549316 BW(KiB/s)=383071
iops: min=2756 max=3252 avg=2992.344727
bw(KiB/s): min=352830 max=416256 avg=383056.187500

JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=2754.796143 BW(KiB/s)=353151
iops: min=2480 max=2992 avg=2759.586182
bw(KiB/s): min=317440 max=382976 avg=353242.781250

Disk stats (read/write):
sdd: ios=117160/105647 merge=0/1210 ticks=2100090/2039676 in_queue=4139076, util=99.608589%
- OK

As you can see, a PVC of 10G, a FIO pod will be created, and this will be used for the FIO test. Once the test is complete, the PVC and FIO pod will be deleted automatically.

I hope it was useful. Cheers!

Saturday, December 18, 2021

vSphere with Tanzu using NSX-T - Part13 - Export WCP admin kubeconfig

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster
Part11 - Troubleshooting TKC
Part12 - Deploy application on TKC and access it

This article shows the steps to export WCP admin kubeconfig file from the supervisor control plane VM. This is the admin kubeconfig file that can be used to manage the Supervisor/ WCP K8s cluster.

Step1: SSH as root to the vCenter server.

Step2: Run the script /usr/lib/vmware-wcp/decryptK8Pwd.py and make a note of the IP and PWD.

Step3: SSH as root to the IP that you noted down from previous step, and then provide the password that you got from step2.

Step4: You can now copy the admin kubeconfig file from /etc/kubernetes/admin.conf file to your local machine. Make sure to modify the field server: https://127.0.0.1:6443 in your local admin.conf file to the IP that you got from step2 (server: https://IP_from_step2:6443).

Note: If you are managing multiple WCP clusters, you can merge all the kubeconfig files. Refer this blog by Jacob Tomlinson for more details.

Hope it was useful. Cheers!

Friday, November 5, 2021

vSphere with Tanzu using NSX-T - Part12 - Deploy application on TKC and access it

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster
Part11 - Troubleshooting TKC

This article walks you though the steps to deploy an application on Tanzu Kubernetes Cluster (TKC) and how to access it. I will try to explain how this all works under the hood.

Here I have a TKC cluster as shown below:

% KUBECONFIG=gc.kubeconfig kg nodes
NAME                               STATUS   ROLES                  AGE   VERSION
gc-control-plane-pwngg             Ready    control-plane,master   49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 49d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 49d   v1.20.9+vmware.1

% KUBECONFIG=gc.kubeconfig kg nodes -o wide
NAME                               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION       CONTAINER-RUNTIME
gc-control-plane-pwngg             Ready    control-plane,master   49d   v1.20.9+vmware.1   172.29.21.194   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.195   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.196   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 49d   v1.20.9+vmware.1   172.29.21.197   <none>        VMware Photon OS/Linux   4.19.191-4.ph3-esx   containerd://1.4.6

01 Create a namespace

% KUBECONFIG=gc.kubeconfig k create ns webserver
namespace/webserver created

% KUBECONFIG=gc.kubeconfig kg ns
NAME                           STATUS   AGE
default                        Active   48d
kube-node-lease                Active   48d
kube-public                    Active   48d
kube-system                    Active   48d
vmware-system-auth             Active   48d
vmware-system-cloud-provider   Active   48d
vmware-system-csi              Active   48d
webserver                      Active   10s

02 Deploy nginx application

Following is the nginx-deployment.yaml spec to deploy nginx application:

apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
    matchLabels:
      run: my-nginx
replicas: 2
template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80

You can apply the yaml file as below:

% KUBECONFIG=gc.kubeconfig k apply -f nginx-deployment.yaml -n webserver
deployment.apps/my-nginx created

% KUBECONFIG=gc.kubeconfig kg deploy -n webserver
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
my-nginx   0/2     0            0           3m3s

% KUBECONFIG=gc.kubeconfig kg events -n webserver
LAST SEEN   TYPE      REASON              OBJECT                           MESSAGE
26s         Warning   FailedCreate        replicaset/my-nginx-74d7c6cb98   Error creating: pods "my-nginx-74d7c6cb98-" is forbidden: PodSecurityPolicy: unable to admit pod: []
3m10s       Normal    ScalingReplicaSet   deployment/my-nginx              Scaled up replica set my-nginx-74d7c6cb98 to 2

You can see that the pods failed to get created due to PodSecurityPolicy. Following is the psp.yaml spec to create ClusterRole and ClusterRoleBinding.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: psp:privileged
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames:
- vmware-system-privileged
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: all:psp:privileged
roleRef:
kind: ClusterRole
name: psp:privileged
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
name: system:serviceaccounts
apiGroup: rbac.authorization.k8s.io

Apply the yaml file as shown below:

% KUBECONFIG=gc.kubeconfig k apply -f psp.yaml
clusterrole.rbac.authorization.k8s.io/psp:privileged created
clusterrolebinding.rbac.authorization.k8s.io/all:psp:privileged created

Now, in few minutes you can see the deployment will get successful and two nginx pods will get deployed in the webserver namespace.

% KUBECONFIG=gc.kubeconfig kg deploy -n webserver
NAME READY UP-TO-DATE AVAILABLE AGE
my-nginx 2/2 2 2 80m

% KUBECONFIG=gc.kubeconfig kg pods -n webserver -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP                NODE                               NOMINATED NODE   READINESS GATES
my-nginx-74d7c6cb98-lzghr   1/1     Running   0          67m   192.168.213.132   gc-workers-wrknn-f675446b6-rsf6n   <none>           <none>
my-nginx-74d7c6cb98-s59dt   1/1     Running   0          67m   192.168.67.196    gc-workers-wrknn-f675446b6-f6zqs   <none>           <none>

03 Access the application

You can access the application in many ways depending on the usecase.

---Port-forward---

% KUBECONFIG=gc.kubeconfig kubectl port-forward deployment/my-nginx -n webserver 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080

The deployment is port-forwarded now. If you open another terminal and do curl localhost:8080, you can see the nginx webpage.

% curl localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

You can also open a web browser with http://localhost:8080/ and you will get the same nginx webpage. Well port-forwarding is fine in a local dev test scenario, but you might not want to do it in a production setup. You will need to create a service that connects the application and to access it.

Services

There are 3 types of services in Kubernetes.

NodePort: Similar to port forwarding where a port on the worker node will be forwarded to the target port of the pod where the application is running.
ClusterIP: This is useful if you want to access the application from within the cluster.
LoadBalancer: This is used to provide access to external users. In my case, NSX-T will be providing this access.

---Service NodePort---

Following is the yaml spec file for service of type nodeport:

% cat nginx-service-np.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
    run: my-nginx
spec:
type: NodePort
ports:
- targetPort: 80
    port: 80
    protocol: TCP
selector:
    run: my-nginx

Apply the above yaml file.

% KUBECONFIG=gc.kubeconfig k apply -f nginx-service-np.yaml -n webserver
service/my-nginx created

% KUBECONFIG=gc.kubeconfig kg svc -n webserver
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.111.182.155   <none>        80:30741/TCP   4s

% KUBECONFIG=gc.kubeconfig kg ep -n webserver
NAME       ENDPOINTS                              AGE
my-nginx   192.168.213.132:80,192.168.67.196:80   32m

As you can see, a service (my-nginx) of type NodePort is created. And, now the application should be accessible on port 30741 of any worker node. To verify it, first we need connectivity to the worker node IP. For connecting to worker nodes, we need to have a jumpbox pod deployed on the supervisor namespace. Once we have a jumpbox pod deployed on the sv namespace, we can ssh to TKC nodes from the jumpbox pod. You can follow my previous post to see how to create a jumpbox pod. Here is the link to VMware documentation for how to SSH to TKC nodes.

% KUBECONFIG=sv.kubeconfig k exec -it jumpbox -- sh
sh-4.4#
sh-4.4# curl 172.29.21.197:30741
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
sh-4.4#

---Service ClusterIP---

Service of type ClusterIP will be accessible within the TKC. So, I will need to deploy a jumpbox pod/ test pod within the TKC and connect from there. First let me edit the svc my-nginx from NodePort to type ClusterIP.

% KUBECONFIG=gc.kubeconfig k edit svc my-nginx -n webserver
service/my-nginx edited

% KUBECONFIG=gc.kubeconfig kg svc -n webserver
NAME       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
my-nginx   ClusterIP   10.111.182.155   <none>        80/TCP    39m

I have already deploy a pod inside the TKC. As you can see, dnsutils is the pod that is deployed in the default namespace. We will connect to this pod and from there we can curl to the Cluster-IP of my-nginx service.

% KUBECONFIG=gc.kubeconfig kg pods
NAME       READY   STATUS    RESTARTS   AGE
dnsutils   1/1     Running   1          105m

% KUBECONFIG=gc.kubeconfig k exec -it dnsutils -- sh
#
# curl 10.111.182.155:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
#

Note: This service of type ClusterIP can be accessed only within the TKC, and not externally!

---Service LoadBalancer---

This is the way to expose your service to external users. In this case NSX-T will provide the external IP which will then internally forwarded to nginx pods through the my-nginx service.

I have edited the service my-nginx from type ClusterIP to LoadBalancer.

% KUBECONFIG=gc.kubeconfig k edit svc my-nginx -n webserver
service/my-nginx edited

% KUBECONFIG=gc.kubeconfig kg svc -n webserver
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   LoadBalancer   10.111.182.155   <pending>     80:32398/TCP   56m

% KUBECONFIG=gc.kubeconfig kg svc -n webserver
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx LoadBalancer 10.111.182.155 10.186.148.170 80:32398/TCP 56m

You can see that now the service has got an external ip. And, the end points of the service are as shown below, which is basically the nginx pod IPs.

% KUBECONFIG=gc.kubeconfig kg ep -n webserver
NAME ENDPOINTS AGE
my-nginx 192.168.213.132:80,192.168.67.196:80 58m

% curl 10.186.148.170
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

I could also use the external IP 10.186.148.170 in a web browser to access the nginx webpage.

Now lets have a look at what is in the supervisor namespace. This TKC is created under a supervisor namespace "vineetha-test04-deploy".

% kubectl get svc -n vineetha-test04-deploy
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
gc-ba320a1e3e04259514411   LoadBalancer   172.28.5.217    10.186.148.170   80:31143/TCP     40h
gc-control-plane-service   LoadBalancer   172.28.9.37     10.186.149.120   6443:31639/TCP   51d

% kubectl get ep -n vineetha-test04-deploy
NAME                       ENDPOINTS                                                     AGE
gc-ba320a1e3e04259514411   172.29.21.195:32398,172.29.21.196:32398,172.29.21.197:32398   40h
gc-control-plane-service   172.29.21.194:6443                                            51d

So what you are seeing is, for a service of type loadbalancer created inside the TKC, a service of type loadbalancer (gc-ba320a1e3e04259514411) will be automatically created under the supervisor namespace, and the its endpoints are the IP address of TKC worker nodes.

On the NSX-T side you can see the LB for my supervisor namespace, virtual servers in it, and server pool members in the virtual server.

I hope it was useful. Cheers!

Saturday, September 25, 2021

vSphere with Tanzu using NSX-T - Part11 - Troubleshooting Tanzu Kubernetes Clusters

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster

In this article, we will go through some basic kubectl commands that may help you in troubleshooting Tanzu Kubernetes clusters. I have noticed there are cases where the guest TKCs are getting stuck at creating or updating phases.

List all TKCs that are stuck at creating/ updating:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating

On the newer versions of WCP, you may not see the TKC phase (creating/ updating/ running) in the kubectl output. I am using the following custom alias for it.

alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'

You can add it to your ~/.zshrc file and relaunch the terminal. Example usage:

% kgtkc | grep updating
c1nsxtest1-sla gc updating 2021-01-21T08:23:37Z v1.19.7+vmware.1-tkg.2.f52f85a 3 3
w2cei-sep20 gc updating 2021-09-16T17:48:07Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4

For TKCs that are in creating phase, some of the most common reasons might be due to lack of sufficient resources to provision the nodes, or it maybe waiting for IP allocation, etc. For the TKCs that are stuck at updating phase, it may be due to reconciliation issues, newly provisioned nodes might be waiting for IP address, old nodes may be stuck at drain phase, nodes might be in notready state, specific OVA version is not available in the contnet library, etc. You can try the following kubectl commands to get more insight into whats happening:

See events in a namespace:
kubectl get events -n <namespace>

See all events:
kubectl get events -A

Watch events in a namespace:
kubectl get events -n <namespace> -w

List the Cluster API resources supporting the clusters in the current namespace:
kubectl get cluster-api -n <namespace>

Describe TKC:
kubectl describe tkc <tkc_name> -n <namespace>

List TKC virtual machines in a namespace:
kubectl get vm -n <namespace>

List TKC virtual machines in a namespace with its IP:
kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'

List all nodes of a cluster:
kubectl get nodes -o wide

List all pods that are not running:
kubectl get pods -A | grep -vi running

List health status of different cluster components:
kubectl get --raw '/healthz?verbose'

% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

List all CRDs installed in your cluster and their API versions:
kubectl api-resources -o wide --sort-by="name"

List available Tanzu Kubernetes releases:
kubectl get tanzukubernetesreleases

List available virtual machine images:
kubectl get virtualmachineimages

List terminating namespaces:
kubectl get ns --field-selector status.phase=Terminating

You can ssh to the Tanzu Kubernetes cluster nodes as the system user following this:
https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html

Here is an example where I have a TKC under namespace: vineetha-test05-deploy

% kubectl get tkc -n vineetha-test05-deploy
NAME   CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
gc     1               3        v1.20.9---vmware.1-tkg.1.a4cee5b   4d5h   True    True             [1.21.2+vmware.1-tkg.1.ee25d55]

% kubectl get vm -n vineetha-test05-deploy -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
[
{
    "namespace": "vineetha-test05-deploy",
    "name": "gc-control-plane-ttkmt",
    "internalIP": "172.29.4.194"
},
{
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-d286z",
    "internalIP": "172.29.4.195"
},
{
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-hwr8b",
    "internalIP": "172.29.4.197"
},
{
    "namespace": "vineetha-test05-deploy",
    "name": "gc-workers-7fcql-6f984fdd59-r99x7",
    "internalIP": "172.29.4.196"
}
]

Given below is the yaml file that deploys a pod named jumpbox under the supervisor namespace vineetha-test05-deploy, and from there you can ssh to the TKC nodes.

apiVersion: v1
kind: Pod
metadata:
name: jumpbox
namespace: vineetha-test05-deploy           #REPLACE
spec:
containers:
- image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
      - mountPath: "/root/ssh"
        name: ssh-key
        readOnly: true
    resources:
      requests:
        memory: 2Gi
volumes:
    - name: ssh-key
      secret:
        secretName: gc-ssh     #REPLACE

Once you apply the above yaml, you can see the jumpbox pod.

% kubectl get pod -n vineetha-test05-deploy
NAME      READY   STATUS    RESTARTS   AGE
jumpbox   1/1     Running   0          22m

Now, you can connect to the TKC node with its internal IP.

% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
16:50:34 up 4 days, 5:49, 0 users, load average: 2.14, 0.97, 0.65

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt

You can check the status of control plane pods using crictl ps.

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
bde228417c55a       9000c334d9197       4 days ago          Running             guest-cluster-auth-service     0                   d7abf3db8670d
bc4b8c1bf0e33       a294c1cf07bd6       4 days ago          Running             metrics-server                 0                   2665876cf939e
46a94dcf02f3e       92cb72974660c       4 days ago          Running             coredns                        0                   7497cdf3269ab
f7d32016d6fb7       f48f23686df21       4 days ago          Running             csi-resizer                    0                   b887d394d4f80
ef80f62f3ed65       2cba51b244f27       4 days ago          Running             csi-provisioner                0                   b887d394d4f80
64b570add2859       4d2e937854849       4 days ago          Running             liveness-probe                 0                   b887d394d4f80
c0c1db3aac161       d032188289eb5       4 days ago          Running             vsphere-syncer                 0                   b887d394d4f80
e4df023ada129       e75228f70c0d6       4 days ago          Running             vsphere-csi-controller         0                   b887d394d4f80
e79b3cfdb4143       8a857a48ee57f       4 days ago          Running             csi-attacher                   0                   b887d394d4f80
96e4af8792cd0       b8bffc9e5af52       4 days ago          Running             calico-kube-controllers        0                   b5e467a43b34a
23791d5648ebb       92cb72974660c       4 days ago          Running             coredns                        0                   9bde50bbfb914
0f47d11dc211b       ab1e2f4eb3589       4 days ago          Running             guest-cluster-cloud-provider   0                   fde68175c5d95
5ddfd46647e80       4d2e937854849       4 days ago          Running             liveness-probe                 0                   1a88f26173762
578ddeeef5bdd       e75228f70c0d6       4 days ago          Running             vsphere-csi-node               0                   1a88f26173762
3fcb8a287ea48       9a3d9174ac1e7       4 days ago          Running             node-driver-registrar          0                   1a88f26173762
91b490c14d085       dc02a60cdbe40       4 days ago          Running             calico-node                    0                   35cf458eb80f8
68dbbdb779484       f7ad2965f3ac0       4 days ago          Running             kube-proxy                     0                   79f129c96e6e1
ef423f4aeb128       75bfe47a404bb       4 days ago          Running             docker-registry                0                   752724fbbcd6a
26dd8e1f521f5       9358496e81774       4 days ago          Running             kube-apiserver                 0                   814e5d2be5eab
62745db4234e2       ab8fb8e444396       4 days ago          Running             kube-controller-manager        0                   94543f93f7563
f2fc30c2854bd       9aa6da547b7eb       4 days ago          Running             etcd                           0                   f0a756a4cdc09
b8038e9f90e15       212d4c357a28e       4 days ago          Running             kube-scheduler                 0                   533a44c70e86c

You can check the status of kubelet and containerd services:
sudo systemctl status kubelet.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$
<udo systemctl status kubelet.service
WARNING: terminal is not fully functional
- (press RETURN)● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
     Docs: http://kubernetes.io/docs/
Main PID: 2234 (kubelet)
    Tasks: 16 (limit: 4728)
   Memory: 88.6M
   CGroup: /system.slice/kubelet.service
           └─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>

Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785    >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045    >

sudo systemctl status containerd.service

vmware-system-user@gc-control-plane-ttkmt [ ~ ]$
<udo systemctl status containerd.service
WARNING: terminal is not fully functional
- (press RETURN)● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
   Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
     Docs: https://containerd.io
Main PID: 1783 (containerd)
    Tasks: 386 (limit: 4728)
   Memory: 639.3M
   CGroup: /system.slice/containerd.service
           ├─ 1783 /usr/local/bin/containerd
           ├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
           ├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>

If you have issues related to the provisioning/ deployment of TKC, you can check the logs present in the CP node:

vmware-system-user@gc-control-plane-ttkmt [ /var/log ]$ ls
audit                  devicelist sa                  vmware-vgauthsvc.log.0
auth.log               journal     sgidlist            vmware-vmsvc-root.log
btmp                   kubernetes stigreport.log      vmware-vmtoolsd-root.log
cloud-init.log         lastlog     suidlist            wtmp
cloud-init-output.log pods        tallylog
containers             private     vmware-imc
cron                   rpmcheck    vmware-network.log

Following is a great VMware blog series/ videos covering the different resources involved in the deployment process and troubleshooting aspects of TKCs that are provisioned using the TKG service running on the supervisor cluster.

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-1

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-2

https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-3

Hope it was useful. Cheers!

Pages

Saturday, May 21, 2022

Sunday, January 30, 2022

Saturday, December 18, 2021

Friday, November 5, 2021

01 Create a namespace

02 Deploy nginx application

03 Access the application

---Port-forward---

Services

---Service NodePort---

---Service ClusterIP---

---Service LoadBalancer---

Saturday, September 25, 2021