vineethac.blogspot.com

Sunday, July 17, 2022

vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues

In this article, we will take a look at troubleshooting some of the content library related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.

Case 1:

TKC (guest K8s cluster) deployments failing as VMs were not getting deployed. You can see Failed to deploy OVF package error in the VC UI. This was due to error A general system error occurred: HTTP request error: cannot authenticate SSL certificate for host wp-content.vmware.com while syncing content library.

Following is a sample log for this issue from the vmop-controller-manger:

Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error

This can be resolved by just editing the content library and accepting new certificate thumbprint.

Case 2:

Missing TKRs. Even though CL is present in the VC and will have all required OVF Templates, on the supervisor cluster TKR resources will be missing/ not found.

❯ kubectl get tkr
No resources found

This could happen if there are duplicate content libraries present in the VC with same Subscription URL. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronize the CL.

If this doesn't resolve the issue, try to delete and recreate the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.

You may also verify the vmware-system-vmop-controller-manager pod logs and capw-controller-manager pod logs. Check if those pods are running, or getting continuously restarted. If required you may restart those pods.

Case 3:

TKC deployments failing as VMs were not getting deployed. Sample vmop-controller-manger logs given below:

E0803 18:51:30.638787       1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638821       1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.638851       1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"

E0803 18:51:30.639301       1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"

This could be resolved by restarting the cm-inventory service on all nsx-t manager nodes. Following are the commands to restart cm-inventory service on NSX-T manager nodes:

get service cm-inventory  
restart service cm-inventory

Case 4:

Sometimes in the WCP K8s layer you will notice some stale contentsources object entries. Contentsources are the corresponding objects of content libraries in K8s layer. Due to some reasons/ requirements you might have created multiple content libraries, and you may have delete some of them at later point of time from the vCenter, but they may not be removed properly from the WCP K8s layer and thats how these stale contentsources objects are found. You can use PowerCLI to list the current content libraries present in the VC, compare it with the contentsources and remove the stale entries.

> Get-ContentLibrary | select Name,Id | fl

Name : wdc-01-vc18c01-wcp
Id   : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d

> kg contentsources
NAME                                   AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db   173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d   321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662   242d
75e0668c-0cdc-421e-965d-fd736187cc57   173d
818c8700-efa4-416b-b78f-5f22e9555952   173d
9abbd108-aeb3-4b50-b074-9e6c00473b02   173d
a6cd1685-49bf-455f-a316-65bcdefac7cf   173d
acff9a91-0966-4793-9c3a-eb5272b802bd   242d
fcc08a43-1555-4794-a1ae-551753af9c03   173d

In the above sample case you can see multiple contentsource objects, but there is only one content library. So you can delete all the contentsource objects, except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.

Hope it was useful. Cheers!

Saturday, June 25, 2022

GitOps using Argo CD - Part1

In this article we will see how to use Git and Argo CD to deploy/ manage applications on your Kubernetes cluster. Before that, what is GitOps? Simply put, Operations driven using Git and CD tools! Following are the major components of GitOps:

Infrastructure as code (IaC)
Merge/ pull requests as change agent
Continuous Delivery tool (Example: Argo CD)

Basically you keep all your application deployment manifests in a Git repository. And, if you like to make changes to your application, you create a merge/ pull request. Once it is approved and merged to the main branch a continuous delivery/ deployment tool like Argo CD will identify that and deploys the latest change to the target Kubernetes cluster. Here I am using a Tanzu Kubernetes Cluster with 1 control plane node and 3 worker nodes. I will be deploying Argo CD as well as my application on this K8s cluster.

❯ kubectl get node
NAME                                STATUS   ROLES                  AGE   VERSION
gc-control-plane-rhpmq              Ready    control-plane,master   34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-692n5   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-jfzrs   Ready    <none>                 34d   v1.21.6+vmware.1
gc-workers-kfx7q-589888f77b-xvjsh   Ready    <none>                 34d   v1.21.6+vmware.1

Lets create a namespace first.

❯ kubectl create namespace argocd
namespace/argocd created

❯ kubectl apply -f https://gist.githubusercontent.com/vineethac/dafa5b47afd674a1a9f7be2ce773a2bd/raw/4591e837098043b8095a5c48614e1c94b5ca2b44/tkg-psp.yml
clusterrole.rbac.authorization.k8s.io/psp:privileged created
clusterrolebinding.rbac.authorization.k8s.io/all:psp:privileged created

Following yaml file will install Argo CD:

❯ kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

❯ kubectl get all -n argocd
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/argocd-application-controller-0                     1/1     Running   0          20h
pod/argocd-applicationset-controller-5f7d8fffb7-82xgp   1/1     Running   0          20h
pod/argocd-dex-server-75f7cff9cd-7vc64                  1/1     Running   0          20h
pod/argocd-notifications-controller-69bf646f87-8bt5n    1/1     Running   0          20h
pod/argocd-redis-748569f956-bskfw                       1/1     Running   0          19h
pod/argocd-repo-server-8699756b5d-7qmx2                 1/1     Running   0          20h
pod/argocd-server-6dd9cd7964-gbfm4                      1/1     Running   0          20h

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-applicationset-controller          ClusterIP   10.106.103.231   <none>        7000/TCP,8080/TCP            20h
service/argocd-dex-server                         ClusterIP   10.103.79.207    <none>        5556/TCP,5557/TCP,5558/TCP   20h
service/argocd-metrics                            ClusterIP   10.105.254.212   <none>        8082/TCP                     20h
service/argocd-notifications-controller-metrics   ClusterIP   10.97.254.140    <none>        9001/TCP                     20h
service/argocd-redis                              ClusterIP   10.97.244.161    <none>        6379/TCP                     20h
service/argocd-repo-server                        ClusterIP   10.101.181.242   <none>        8081/TCP,8084/TCP            20h
service/argocd-server                             ClusterIP   10.105.76.149    <none>        80/TCP,443/TCP               20h
service/argocd-server-metrics                     ClusterIP   10.100.168.241   <none>        8083/TCP                     20h

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/argocd-applicationset-controller   1/1     1            1           20h
deployment.apps/argocd-dex-server                  1/1     1            1           20h
deployment.apps/argocd-notifications-controller    1/1     1            1           20h
deployment.apps/argocd-redis                       1/1     1            1           20h
deployment.apps/argocd-repo-server                 1/1     1            1           20h
deployment.apps/argocd-server                      1/1     1            1           20h

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/argocd-applicationset-controller-5f7d8fffb7   1         1         1       20h
replicaset.apps/argocd-dex-server-75f7cff9cd                  1         1         1       20h
replicaset.apps/argocd-notifications-controller-69bf646f87    1         1         1       20h
replicaset.apps/argocd-redis-748569f956                       1         1         1       19h
replicaset.apps/argocd-repo-server-8699756b5d                 1         1         1       20h
replicaset.apps/argocd-server-6dd9cd7964                      1         1         1       20h

NAME                                             READY   AGE
statefulset.apps/argocd-application-controller   1/1     20h

Lets port forward, so that you can access the Argo CD web UI.

❯ kubectl port-forward -n argocd service/argocd-server 8080:443
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Now you can access Argo CD in your web browser at https://localhost:8080/

username: admin
password: you can decode it from the following secret

❯ kubectl get secret argocd-initial-admin-secret -n argocd -o yaml
apiVersion: v1
data:
password: TTFSRXJJZ0NKbHc2Y2JINA==
kind: Secret
metadata:
creationTimestamp: "2022-06-22T13:00:14Z"
name: argocd-initial-admin-secret
namespace: argocd
resourceVersion: "39224285"
uid: d3b7c82e-0c92-4418-95eb-95a73fe674b6
type: Opaque

❯ echo TTFSRXJJZ0NKbHc2Y2JINA== | base64 --decode

Next step is to create a repository and add your application yaml files to it. You will also need to create a Argo CD application yaml file, push it to the repository, and then apply the same application yaml file to your cluster.

Following is a screenshot of my repo:

We have the application.yaml file and then inside the dev folder I have two yaml files.

Now, we also have the application yaml file. Make sure to paste your git repo url in the repoURL field.

Lets apply the application yaml manifest on to the Kubernetes cluster and that will connect Argo CD with your git repo.

Note: If you using a private repo, then you need to add your repository to Argo CD first and connect with respective credentials.

> kubectl apply -f /Users/vineetha/myrepo/argocd/application.yaml

Once the application yaml is applied, after few seconds you can see nginx pod and svc getting deployed automatically under myapp namespace.

❯ kubectl get pods,deployment,svc -n myapp
NAME                            READY   STATUS    RESTARTS   AGE
pod/my-nginx-74d7c6cb98-tzdfp   1/1     Running   0          2d

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-nginx   1/1     1            1           2d

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/my-nginx   ClusterIP   10.109.67.35   <none>        80/TCP    2d

From now on if you like to make changes to your application, say, you want to scale the replicas from 1 to 3, you have to create a merge request or pull request to the repository with the respective change, and once its approved and merged to the main branch Argo CD will automatically detect it and pull it and apply it to your Kubernetes cluster.

Hope it was useful. Cheers!

Reference video by Nana https://twitter.com/Njuchi_

Saturday, June 11, 2022

Working with Kubernetes using Python - Part 04 - Get namespaces

Following code snipet uses Python client for the kubernetes API to get namespace details from a given context:

from kubernetes import client, config
import argparse


def load_kubeconfig(context_name):
    config.load_kube_config(context=f"{context_name}")
    v1 = client.CoreV1Api()
    return v1


def get_all_namespace(v1):
    print("Listing namespaces with their creation timestamp, and status:")
    ret = v1.list_namespace()
    for i in ret.items:
        print(i.metadata.name, i.metadata.creation_timestamp, i.status.phase)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    args = parser.parse_args()

    context = args.context
    v1 = load_kubeconfig(context)
    get_all_namespace(v1)


if __name__ == "__main__":
    main()

Saturday, May 21, 2022

vSphere with Tanzu using NSX-T - Part15 - Working with etcd on TKC with one control plane

In this article, we will see how to work with etcd database of a Tanzu Kubernetes Cluster (TKC) with one control plane node and perform some basic operations. Following is a TKC with one control plane node and three worker nodes:

Get K8s cluster nodes

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE     VERSION
gc-control-plane-6g9gk              Ready    control-plane,master   3d7h    v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-n5qp8   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-wds2m   Ready    <none>                 7d19h   v1.21.6+vmware.1
gc-workers-rmgkm-78cf46d595-z2wvt   Ready    <none>                 7d19h   v1.21.6+vmware.1

Get the etcd pod and describe it

❯ gcc kg pod -A | grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          3d7h
❯
❯ gcc kd pod etcd-gc-control-plane-6g9gk -n kube-system
Name:                 etcd-gc-control-plane-6g9gk
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gc-control-plane-6g9gk/100.68.36.38
Start Time:           Tue, 19 Jul 2022 11:14:25 +0530
Labels:               component=etcd
                      tier=control-plane
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
                      kubernetes.io/config.hash: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.mirror: 6e7bc05d35060112913f78af2043683f
                      kubernetes.io/config.seen: 2022-07-19T05:44:19.416549595Z
                      kubernetes.io/config.source: file
                      kubernetes.io/psp: vmware-system-privileged
Status:               Running
IP:                   100.68.36.38
IPs:
  IP:           100.68.36.38
Controlled By:  Node/gc-control-plane-6g9gk
Containers:
  etcd:
    Container ID:  containerd://253c7b25bd60ea78dfccad52d03534785f0d7b7a1fa7105dbd55d7727f8785c3
    Image:         localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    Image ID:      sha256:78661ebbe1adaee60336a0f8ff031c4537ff309ef51feab6e840e7dbb3cbf47d
    Port:          <none>
    Host Port:     <none>
    Command:
      etcd
      --advertise-client-urls=https://100.68.36.38:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://100.68.36.38:2380
      --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
      --initial-cluster-state=existing
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://100.68.36.38:2380
      --name=gc-control-plane-6g9gk
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Running
      Started:      Tue, 19 Jul 2022 11:14:27 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     100Mi
    Liveness:     http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:      http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment:  <none>
    Mounts:
      /etc/kubernetes/pki/etcd from etcd-certs (rw)
      /var/lib/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate
  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute op=Exists
Events:            <none>

Exec into the etcd pod and run etcdctl commands

You can use etcdctl and you need to provide cacert, cert, and key details. All these info you will get while describing the etcd pod.

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
❯
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint health --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --write-out json"
[{"endpoint":"127.0.0.1:2379","health":true,"took":"9.689387ms"}]
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key -w json"
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":4073335150581888229,"member_id":14250600431682432472,"revision":2153804,"raft_term":11},"version":"3.4.13","dbSize":24719360,"leader":14250600431682432472,"raftIndex":2429139,"raftTerm":11,"raftAppliedIndex":2429139,"dbSizeInUse":2678784}}]

Snapshot etcd

❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot save snapshotdb-$(date +%d-%m-%y)"
{"level":"info","ts":1658651541.2698474,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb-24-07-22.part"}
{"level":"info","ts":"2022-07-24T08:32:21.277Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1658651541.2771788,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-07-24T08:32:21.594Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1658651541.621639,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"25 MB","took":0.344746859}
{"level":"info","ts":1658651541.621852,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb-24-07-22"}
Snapshot saved at snapshotdb-24-07-22
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh
# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	proc  root  run  sbin  snapshotdb-24-07-22  srv  sys  tmp  usr	var
#
# exit
❯
❯ gcc k exec -it etcd-gc-control-plane-6g9gk -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot status snapshotdb-24-07-22 -w table"
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+
❯

Copy snapshot file from etcd pod to local machine

Note: Even though there was an error while copying the snapshot file from the pod to local machine, you can see the file was successfully copied and I also verified the snapshot file status using etcdctl. Every field (hash, total_keys, etc.) matches with that of the source file.

❯ gcc kubectl cp kube-system/etcd-gc-control-plane-6g9gk:/snapshotdb-24-07-22 snapshotdb-24-07-22
tar: Removing leading `/' from member names
error: unexpected EOF
❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 -w table
Deprecated: Use `etcdutl snapshot status` instead.

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b0910e83 |  2434362 |       1580 |      25 MB |
+----------+----------+------------+------------+       
❯

Restore etcd

We can restore the etcd snapshot using etcdctl from the TKC control plane node. Inorder to connect to the control plane VM, we need to create a jumpbox pod under the corresponding supervisor namespace.

So, first copy the snapshot file from local machine to jumpbox pod.

❯ ls snapshotdb-24-07-22
snapshotdb-24-07-22
❯ kubectl cp snapshotdb-24-07-22 vineetha-test04-deploy/jumpbox01:/
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- sh
sh-4.4# su
root [ / ]# ls
bin   dev  home  lib64  mnt   root  sbin                 srv  tmp  var
boot  etc  lib   media  proc  run   snapshotdb-24-07-22  sys  usr
root [ / ]#

Copy the snapshot file from jumpbox pod to control plane node.

❯ gcc kg po -A -o wide| grep etcd
kube-system                    etcd-gc-control-plane-6g9gk                         1/1     Running            0          166m   100.68.36.38   gc-control-plane-6g9gk              <none>           <none>
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- scp /snapshotdb-24-07-22 vmware-system-user@100.68.36.38:/tmp
snapshotdb-24-07-22                           100%   20MB 126.1MB/s   00:00
❯
❯ k exec -it jumpbox01 -n vineetha-test04-deploy -- /usr/bin/ssh vmware-system-user@100.68.36.38
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Sun Jul 24 13:02:39 2022 from 100.68.35.210
 13:14:29 up 5 days,  7:38,  0 users,  load average: 0.98, 0.53, 0.31

26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ sudo su
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# cd /tmp/

Install etcd on the control plane node, so that we get to access etcdctl utility.

root [ /tmp ]# tdnf install etcd
root [ /tmp ]# ETCDCTL_API=3 etcdctl member list --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
c5c44d96f675add8, started, gc-control-plane-6g9gk, https://100.68.36.38:2380, https://100.68.36.38:2379, false
root [ /tmp ]#
root [ /tmp ]# ETCDCTL_API=3 etcdctl snapshot status snapshotdb-24-07-22 --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
Deprecated: Use `etcdutl snapshot status` instead.

b0910e83, 2434362, 1580, 25 MB
root [ /tmp ]# hostname
gc-control-plane-6g9gk
root [ /tmp ]# ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot restore /tmp/snapshotdb-24-07-22 --data-dir=/var/lib/etcd-backup --skip-hash-check=true
Deprecated: Use `etcdutl snapshot restore` instead.

2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:251	restoring snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/snapshot/v3_snapshot.go:257\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/command/snapshot_command.go:128\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/share/gocode/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/usr/src/photon/BUILD/etcd-3.5.1/etcdctl/main.go:59\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
2022-07-24T13:20:44Z	info	membership/store.go:141	Trimming membership information from the backend...
2022-07-24T13:20:44Z	info	membership/cluster.go:421	added member	{"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]}
2022-07-24T13:20:44Z	info	snapshot/v3_snapshot.go:272	restored snapshot	{"path": "/tmp/snapshotdb-24-07-22", "wal-dir": "/var/lib/etcd-backup/member/wal", "data-dir": "/var/lib/etcd-backup", "snap-dir": "/var/lib/etcd-backup/member/snap"}
root [ /var/lib ]#

We have restored the database snapshot to a new location: --data-dir=/var/lib/etcd-backup. So we need to modify the etcd-data hostpath to path: /var/lib/etcd-backup in the etcd static pod manifest file (etcd.yaml). Copy the contents of etcd.yaml file.

root [ /var/lib ]# cd /etc/kubernetes/manifests/
root [ /etc/kubernetes/manifests ]# ls
etcd.yaml            kube-controller-manager.yaml  registry.yaml
kube-apiserver.yaml  kube-scheduler.yaml
root [ /etc/kubernetes/manifests ]# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://100.68.36.38:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://100.68.36.38:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://100.68.36.38:2380
    - --initial-cluster=gc-control-plane-6g9gk=https://100.68.36.38:2380,gc-control-plane-64lq5=https://100.68.36.34:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://100.68.36.38:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://100.68.36.38:2380
    - --name=gc-control-plane-6g9gk
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: localhost:5000/vmware.io/etcd:v3.4.13_vmware.22
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}

I was having difficulties in the terminal to modify it. So I copied the contents of etcd.yaml file locally, modified the path, removed the existing etcd.yaml file, created new etcd.yaml file, and pasted the modifed content in it.

root [ /etc/kubernetes/manifests ]# rm etcd.yaml
root [ /etc/kubernetes/manifests ]# vi etcd.yaml

<paste the above etcd.yaml file contents, with modified etcd-data hostPath, last part of the yaml will look like below:

 volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd-backup
      type: DirectoryOrCreate
    name: etcd-data
status: {}

>

Once the etcd.yaml is saved, after few seconds you can see that etcd pod will be running.

root [ /etc/kubernetes/manifests ]# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                           ATTEMPT             POD ID
92dad43a85ebc       78661ebbe1ada       1 second ago        Running             etcd                           0                   8258804cf17bb
5c704092eb4bb       25605c4ab20fe       10 seconds ago      Running             csi-resizer                    10                  1fa9feb732df5
495b5ff250cb4       05cfd9e3c3f22       10 seconds ago      Running             csi-provisioner                22                  1fa9feb732df5
c0da43af7f1d0       a145efcc3afb4       11 seconds ago      Running             vsphere-syncer                 10                  1fa9feb732df5
4e6d67dc16f4a       5cb2119a4d797       11 seconds ago      Running             kube-controller-manager        11                  3d1c266aa5e24
8710a7b6a8563       fa70d7ee973ad       11 seconds ago      Running             guest-cluster-cloud-provider   35                  3f3d5eb0929e5
4e0c1e2e72682       f18cde23836f5       11 seconds ago      Running             csi-attacher                   10                  1fa9feb732df5
bf6771ca4dc4d       a609b91a17410       11 seconds ago      Running             kube-scheduler                 11                  f50eada3f8127
05fa3f2f587e8       fa70d7ee973ad       3 hours ago         Exited              guest-cluster-cloud-provider   34                  3f3d5eb0929e5
888ba7ce34d92       3f6d2884f8105       3 hours ago         Running             kube-apiserver                 28                  c368bd9937f8b
6579ffffa5f53       382a8821c56e0       3 hours ago         Running             metrics-server                 21                  622c835648008
bdd0c760d0bc7       382a8821c56e0       3 hours ago         Exited              metrics-server                 20                  622c835648008
6063485d73e38       78661ebbe1ada       3 hours ago         Exited              etcd                           0                   71e6ae04726ad
259595a330d26       3f6d2884f8105       3 hours ago         Exited              kube-apiserver                 27                  c368bd9937f8b
5034f4ea18f1f       25605c4ab20fe       4 hours ago         Exited              csi-resizer                    9                   1fa9feb732df5
21de3c4850dc3       05cfd9e3c3f22       4 hours ago         Exited              csi-provisioner                21                  1fa9feb732df5
d623946b1d270       a145efcc3afb4       4 hours ago         Exited              vsphere-syncer                 9                   1fa9feb732df5
5f50b9e93d287       a609b91a17410       4 hours ago         Exited              kube-scheduler                 10                  f50eada3f8127
ec4e066f54fd6       5cb2119a4d797       4 hours ago         Exited              kube-controller-manager        10                  3d1c266aa5e24
f05fee251a700       f18cde23836f5       4 hours ago         Exited              csi-attacher                   9                   1fa9feb732df5
d3577bf8477d0       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   1fa9feb732df5
d30ba30b8c203       4251b7012fd43       5 days ago          Running             vsphere-csi-controller         0                   1fa9feb732df5
1816ed6aada5f       02abc4bd595a0       5 days ago          Running             guest-cluster-auth-service     0                   bff70bcd389be
3e014a6745f5a       b0f879c3b53ce       5 days ago          Running             liveness-probe                 0                   66e5ca3abfe5b
ba82f29cf8939       4251b7012fd43       5 days ago          Running             vsphere-csi-node               0                   66e5ca3abfe5b
9c29960956718       f3fe18dd8cea2       5 days ago          Running             node-driver-registrar          0                   66e5ca3abfe5b
92aa3a904d72e       0515f8357a522       5 days ago          Running             antrea-agent                   3                   4ae63a9f8e4cb
9e0f32fa88663       0515f8357a522       5 days ago          Exited              antrea-agent                   2                   4ae63a9f8e4cb
ac5573a4d93f8       0515f8357a522       5 days ago          Running             antrea-ovs                     0                   4ae63a9f8e4cb
e14fea2c37c21       0515f8357a522       5 days ago          Exited              install-cni                    0                   4ae63a9f8e4cb
166627360a434       7fde82047d4f6       5 days ago          Running             docker-registry                0                   c84c550a9ab90
ecfbdcd23858d       f31127f4a3471       5 days ago          Running             kube-proxy                     0                   07f5b9be02414
root [ /etc/kubernetes/manifests ]# exit
exit
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$
vmware-system-user@gc-control-plane-6g9gk [ ~ ]$ exit
logout

Verify

In my case I had a namespace vineethac-testing with two nginx pods running under it while the snapshot was taken. After the snapshot was taken, I deleted the two nginx pods and the namespace vineethac-testing. After restoring the etcd snapshot, I can see that the namespace vineethac-testing is active with two nginx pods under it.

❯ gcc kg ns
NAME                           STATUS   AGE
default                        Active   9d
kube-node-lease                Active   9d
kube-public                    Active   9d
kube-system                    Active   9d
vineethac-testing              Active   4h28m
vmware-system-auth             Active   9d
vmware-system-cloud-provider   Active   9d
vmware-system-csi              Active   9d
❯
❯ gcc kg pods -n vineethac-testing
NAME     READY   STATUS    RESTARTS   AGE
nginx1   1/1     Running   0          4h25m
nginx2   1/1     Running   0          4h23m

Hope it was useful. Cheers!

Note: I've tested this in a lab. This may not be the best practice procedure and may slightly vary in a real world environment.

Saturday, April 9, 2022

Working with Kubernetes using Python - Part 03 - Get nodes

Following code snipet uses kubeconfig python module to switch context and Python client for the kubernetes API to get cluster node details. It takes the default kubeconfig file, and switch to the required context, and get node info of the respective cluster.

kubectl commands:

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>
kubectl get nodes -o json

Code:

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/
https://github.com/kubernetes-client/python

Hope it was useful. Cheers!

Saturday, March 19, 2022

Working with Kubernetes using Python - Part 02 - Switch context

Following code snipet uses kubeconfig python module and it takes the default kubeconfig file, and switch to a new context.

kubectl commands:

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>

Code:

Note: If you want to use a specific kubeconfig file, instead of conf = KubeConfig()

you can use conf = KubeConfig('path-to-your-kubeconfig')

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/

Hope it was useful. Cheers!

Saturday, March 5, 2022

Working with Kubernetes using Python - Part 01 - List contexts

Following code snipet uses Python client for the kubernetes API and it takes the default kubeconfig file, list the contexts, and active context.

kubectl commands:

kubectl config get-contexts
kubectl config current-context

Code:

Note: If you want to use a specific kubeconfig file, instead of

contexts, active_context = config.list_kube_config_contexts()

you can use

contexts, active_context = config.list_kube_config_contexts(config_file="path-to-kubeconfig")

Reference:

https://github.com/kubernetes-client/python

Hope it was useful. Cheers!

Pages

Sunday, July 17, 2022

Following is a sample log for this issue from the vmop-controller-manger:

Case 3:

Saturday, June 25, 2022

Saturday, June 11, 2022

Saturday, May 21, 2022

Saturday, April 9, 2022

Saturday, March 19, 2022

Saturday, March 5, 2022