Showing posts with label vSphere. Show all posts
Showing posts with label vSphere. Show all posts

Saturday, November 18, 2023

vSphere with Tanzu using NSX-T - Part29 - Logging using Loki stack

Grafana Loki is a log aggregation system that we can use for Kubernetes. In this post we will deploy Loki stack on a Tanzu Kubernetes cluster.

❯ KUBECONFIG=gc.kubeconfig kg no
NAME                                            STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-k8fzb                       Ready    control-plane,master   144m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-4n5kh   Ready    <none>                 132m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-8pcc6   Ready    <none>                 128m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-rx7jf   Ready    <none>                 134m   v1.23.8+vmware.3
❯
❯ helm repo add grafana https://grafana.github.io/helm-charts
❯ helm repo update
❯ helm repo list
❯ helm search repo loki

I saved the values file using helm show values grafana/loki-stack and made necessary modifications as mentioned below. 

  • I enabled Grafana by setting enabled: true. This will create a new Grafana instance.
  • I also added a section under grafana.ingress in the loki-stack/values.yaml, that will create an ingress resource for this new Grafana instance.

 Here is the values.yaml file.

test_pod:
  enabled: true
  image: bats/bats:1.8.2
  pullPolicy: IfNotPresent

loki:
  enabled: true
  isDefault: true
  url: http://{{(include "loki.serviceName" .)}}:{{ .Values.loki.service.port }}
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  datasource:
    jsonData: "{}"
    uid: ""


promtail:
  enabled: true
  config:
    logLevel: info
    serverPort: 3101
    clients:
      - url: http://{{ .Release.Name }}:3100/loki/api/v1/push

fluent-bit:
  enabled: false

grafana:
  enabled: true
  sidecar:
    datasources:
      label: ""
      labelValue: ""
      enabled: true
      maxLines: 1000
  image:
    tag: 8.3.5
  ingress:
    ## If true, Grafana Ingress will be created
    ##
    enabled: true

    ## IngressClassName for Grafana Ingress.
    ## Should be provided if Ingress is enable.
    ##
    ingressClassName: nginx

    ## Annotations for Grafana Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    ## Labels to be added to the Ingress
    ##
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enable.
    ##
    # hosts:
    #   - grafana.domain.com
    hosts:
      - grafana-loki-vineethac-poc.test.com

    ## Path for grafana ingress
    path: /

    ## TLS configuration for grafana Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
    # - secretName: grafana-general-tls
    #   hosts:
    #   - grafana.example.com

prometheus:
  enabled: false
  isDefault: false
  url: http://{{ include "prometheus.fullname" .}}:{{ .Values.prometheus.server.service.servicePort }}{{ .Values.prometheus.server.prefixURL }}
  datasource:
    jsonData: "{}"

filebeat:
  enabled: false
  filebeatConfig:
    filebeat.yml: |
      # logging.level: debug
      filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
      output.logstash:
        hosts: ["logstash-loki:5044"]

logstash:
  enabled: false
  image: grafana/logstash-output-loki
  imageTag: 1.0.1
  filters:
    main: |-
      filter {
        if [kubernetes] {
          mutate {
            add_field => {
              "container_name" => "%{[kubernetes][container][name]}"
              "namespace" => "%{[kubernetes][namespace]}"
              "pod" => "%{[kubernetes][pod][name]}"
            }
            replace => { "host" => "%{[kubernetes][node][name]}"}
          }
        }
        mutate {
          remove_field => ["tags"]
        }
      }
  outputs:
    main: |-
      output {
        loki {
          url => "http://loki:3100/loki/api/v1/push"
          #username => "test"
          #password => "test"
        }
        # stdout { codec => rubydebug }
      }

# proxy is currently only used by loki test pod
# Note: If http_proxy/https_proxy are set, then no_proxy should include the
# loki service name, so that tests are able to communicate with the loki
# service.
proxy:
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""

Deploy using Helm

❯ helm upgrade --install --atomic loki-stack grafana/loki-stack --values values.yaml --kubeconfig=gc.kubeconfig --create-namespace --namespace=loki-stack
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: gc.kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: gc.kubeconfig
Release "loki-stack" does not exist. Installing it now.
W1203 13:36:48.286498   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:48.592349   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.840670   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.849356   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: loki-stack
LAST DEPLOYED: Sun Dec  3 13:36:45 2023
NAMESPACE: loki-stack
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a datasource in Grafana.

See http://docs.grafana.org/features/datasources/loki/ for more detail.

 

Verify

❯ KUBECONFIG=gc.kubeconfig kg all -n loki-stack
NAME                                     READY   STATUS    RESTARTS   AGE
pod/loki-stack-0                         1/1     Running   0          89s
pod/loki-stack-grafana-dff58c989-jdq2l   2/2     Running   0          89s
pod/loki-stack-promtail-5xmrj            1/1     Running   0          89s
pod/loki-stack-promtail-cts5j            1/1     Running   0          89s
pod/loki-stack-promtail-frwvw            1/1     Running   0          89s
pod/loki-stack-promtail-wn4dw            1/1     Running   0          89s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/loki-stack              ClusterIP   10.110.208.35    <none>        3100/TCP   90s
service/loki-stack-grafana      ClusterIP   10.104.222.214   <none>        80/TCP     90s
service/loki-stack-headless     ClusterIP   None             <none>        3100/TCP   90s
service/loki-stack-memberlist   ClusterIP   None             <none>        7946/TCP   90s

NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-stack-promtail   4         4         4       4            4           <none>          90s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-stack-grafana   1/1     1            1           90s

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-stack-grafana-dff58c989   1         1         1       90s

NAME                          READY   AGE
statefulset.apps/loki-stack   1/1     91s

❯ KUBECONFIG=gc.kubeconfig kg ing -n loki-stack
NAME                 CLASS   HOSTS                                 ADDRESS        PORTS   AGE
loki-stack-grafana   nginx   grafana-loki-vineethac-poc.test.com   10.216.24.45   80      7m16s
❯

Now in my case I've an ingress controller and dns resolution in place. If you don't have those configured, you can just port forward the loki-stack-grafana service to view the Grafana dashboard.

To get the username and password you should decode the following secret:

❯ KUBECONFIG=gc.kubeconfig kg secrets -n loki-stack loki-stack-grafana -oyaml

Login to the Grafana instance and verify the Data Sources section, and it must be already configured. Now click on explore option and use the log browser to query logs. 

Hope it was useful. Cheers!

Saturday, February 4, 2023

vSphere with Tanzu using NSX-T - Part23 - Supervisor cluster certificates expiry

Note that the supervisor control plane component certificates will expire after one year. 

Here is the VMware KB: https://kb.vmware.com/s/article/89324

NOTE: If certificates expire on the Supervisor or Guest Clusters, access and management of the clusters will fail. And, you will need to raise a case with VMware support team for assistance.

Keep a note of this cert expiry date, and if you can update the supervisor cluster atleast once in a year, these certs will get updated.

Here is a quick way to check the expiry of the supervisor control plane certs. 

❯ k config current-context
sc2-06-d5165f-vc01

❯ k cluster-info
Kubernetes control plane is running at https://10.43.69.117:6443
KubeDNS is running at https://10.43.69.117:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

❯ echo | openssl s_client -servername 10.43.69.117 -connect 10.43.69.117:6443 | openssl x509 -noout -dates
depth=0 CN = kube-apiserver
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = kube-apiserver
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
notBefore=Jun 2 09:36:17 2023 GMT
notAfter=Jun 1 09:36:18 2024 GMT


Thanks to my friend Ravikrithik Udainath for the above openssl tip!

I am using the admin kubeconfig of the supervisor cluster. Here is the link to my previous article on exporting WCP admin kubeconfig file. In this case, 10.43.69.117 is the floating IP for the supervisor control plane and it is assigned to one of the supervisor control plane VMs.

This vSphere with Tanzu cluster was deployed on June 02, 2023, and as you can see above, the certificate expiry will be after one year, which in this case is June 01, 2024. 

You can set up some sort of monitoring/ alerting for all your supervisor clusters to get notification on these expiry dates. 

Hope it was useful. Cheers!

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article we will look at a TKC that is stuck at updating phase which has multiple Kubernetes nodes in NotReady state. 

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
5

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-tnhv9 NotReady                    control-plane,master 63d v1.20.9+vmware.1
gc-control-plane-tqvnk NotReady                   control-plane,master 50d v1.20.9+vmware.1
gc-control-plane-wsclb NotReady                   <none> 8d v1.20.9+vmware.1
gc-control-plane-wt6sx NotReady                   <none> 30d v1.20.9+vmware.1
gc-control-plane-zthnq NotReady                   control-plane,master 49d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

Note: gcc is alias that I am using for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Lets verify where etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

You can see etcd pods are running on nodes that are in Ready status. So now we can go ahead and safely drain and delete the nodes that are NotReady.

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc updating 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3

Now, I waited for few minutes to see whether the reconciliation process will proceed and change the status of the TKC from updating to running. But it was still stuck at updating phase. So I described the TKC.

Conditions:
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: Ready
Last Transition Time: 2023-01-01T19:19:45Z
Status: True
Type: AddonsReady
Last Transition Time: 2022-12-30T19:47:15Z
Message: Rolling 1 replicas with outdated spec (2 replicas up to date)
Reason: RollingUpdateInProgress
Severity: Warning
Status: False
Type: ControlPlaneReady
Last Transition Time: 2022-07-24T15:53:06Z
Status: True
Type: NodePoolsReady
Last Transition Time: 2022-09-01T09:02:26Z
Message: 3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
Status: True
Type: NodesHealthy

Checked vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

It says something is wrong with CP node gc-control-plane-2rbsb.
❯ gcc kg po -A -o wide | grep etcd
kube-system etcd-gc-control-plane-2rbsb 0/1 Running 811 410d 172.31.14.6 gc-control-plane-2rbsb <none> <none>
kube-system etcd-gc-control-plane-5zjn4 1/1 Running 1 124d 172.31.14.7 gc-control-plane-5zjn4 <none> <none>
kube-system etcd-gc-control-plane-9t97w 1/1 Running 1 123d 172.31.14.8 gc-control-plane-9t97w <none> <none>

You can see etcd pod is not running on first control plane node and is getting continuously restarted. So lets try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-2rbsb Ready,SchedulingDisabled control-plane,master 410d v1.20.9+vmware.1
gc-control-plane-5zjn4 Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

Now lets delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


After few minutes you can see a new machine and the corresponding node got provisioned and the TKC changed from updating to running phase.

❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 123d v1.20.9+vmware.1
gc-control-plane-dnr66 gc Provisioning 13s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
gc-control-plane-5zjn4 gc gc-control-plane-5zjn4 vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea Running 124d v1.20.9+vmware.1
gc-control-plane-9t97w gc gc-control-plane-9t97w vsphere://4201377e-0f46-40b6-e222-9c723c6adb19 Running 124d v1.20.9+vmware.1
gc-control-plane-dnr66 gc gc-control-plane-dnr66 vsphere://42011228-b156-3338-752a-e7233c9258dd Running 2m2s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl gc gc-workers-ztr5c-6f4b555879-2v8pl vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37 Running 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p gc gc-workers-ztr5c-6f4b555879-8qs4p vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367 Running 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 gc gc-workers-ztr5c-6f4b555879-r29d5 vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3 Running 458d v1.20.9+vmware.1

❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 NotReady control-plane,master 35s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1


❯ gcc kg no
NAME STATUS ROLES AGE VERSION
gc-control-plane-5zjn4 Ready control-plane,master 124d v1.20.9+vmware.1
gc-control-plane-9t97w Ready control-plane,master 123d v1.20.9+vmware.1
gc-control-plane-dnr66 Ready control-plane,master 53s v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl Ready <none> 458d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p Ready <none> 456d v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5 Ready <none> 458d v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01 gc running 2021-07-29T16:59:34Z v1.20.9+vmware.1-tkg.1.a4cee5b 3 3
Hope it was useful. Cheers!

Saturday, August 13, 2022

vSphere with Tanzu using NSX-T - Part18 - Troubleshooting vSphere pods with ProviderFailed status

In this article, we will take a look at fixing vSphere pods with ProviderFailed status. Following is an example:

svc-opa-gatekeeper-domain-c61                 gatekeeper-controller-manager-5ccbc7fd79-5gn2n                    0/1     ProviderFailed     0          2d14h
svc-opa-gatekeeper-domain-c61 gatekeeper-controller-manager-5ccbc7fd79-5jtvj 0/1 ProviderFailed 0 2d13h
svc-opa-gatekeeper-domain-c61 gatekeeper-controller-manager-5ccbc7fd79-5whtt 0/1 ProviderFailed 0 2d14h
svc-opa-gatekeeper-domain-c61 gatekeeper-controller-manager-5ccbc7fd79-6p2zv 0/1 ProviderFailed 0 2d13h
svc-opa-gatekeeper-domain-c61 gatekeeper-controller-manager-5ccbc7fd79-7r92p 0/1 ProviderFailed 0 2d14h
When describing the pod, you can see the message "Unable to find backing for logical switch".

❯ kd po gatekeeper-controller-manager-5ccbc7fd79-5gn2n -n svc-opa-gatekeeper-domain-c61
Name: gatekeeper-controller-manager-5ccbc7fd79-5gn2n
Namespace: svc-opa-gatekeeper-domain-c61
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: esx-1.sddc-35-82-xxxxx.xxxxxxx.com/
Labels: control-plane=controller-manager
gatekeeper.sh/operation=webhook
gatekeeper.sh/system=yes
pod-template-hash=5ccbc7fd79
Annotations: attachment_id: 668b681b-fef6-43e5-8009-5ac8deb6da11
kubernetes.io/psp: wcp-default-psp
mac: 04:50:56:00:08:1e
vlan: None
vmware-system-ephemeral-disk-uuid: 6000C297-d1ba-ce8c-97ba-683a3c8f5321
vmware-system-image-references: {"manager":"gatekeeper-111fd0f684141bdad12c811b4f954ae3d60a6c27-v52049"}
vmware-system-vm-moid: vm-89777:750f38c6-3b0e-41b7-a94f-4d4aef08e19b
vmware-system-vm-uuid: 500c9c37-7055-1708-92d4-8ffdf932c8f9
Status: Failed
Reason: ProviderFailed
Message: Unable to find backing for logical switch 03f0dcd4-a5d9-431e-ae9e-d796ddca0131: timed out waiting for the condition Unable to find backing for logical switch: 03f0dcd4-a5d9-431e-ae9e-d796ddca0131
IP:
IPs: <none>
A workaround for this is to restart the spherelet service on the ESXi host where you see this issue. If there are multiple ESXi nodes having same issue, you could consider restarting the spherelet service on all ESXi worker nodes. In a production setup you may want to consider placing the ESXi in maintenance mode before restarting the spherelet service. In my case, we usually restart the spherelet service directly without placing the ESXi in MM. Following is the PowerCLI way to check/ restart spherelet service on ESXi worker nodes:
 

> Connect-VIServer wdc-10-vc21

> Get-VMHost | Get-VMHostService | where {$_.Key -eq "spherelet"} | select VMHost,Key,Running | ft

VMHost Key Running
------ --- -------
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True
wdc-10-r0xxxxxxxxxxxxxxxxxxxx spherelet True

> $sphereletservice = Get-VMHost wdc-10-r0xxxxxxxxxxxxxxxxxxxx | Get-VMHostService | where {$_.Key -eq "spherelet"}
> Stop-VMHostService -HostService $sphereletservice

Perform operation?
Perform operation Stop host service. on spherelet?
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is "Y"): Y

Key Label Policy Running Required
--- ----- ------ ------- --------
spherelet spherelet on False False

> Get-VMHost wdc-10-r0xxxxxxxxxxxxxxxxxxxx | Get-VMHostService | where {$_.Key -eq "spherelet"}

Key Label Policy Running Required
--- ----- ------ ------- --------
spherelet spherelet on False False

> Start-VMHostService -HostService $sphereletservice

Key Label Policy Running Required
--- ----- ------ ------- --------
spherelet spherelet on True False

To restart spherelet service on all ESXi worker nodes of a cluster:
> Get-Cluster

Name HAEnabled HAFailover DrsEnabled DrsAutomationLevel
Level
---- --------- ---------- ---------- ------------------
wdc-10-vcxxc01 True 1 True FullyAutomated

> Get-Cluster -Name wdc-10-vcxxc01 | Get-VMHost | foreach { Restart-VMHostService -HostService ($_ | Get-VMHostService | where {$_.Key -eq "spherelet"}) }

After restarting the spherelet service, new pods will come up fine and be in Running status. But you may need to clean up all those pods with ProviderFailed status using kubectl. 
kubectl get pods -A | grep ProviderFailed | awk '{print $2 " --namespace=" $1}' | xargs kubectl delete pod

Hope it was useful. Cheers!

Saturday, December 18, 2021

vSphere with Tanzu using NSX-T - Part13 - Export WCP admin kubeconfig

In the previous posts we discussed the following:

This article shows the steps to export WCP admin kubeconfig file from the supervisor control plane VM. This is the admin kubeconfig file that can be used to manage the Supervisor/ WCP K8s cluster.

Step1: SSH as root to the vCenter server.

Step2: Run the script /usr/lib/vmware-wcp/decryptK8Pwd.py and make a note of the IP and PWD.

Step3: SSH as root to the IP that you noted down from previous step, and then provide the password that you got from step2.

Step4: You can now copy the admin kubeconfig file from /etc/kubernetes/admin.conf file to your local machine. Make sure to modify the field server: https://127.0.0.1:6443 in your local admin.conf file to the IP that you got from step2 (server: https://IP_from_step2:6443). 

Note: If you are managing multiple WCP clusters, you can merge all the kubeconfig files. Refer this blog by Jacob Tomlinson for more details. 

Hope it was useful. Cheers!

Tuesday, August 3, 2021

vSphere with Tanzu using NSX-T - Part10 - Upgrade K8s version of Tanzu Kubernetes cluster

In the previous posts we discussed the following:

Part1 - Prerequisites

Part2 - Configure NSX

Part3 - Edge Cluster

Part4 - Tier-0 Gateway and BGP peering

Part5 - Tier-1 Gateway and Segments

Part6 - Create tags, storage policy, and content library

Part7 - Enable workload management


In this article, I will explain how to upgrade the K8s version of a Tanzu Kubernetes cluster.

Verify the current K8s version of the Tanzu Kubernetes cluster.


Check available Tanzu Kubernetes versions.


Edit the cluster manifest file.


Here we are updating from 1.18.5 to 1.18.15.


Save the manifest file.


You can see the corresponding cluster starts updating.


The cluster will get updated to the newer version in a rolling fashion. The control plane node gets updated first, followed by the worker nodes one by one. A new node will be added to the cluster with new version, and an old node will be removed from the cluster.
 


Verify.

As you can see, the tkg-cluster-02 is upgraded from 1.18.5 to 1.18.15.


Hope it was useful. Cheers!

References

Friday, March 19, 2021

vSphere with Tanzu using NSX-T - Part5 - Tier-1 Gateway and Segments

In the previous posts we discussed the following: 

Part1: Prerequisites

Part2: Configure NSX-T

Part3: Edge Cluster

Part4: Tier-0 Gateway and BGP peering


The next step is to create a Tier-1 Gateway and network segments. 

  • Add Tier-1 Gateway.
    • Provide name, select the linked T0 Gateway, and select the route advertisement settings.

  • Add Segment.
    • Provide segment name, connected gateway, transport zone, and subnet.
    • Here we are creating an overlay segment and the subnet CIDR 172.16.10.1/24 will be the gateway IP for this segment.

Now, let's verify whether this segment is being advertised (route advertisement) or not. Following is the screenshot from both edge nodes and you can see that the Tier-0 SR is aware of 172.16.10.0/24 network:


As Tier-0 Gateway is connected to the TOR switches via BGP, we can verify whether the TOR switches are aware about this newly created segment. 


You can see that the TORs are aware of 172.16.10.0/24 network via BGP. Let's connect a VM to this segment, assign an IP address, and test network connectivity. 


You can also view the network topology from NSX-T.


This is the traffic flow: VM - Network segment - Tier-1 Gateway - Tier-0 Gateway - BGP peering - TOR switches - NAT VM - External network.

Hope this was useful. Cheers!

Sunday, February 7, 2021

vSphere with Tanzu using NSX-T - Part4 - Tier-0 Gateway and BGP peering

In the previous posts we discussed the following: 

Part1: Prerequisites
Part2: Configure NSX-T
Part3: Edge Cluster


The next step is to create a Tier-0 Gateway, configure its interfaces, and BGP peer with the L3 TOR switches. Following is a high-level logical representation of this configuration:


Configure Tier-0 Gateway


Before creating the T0-Gateway, we need to create two segments.

  • Add Segments.
    • Create a segment "ls-uplink-v54"
      • VLAN: 54
      • Transport Zone: "edge-vlan-tz"
    • Create a segment "ls-uplink-v55"
      • VLAN: 55
      • Transport Zone: "edge-vlan-tz"

  • Add Tier-0 Gateway.
    • Provide the necessary details as shown below.

    • Add 4 interfaces and configure them as per the logical diagram given above.
      • edge-01-uplink1 - 192.168.54.254/24 - connected via segment ls-uplink-v54
      • edge-01-uplink2 - 192.168.55.254/24 - connected via segment ls-uplink-v55

      • edge-02-uplink1 - 192.168.54.253/24 - connected via segment ls-uplink-v54
      • edge-02-uplink2 - 192.168.55.253/24 - connected via segment ls-uplink-v55
    • Verify the status is showing success for all the 4 interfaces that you added.

  • Routing and multicast settings of T0 are as follows:

    • You can see a static route is configured. The next hop for the default route 0.0.0.0/0 is set to 192.168.54.1. 

    • The next hop configuration is given below.

  • BGP settings of T0 are shown below.

    • BGP Neighbor config:

    • Verify the status is showing success for the two BGP Neighbors that you added.

  • Route re-distribution settings of T0:

    • Add route re-distribution.
    • Set route re-distribution.

Now, the T0 configuration is complete. The next step is to configure BGP on the Dell S4048-ON TOR switches.

Configure TOR Switches


---On TOR A---
conf
router bgp 65500
neighbor 192.168.54.254 remote-as 65400
#peering to T0 edge-01 interface
neighbor 192.168.54.254 no shutdown
neighbor 192.168.54.253 remote-as 65400
#peering to T0 edge-02 interface
neighbor 192.168.54.253 no shutdown
neighbor 192.168.54.3 remote-as 65500
#peering to TOR B in VLAN 54
neighbor 192.168.54.3 no shutdown
maximum-paths ebgp 4
maximum-paths ibgp 4


---On TOR B---
conf
router bgp 65500
neighbor 192.168.55.254 remote-as 65400
#peering to T0 edge-01 interface
neighbor 192.168.55.254 no shutdown
neighbor 192.168.55.253 remote-as 65400
#peering to T0 edge-02 interface
neighbor 192.168.55.253 no shutdown
neighbor 192.168.54.2 remote-as 65500
#peering to TOR A in VLAN 54
neighbor 192.168.54.2 no shutdown
maximum-paths ebgp 4
maximum-paths ibgp 4


---Advertising ESXi mgmt and VM traffic networks in BGP on both TORs---

conf
router bgp 65500
network 192.168.41.0/24
network 192.168.43.0/24


Thanks to my friend and vExpert Harikrishnan @hari5611 for helping me with the T0 configs and BGP peering on TORs. Do check out his blog https://vxplanet.com/


Verify BGP Configurations


The next step is to verify the BGP configs on TORs using the following commands:

show running-config bgp

show ip bgp summary

show ip bgp neighbors


Follow the VMware documentation to verify the BGP connections from a Tier-0 Service Router. In the below screenshot you can see that both Edge nodes have the BGP neighbors 192.168.54.2 and 192.168.55.3 with state Estab.


In the next article, I will talk about adding a T1 Gateway, adding new segments for apps, connecting VMs to the segments, and verify connectivity to different internal and external networks. I hope this was useful. Cheers!

Monday, December 28, 2020

vSphere with Tanzu using NSX-T - Part3 - Edge Cluster

In the previous post, we went through basic NSX-T configurations. The next step is to deploy Edge VMs and create an Edge cluster. Before creating Edge VMs, we need to create two trunk port groups on the VDS using the VCSA web UI. Uplinks of the Edge VMs will be connected to these two trunk port groups.

  • Create 2 trunk port groups on the VDS.
    • "Trunk01-NSXT-Edge"
    • "Trunk02-NSXT-Edge"


  • Add Edge Transport Nodes.
    • Create two Edge VMs "edge-01, edge-02".
    • Click ADD EDGE VM, follow the wizard, and provide all the details.






  • Click Finish. Follow the same process and deploy one more Edge VM.

  • The next step is to create an Edge Cluster "edge-cluster-01and add the two Edge VMs to it.


The NSX-T Edge Cluster is now ready. Next, we have to add a Tier-0 Gateway and configure BGP peering with the router or L3 TOR switches. This will be covered in the next part. Hope it was useful. Cheers!

Related posts


Part1: Prerequisites
Part2: Configure NSX-T