vineethac.blogspot.com: kubeconfig


Monday, July 1, 2024

vSphere with Tanzu using NSX-T - Part34 - CPU and Memory utilization of a supervisor cluster

vSphere with Tanzu is a Kubernetes-based platform for deploying and managing containerized applications. As with any cloud-native platform, it's essential to monitor the performance and utilization of the underlying infrastructure to ensure optimal resource allocation and avoid potential issues. In this blog post, we'll explore a Python script that can be used to check the CPU and memory allocation/usage of a WCP Supervisor cluster.

You can access the Python script from my GitHub repository: https://github.com/vineethac/VMware/tree/main/vSphere_with_Tanzu/wcp_cluster_util

Sample screenshot of the output

The script uses the Kubernetes Python client library (kubernetes) to connect to the Supervisor cluster using the admin kubeconfig and retrieve information about the nodes and their resource utilization. The script then calculates the average CPU and memory utilization across all nodes and prints the results to the console.
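The averaging step described above can be sketched as follows. This is not the script from the repository: it is a minimal illustration that assumes the per-node usage and capacity values have already been fetched as Kubernetes quantity strings, and the function names and input layout are hypothetical.

```python
# Hypothetical helpers for illustration; not taken from the repository script.

_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ("2", "1500m", "250000000n") to cores."""
    if quantity.endswith("n"):  # nanocores
        return int(quantity[:-1]) / 1_000_000_000
    if quantity.endswith("u"):  # microcores
        return int(quantity[:-1]) / 1_000_000
    if quantity.endswith("m"):  # millicores
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a Kubernetes memory quantity ("16Gi", "512Mi", "1024") to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def average_utilization(nodes):
    """Return (cpu_percent, memory_percent) averaged across all nodes.

    Each node is a dict with "cpu_used", "cpu_capacity", "mem_used" and
    "mem_capacity" given as Kubernetes quantity strings (assumed layout).
    """
    cpu = [100 * parse_cpu(n["cpu_used"]) / parse_cpu(n["cpu_capacity"]) for n in nodes]
    mem = [100 * parse_memory(n["mem_used"]) / parse_memory(n["mem_capacity"]) for n in nodes]
    return sum(cpu) / len(cpu), sum(mem) / len(mem)
```

Keeping the quantity parsing separate from the averaging makes it easy to feed the same functions with values from whichever API call supplies the node data.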

Note: In my case, instead of running it as a script every time, I made it an executable plugin and copied it to a directory on the system executable path. I placed it in $HOME/.krew/bin on my laptop.

Hope it was useful. Cheers!

Saturday, June 22, 2024

vSphere with Tanzu using NSX-T - Part31 - Troubleshooting inaccessible TKC with expired control plane certs

In the course of managing multiple Tanzu Kubernetes Clusters (TKC), I encountered an unexpected issue: the control plane certificates had expired, preventing us from accessing the cluster using the kubeconfig file. To make matters worse, we were unable to SSH into the TKC control plane Virtual Machines (VMs) due to the vmware-system-user password expiring in accordance with STIG Hardening.

The recommended workaround for updating the vmware-system-user password expiry involves applying a specific daemonset on Guest Clusters. However, this approach requires access to the TKC using its admin kubeconfig file, which was unavailable due to the expired certificates.

Warning: In case of critical production issues that affect the accessibility of your Tanzu Kubernetes Cluster (TKC), it is strongly advised to submit a product support request to our team for assistance. This will ensure that you receive expert guidance and a timely resolution to help minimize the impact on your environment.

To resolve this issue, I followed an alternative workaround: I reset the root password of the TKC control plane VMs through the vCenter VM console, as outlined in this knowledge base article. Once the root password was reset, I was able to log directly into the TKC control plane VM using the VM console.

After gaining access to the TKC control plane VM, I proceeded to renew the control plane certificates using kubeadm, as detailed in this blog post. It's essential to apply this process to all control plane nodes in your cluster to ensure proper functionality.

root [ /etc/kubernetes ]# kubeadm certs check-expiration

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Although this workaround required some additional steps, it ultimately allowed us to regain access to our Tanzu Kubernetes Cluster and maintain its security and functionality.

Hope it was useful. Cheers!

Saturday, May 25, 2024

vSphere with Tanzu using NSX-T - Part30 - Troubleshooting inaccessible TKC with server pool members missing in the LB VS

Encountering issues with connectivity to your TKC apiserver/control plane can be frustrating. One common problem we've seen is the kubeconfig failing to connect, often due to missing server pool members in the load balancer's virtual server (LB VS).

The Issue

The LB VS, which operates on port 6443, should have the control plane VMs listed as its member servers. When these members are missing, connectivity problems arise, disrupting your access to the TKC apiserver.

Troubleshooting steps

Access the TKC: Use the kubeconfig to access the TKC.

❯ KUBECONFIG=tkc.kubeconfig kubectl get node
Unable to connect to the server: dial tcp 10.191.88.4:6443: i/o timeout
❯

Check the Load Balancer: In NSX-T, verify the status of the corresponding load balancer (LB). It may display a green status indicating success.
Inspect Virtual Servers: Check the virtual servers in the LB, particularly on port 6443. They might show as down.
Examine Server Pool Members: Look into the server pool members of the virtual server. You may find it empty.
SSH to Control Plane Nodes: Attempt to SSH into the TKC control plane nodes.

Run Diagnostic Commands: Execute diagnostic commands inside the control plane nodes to verify their status. The issue could be that the control plane VMs are in a hung state, and the container runtime is not running.

vmware-system-user@tkc-infra-r68zc-jmq4j [ ~ ]$ sudo su
root [ /home/vmware-system-user ]# crictl ps
FATA[0002] failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl is-active containerd
Failed to retrieve unit state: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl status containerd
WARNING: terminal is not fully functional
-  (press RETURN)Failed to get properties: Failed to activate service 'org.freedesktop.systemd1'>
lines 1-1/1 (END)lines 1-1/1 (END)

Check VM Console: From vCenter, check the console of the control plane VMs. You might see specific errors indicating issues.

EXT4-fs (sda3): Delayed block allocation failed for inode 266704 at logical offset 10515 with max blocks 2 with error 5
EXT4-fs (sda3): This should not happen!! Data will be lost
EXT4-fs error (device sda3) in ext4_writepages:2905: IO failure
EXT4-fs error (device sda3) in ext4_reserve_inode_write:5947: Journal has aborted
EXT4-fs error (device sda3) xxxxxx-xxx-xxxx: unable to read itable block
EXT4-fs error (device sda3) in ext4_journal_check_start:61: Detected aborted journal
systemd[1]: Caught <BUS>, dumped core as pid 24777.
systemd[1]: Freezing execution.

Restart Control Plane VMs: Restart the control plane VMs. Note that sometimes your admin credentials or administrator@vsphere.local credentials may not allow you to restart the TKC VMs. In such cases, decode the username and password from the relevant secret and use these credentials to connect to vCenter and restart the hung TKC VMs.

❯ kubectx wdc-01-vc17
Switched to context "wdc-01-vc17".
❯
❯ kg secret -A | grep wcp
kube-system                                 wcp-authproxy-client-secret                                               kubernetes.io/tls                                  3      291d
kube-system                                 wcp-authproxy-root-ca-secret                                              kubernetes.io/tls                                  3      291d
kube-system                                 wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-nsop                          wcp-nsop-sa-vc-auth                                                       Opaque                                             2      291d
vmware-system-nsx                           wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-vmop                          wcp-vmop-sa-vc-auth                                                       Opaque                                             2      291d
❯
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth
NAME                  TYPE     DATA   AGE
wcp-vmop-sa-vc-auth   Opaque   2      291d
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth -oyaml
apiVersion: v1
data:
  password: aWAmbHUwPCpKe1Uxxxxxxxxxxxx=
  username: d2NwLXZtb3AtdXNlci1kb21haW4tYzEwMDYtMxxxxxxxxxxxxxxxxxxxxxxxxQHZzcGhlcmUubG9jYWw=
kind: Secret
metadata:
  creationTimestamp: "2022-10-24T08:32:26Z"
  name: wcp-vmop-sa-vc-auth
  namespace: vmware-system-vmop
  resourceVersion: "336557268"
  uid: dcbdac1b-18bb-438c-ba11-76ed4d6bef63
type: Opaque
❯
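The username and password fields in the Secret above are base64-encoded, not encrypted. A minimal Python sketch of decoding them is given below. The function name is hypothetical and the sample values are made up; they are not the redacted values from the output above.

```python
import base64


def decode_secret_data(data):
    """Base64-decode every value in the `data` map of a Kubernetes Secret."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample = {
        "username": base64.b64encode(b"example-user@vsphere.local").decode("ascii"),
        "password": base64.b64encode(b"example-password").decode("ascii"),
    }
    print(decode_secret_data(sample))
```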

Decode the username and password from the secret and use them to connect to vCenter.
The following is an example using PowerCLI:

PS /Users/vineetha> get-vm gc-control-plane-f266h

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VMGuest
Restart-VMGuest: 08/04/2023 22:20:20	Restart-VMGuest		Operation "Restart VM guest" failed for VM "gc-control-plane-f266h" for the following reason: A general system error occurred: Invalid fault
PS /Users/vineetha>
PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VM

Confirm
Are you sure you want to perform this action?
Performing the operation "Restart-VM" on target "VM 'gc-control-plane-f266h'".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha>

Verify System Pods and Connectivity: Once the control plane VMs are restarted, the system pods inside them will start, and the apiserver will become accessible using the kubeconfig. You should also see the previously missing server pool members reappear in the corresponding LB virtual server, and the virtual server on port 6443 will be up and show a success status.
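As a quick check for this last step, the reachability of the apiserver endpoint on port 6443 can be tested without kubectl. The following is a minimal Python sketch using only the standard library; the function name is hypothetical, and it only confirms that a TCP connection opens, not that the apiserver is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and unreachable hosts
        return False


# Example usage with the VIP from the error message earlier in this post:
#   can_connect("10.191.88.4", 6443)
```

A False result here corresponds to the "i/o timeout" seen from kubectl while the server pool was empty.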

Following these steps should help you resolve the connectivity issues with your TKC apiserver/control plane effectively. Ensuring that your load balancer's virtual server is correctly configured with the appropriate member servers is crucial for maintaining seamless access. This runbook aims to guide you through the process, helping you get your TKC apiserver back online swiftly.

Note: For critical production issues related to TKC accessibility, I strongly recommend raising a product support request.

Hope it was useful. Cheers!

Saturday, November 18, 2023

vSphere with Tanzu using NSX-T - Part29 - Logging using Loki stack

Grafana Loki is a log aggregation system that we can use for Kubernetes. In this post we will deploy the Loki stack on a Tanzu Kubernetes cluster.

❯ KUBECONFIG=gc.kubeconfig kg no
NAME                                            STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-k8fzb                       Ready    control-plane,master   144m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-4n5kh   Ready    <none>                 132m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-8pcc6   Ready    <none>                 128m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-rx7jf   Ready    <none>                 134m   v1.23.8+vmware.3
❯

❯ helm repo add grafana https://grafana.github.io/helm-charts
❯ helm repo update
❯ helm repo list
❯ helm search repo loki

I saved the values file using helm show values grafana/loki-stack and made necessary modifications as mentioned below.

I enabled Grafana by setting enabled: true. This will create a new Grafana instance.
I also added a section under grafana.ingress in the loki-stack/values.yaml, that will create an ingress resource for this new Grafana instance.

Here is the values.yaml file.

test_pod:
  enabled: true
  image: bats/bats:1.8.2
  pullPolicy: IfNotPresent

loki:
  enabled: true
  isDefault: true
  url: http://{{(include "loki.serviceName" .)}}:{{ .Values.loki.service.port }}
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  datasource:
    jsonData: "{}"
    uid: ""


promtail:
  enabled: true
  config:
    logLevel: info
    serverPort: 3101
    clients:
      - url: http://{{ .Release.Name }}:3100/loki/api/v1/push

fluent-bit:
  enabled: false

grafana:
  enabled: true
  sidecar:
    datasources:
      label: ""
      labelValue: ""
      enabled: true
      maxLines: 1000
  image:
    tag: 8.3.5
  ingress:
    ## If true, Grafana Ingress will be created
    ##
    enabled: true

    ## IngressClassName for Grafana Ingress.
    ## Should be provided if Ingress is enable.
    ##
    ingressClassName: nginx

    ## Annotations for Grafana Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    ## Labels to be added to the Ingress
    ##
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enable.
    ##
    # hosts:
    #   - grafana.domain.com
    hosts:
      - grafana-loki-vineethac-poc.test.com

    ## Path for grafana ingress
    path: /

    ## TLS configuration for grafana Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
    # - secretName: grafana-general-tls
    #   hosts:
    #   - grafana.example.com

prometheus:
  enabled: false
  isDefault: false
  url: http://{{ include "prometheus.fullname" .}}:{{ .Values.prometheus.server.service.servicePort }}{{ .Values.prometheus.server.prefixURL }}
  datasource:
    jsonData: "{}"

filebeat:
  enabled: false
  filebeatConfig:
    filebeat.yml: |
      # logging.level: debug
      filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
      output.logstash:
        hosts: ["logstash-loki:5044"]

logstash:
  enabled: false
  image: grafana/logstash-output-loki
  imageTag: 1.0.1
  filters:
    main: |-
      filter {
        if [kubernetes] {
          mutate {
            add_field => {
              "container_name" => "%{[kubernetes][container][name]}"
              "namespace" => "%{[kubernetes][namespace]}"
              "pod" => "%{[kubernetes][pod][name]}"
            }
            replace => { "host" => "%{[kubernetes][node][name]}"}
          }
        }
        mutate {
          remove_field => ["tags"]
        }
      }
  outputs:
    main: |-
      output {
        loki {
          url => "http://loki:3100/loki/api/v1/push"
          #username => "test"
          #password => "test"
        }
        # stdout { codec => rubydebug }
      }

# proxy is currently only used by loki test pod
# Note: If http_proxy/https_proxy are set, then no_proxy should include the
# loki service name, so that tests are able to communicate with the loki
# service.
proxy:
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""

Deploy using Helm

❯ helm upgrade --install --atomic loki-stack grafana/loki-stack --values values.yaml --kubeconfig=gc.kubeconfig --create-namespace --namespace=loki-stack
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: gc.kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: gc.kubeconfig
Release "loki-stack" does not exist. Installing it now.
W1203 13:36:48.286498   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:48.592349   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.840670   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.849356   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: loki-stack
LAST DEPLOYED: Sun Dec  3 13:36:45 2023
NAMESPACE: loki-stack
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a datasource in Grafana.

See http://docs.grafana.org/features/datasources/loki/ for more detail.

Verify

❯ KUBECONFIG=gc.kubeconfig kg all -n loki-stack
NAME                                     READY   STATUS    RESTARTS   AGE
pod/loki-stack-0                         1/1     Running   0          89s
pod/loki-stack-grafana-dff58c989-jdq2l   2/2     Running   0          89s
pod/loki-stack-promtail-5xmrj            1/1     Running   0          89s
pod/loki-stack-promtail-cts5j            1/1     Running   0          89s
pod/loki-stack-promtail-frwvw            1/1     Running   0          89s
pod/loki-stack-promtail-wn4dw            1/1     Running   0          89s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/loki-stack              ClusterIP   10.110.208.35    <none>        3100/TCP   90s
service/loki-stack-grafana      ClusterIP   10.104.222.214   <none>        80/TCP     90s
service/loki-stack-headless     ClusterIP   None             <none>        3100/TCP   90s
service/loki-stack-memberlist   ClusterIP   None             <none>        7946/TCP   90s

NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-stack-promtail   4         4         4       4            4           <none>          90s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-stack-grafana   1/1     1            1           90s

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-stack-grafana-dff58c989   1         1         1       90s

NAME                          READY   AGE
statefulset.apps/loki-stack   1/1     91s

❯ KUBECONFIG=gc.kubeconfig kg ing -n loki-stack
NAME                 CLASS   HOSTS                                 ADDRESS        PORTS   AGE
loki-stack-grafana   nginx   grafana-loki-vineethac-poc.test.com   10.216.24.45   80      7m16s
❯

In my case I have an ingress controller and DNS resolution in place. If you don't have those configured, you can port-forward the loki-stack-grafana service to view the Grafana dashboard.

To get the username and password you should decode the following secret:

❯ KUBECONFIG=gc.kubeconfig kg secrets -n loki-stack loki-stack-grafana -oyaml

Log in to the Grafana instance and check the Data Sources section; the Loki data source should already be configured. Now click the Explore option and use the log browser to query logs.
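Logs can also be queried outside Grafana through Loki's HTTP API. The sketch below only builds a request URL for the query_range endpoint. It assumes the loki-stack service has been port-forwarded to localhost:3100 and that promtail attaches a namespace label to the log streams; the function name is hypothetical.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url, logql, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Assumed local port-forward of the loki-stack service (port 3100).
    print(loki_query_range_url("http://localhost:3100", '{namespace="loki-stack"}', limit=10))
```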

Hope it was useful. Cheers!

Sunday, May 7, 2023

Kubernetes 101 - Part9 - kubeconfig certificate expiration

You can verify the expiration date of the client certificate in the kubeconfig for the current context as follows:

kubectl config view --minify --raw --output 'jsonpath={..user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate

❯ k config current-context
sc2-01-vcxx
❯
❯ kubectl config view --minify --raw --output 'jsonpath={..user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate
notAfter=Sep  6 05:13:47 2023 GMT
❯
❯ date
Thu Sep  7 18:05:52 IST 2023
❯
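The extraction half of the pipeline above (pulling client-certificate-data out of a kubeconfig and base64-decoding it) can be sketched in Python as follows. Unlike kubectl config view --minify, this hypothetical helper does not look at the current context; it simply takes the first client-certificate-data field it finds in the text.

```python
import base64
import re


def extract_client_cert_pem(kubeconfig_text):
    """Return the first client certificate embedded in a kubeconfig as PEM text."""
    match = re.search(r"client-certificate-data:\s*(\S+)", kubeconfig_text)
    if match is None:
        raise ValueError("no client-certificate-data field found")
    return base64.b64decode(match.group(1)).decode("ascii")
```

The returned PEM text can then be passed to openssl x509 -noout -enddate, as in the command above, to read the expiry date.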

Hope it was useful. Cheers!

Saturday, April 8, 2023

vSphere with Tanzu using NSX-T - Part24 - Kubernetes component certs in TKC

The Kubernetes component certificates inside a TKC (Tanzu Kubernetes Cluster) have a lifetime of 1 year. If you upgrade your TKC at least once a year, these certs will get rotated automatically.

IMPORTANT NOTES:

As per this VMware KB, if TKGS Guest Cluster certificates are expired, you will need to engage VMware support to manually rotate them.
The following troubleshooting steps and workaround are based on tests conducted on my dev/test/lab setup, and I do NOT recommend following them in your production environment.

Symptom:

❯ KUBECONFIG=tkc.kubeconfig kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid

Troubleshooting:

Verify the certificate expiry of the tkc kubeconfig file itself.

❯ grep client-certificate-data tkc.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Mar  7 18:26:10 2024 GMT

Create a jumpbox pod and ssh to TKC control plane nodes.
Verify system pods and check logs from apiserver and etcd pods. Sample etcd pod logs are given below:

2023-04-11 07:09:00.268792 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268835 W | rafthttp: health check for peer b5bab7da6e326a7c could not connect: x509: certificate has expired or is not yet valid: current time 2023-04-11T07:08:57Z is after 2023-04-06T06:17:56Z
2023-04-11 07:09:00.268841 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.268869 W | rafthttp: health check for peer 19b6b0bf00e81f0b could not connect: remote error: tls: bad certificate
2023-04-11 07:09:00.310030 I | embed: rejected connection from "172.31.20.27:35362" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.312806 I | embed: rejected connection from "172.31.20.27:35366" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.321449 I | embed: rejected connection from "172.31.20.19:35034" (error "remote error: tls: bad certificate", ServerName "")
2023-04-11 07:09:00.322192 I | embed: rejected connection from "172.31.20.19:35036" (error "remote error: tls: bad certificate", ServerName "")

Verify whether admin.conf inside the control plane node has expired.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Mar  8 18:10:15 2022 GMT
notAfter=Apr  6 06:05:46 2023 GMT

Verify Kubernetes component certs in all the control plane nodes.

root [ /etc/kubernetes ]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 06, 2023 06:05 UTC   <invalid>                               no
apiserver                  Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 06, 2023 06:05 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 06, 2023 06:05 UTC   <invalid>                               no
etcd-healthcheck-client    Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 06, 2023 06:05 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 06, 2023 06:05 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 06, 2023 06:05 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 05, 2032 18:15 UTC   8y              no
etcd-ca                 Mar 05, 2032 18:15 UTC   8y              no
front-proxy-ca          Mar 05, 2032 18:15 UTC   8y              no

Workaround:

Renew Kubernetes component certs on control plane nodes if expired using kubeadm certs renew all.

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Verify:

Verify using the following steps on all the TKC control plane nodes.

root [ /etc/kubernetes ]# grep client-certificate-data admin.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

root [ /etc/kubernetes ]# kubeadm certs check-expiration

Try connecting to the TKC using tkc.kubeconfig.

KUBECONFIG=tkc.kubeconfig kubectl get node

Hope it was useful. Cheers!

References:

https://kb.vmware.com/s/article/86251

https://kb.vmware.com/s/article/89324

Saturday, February 4, 2023

vSphere with Tanzu using NSX-T - Part23 - Supervisor cluster certificates expiry

Note that the supervisor control plane component certificates will expire after one year.

Here is the VMware KB: https://kb.vmware.com/s/article/89324

NOTE: If certificates expire on the Supervisor or Guest Clusters, access and management of the clusters will fail, and you will need to raise a case with the VMware support team for assistance.

Keep a note of this cert expiry date. If you update the supervisor cluster at least once a year, these certs will get updated.

Here is a quick way to check the expiry of the supervisor control plane certs.

❯ k config current-context
sc2-06-d5165f-vc01
❯
❯ k cluster-info
Kubernetes control plane is running at https://10.43.69.117:6443
KubeDNS is running at https://10.43.69.117:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
❯
❯ echo | openssl s_client -servername 10.43.69.117 -connect  10.43.69.117:6443 | openssl x509 -noout -dates
depth=0 CN = kube-apiserver
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = kube-apiserver
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
notBefore=Jun  2 09:36:17 2023 GMT
notAfter=Jun  1 09:36:18 2024 GMT
❯

Thanks to my friend Ravikrithik Udainath for the above openssl tip!

I am using the admin kubeconfig of the supervisor cluster. Here is the link to my previous article on exporting WCP admin kubeconfig file. In this case, 10.43.69.117 is the floating IP for the supervisor control plane and it is assigned to one of the supervisor control plane VMs.

This vSphere with Tanzu cluster was deployed on June 02, 2023, and as you can see above, the certificate expiry will be after one year, which in this case is June 01, 2024.

You can set up some sort of monitoring/alerting for all your supervisor clusters to get notified ahead of these expiry dates.
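One possible shape for such a check is sketched below in Python. It assumes the notAfter values have already been collected per cluster, for example with the openssl command shown earlier, and it flags clusters whose certificate expires within a chosen number of days. The function names and the 30-day default are hypothetical choices.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before an openssl-style notAfter timestamp; negative if expired."""
    # Accept both "notAfter=Jun  1 09:36:18 2024 GMT" and the bare timestamp.
    timestamp = not_after.split("=", 1)[-1].strip()
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(timestamp), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).total_seconds() / 86400


def clusters_to_alert(expiries, threshold_days=30, now=None):
    """Return the cluster names whose certificate expires within threshold_days."""
    return sorted(
        name
        for name, not_after in expiries.items()
        if days_until_expiry(not_after, now) <= threshold_days
    )
```

The now parameter exists mainly so the logic can be exercised with a fixed date; in a scheduled job it can be left unset.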

Hope it was useful. Cheers!

Sunday, November 13, 2022

vSphere with Tanzu using NSX-T - Part20 - Safely deleting NotReady nodes from a TKC

In this article we will look at a TKC that is stuck in the updating phase and has multiple Kubernetes nodes in the NotReady state.

jtimothy-napp01     gc    updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

❯ gcc kg no | grep NotReady | wc -l
       5

❯ gcc kg no
NAME                                STATUS                        ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready                         control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                         control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-tnhv9              NotReady                      control-plane,master   63d    v1.20.9+vmware.1
gc-control-plane-tqvnk              NotReady                      control-plane,master   50d    v1.20.9+vmware.1
gc-control-plane-wsclb              NotReady                      <none>                 8d     v1.20.9+vmware.1
gc-control-plane-wt6sx              NotReady                      <none>                 30d    v1.20.9+vmware.1
gc-control-plane-zthnq              NotReady                      control-plane,master   49d    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                         <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                         <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                         <none>                 458d   v1.20.9+vmware.1

Note: gcc is an alias that I am using for KUBECONFIG=gckubeconfig, where gckubeconfig is the kubeconfig file for the TKC under consideration.

Let's verify where the etcd pods are running.

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>

You can see etcd pods are running on nodes that are in Ready status. So now we can go ahead and safely drain and delete the nodes that are NotReady.
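The check described above (NotReady nodes must not be hosting an etcd pod) can be sketched in Python as follows. It works on the plain-text output of the two commands shown earlier and assumes their column layout: STATUS is the second column of the node listing, and NODE is the eighth column of the wide, all-namespaces pod listing. The function names are hypothetical.

```python
def not_ready_nodes(nodes_output):
    """Names of nodes whose STATUS column starts with NotReady."""
    names = []
    for line in nodes_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith("NotReady"):
            names.append(fields[0])
    return names


def etcd_nodes(etcd_pods_output):
    """Node names hosting etcd pods, read from the NODE column (index 7)."""
    nodes = set()
    for line in etcd_pods_output.strip().splitlines():
        fields = line.split()
        if len(fields) > 7:
            nodes.add(fields[7])
    return nodes


def safe_to_delete(nodes_output, etcd_pods_output):
    """NotReady nodes that do not host an etcd pod."""
    hosting_etcd = etcd_nodes(etcd_pods_output)
    return [name for name in not_ready_nodes(nodes_output) if name not in hosting_etcd]
```

Any NotReady node missing from the result still hosts an etcd pod and deserves a closer look before it is drained or deleted.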

❯ notreadynodes=$(gcc kubectl get nodes | grep NotReady | awk '{print $1;}')

❯ echo $notreadynodes
gc-control-plane-tnhv9
gc-control-plane-tqvnk
gc-control-plane-wsclb
gc-control-plane-wt6sx
gc-control-plane-zthnq

❯ echo "$notreadynodes" | while IFS= read -r line ; do echo $line; gcc kubectl drain $line --ignore-daemonsets; gcc kubectl delete node $line; echo "----"; done

gc-control-plane-tnhv9
node/gc-control-plane-tnhv9 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nzbgq, kube-system/kube-proxy-2jqqr, vmware-system-csi/vsphere-csi-node-46g6r
node/gc-control-plane-tnhv9 drained
node "gc-control-plane-tnhv9" deleted
----
gc-control-plane-tqvnk
node/gc-control-plane-tqvnk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-45xfc, kube-system/kube-proxy-dxrkr, vmware-system-csi/vsphere-csi-node-wrvlk
node/gc-control-plane-tqvnk drained
node "gc-control-plane-tqvnk" deleted
----
gc-control-plane-wsclb
node/gc-control-plane-wsclb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-5t254, kube-system/kube-proxy-jt2dp, vmware-system-csi/vsphere-csi-node-w2bhf
node/gc-control-plane-wsclb drained
node "gc-control-plane-wsclb" deleted
----
gc-control-plane-wt6sx
node/gc-control-plane-wt6sx already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-24pn5, kube-system/kube-proxy-b5vl5, vmware-system-csi/vsphere-csi-node-hfjdw
node/gc-control-plane-wt6sx drained
node "gc-control-plane-wt6sx" deleted
----
gc-control-plane-zthnq
node/gc-control-plane-zthnq already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-vp895, kube-system/kube-proxy-8mg8n, vmware-system-csi/vsphere-csi-node-hs22g
node/gc-control-plane-zthnq drained
node "gc-control-plane-zthnq" deleted
----
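
The grep in the step above matches the string NotReady anywhere on a line. As a rough alternative, here is a small Python sketch that reads the text output of kubectl get nodes, picks nodes by their STATUS column only, and builds the drain and delete commands without running them. The function names and the sample text (including the age of the NotReady node) are my own illustrative assumptions, not part of the original workflow.

```python
def not_ready_nodes(get_nodes_output):
    """Return node names whose STATUS column contains NotReady."""
    names = []
    # Skip the header row, then look at the first two columns only.
    for line in get_nodes_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. NotReady,SchedulingDisabled.
        if "NotReady" in status.split(","):
            names.append(name)
    return names


def drain_and_delete_commands(node_names):
    """Build (but do not run) the kubectl commands for each node."""
    commands = []
    for name in node_names:
        commands.append(["kubectl", "drain", name, "--ignore-daemonsets"])
        commands.append(["kubectl", "delete", "node", name])
    return commands


# Illustrative sample, shaped like the kubectl output in this post.
sample_output = """\
NAME                     STATUS                        ROLES                  AGE    VERSION
gc-control-plane-2rbsb   Ready                         control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-tnhv9   NotReady,SchedulingDisabled   control-plane,master   3d     v1.20.9+vmware.1
"""

for command in drain_and_delete_commands(not_ready_nodes(sample_output)):
    print(" ".join(command))
```

Printing the commands first gives you a chance to review the node list before anything is drained or deleted.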

❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready    control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯
❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01    gc       updating       2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3

I waited a few minutes to see whether the reconciliation process would proceed and change the status of the TKC from updating to running, but it was still stuck in the updating phase. So I described the TKC.

Conditions:
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-01-01T19:19:45Z
    Status:                True
    Type:                  AddonsReady
    Last Transition Time:  2022-12-30T19:47:15Z
    Message:               Rolling 1 replicas with outdated spec (2 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-07-24T15:53:06Z
    Status:                True
    Type:                  NodePoolsReady
    Last Transition Time:  2022-09-01T09:02:26Z
    Message:               3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy
    Status:                True
    Type:                  NodesHealthy
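
When there are many conditions, it helps to list only the ones that are not True. The following is a minimal sketch, assuming the conditions are available as a list of dictionaries with lowercase keys (type, status, reason, message), as they would appear in the JSON form of the object. The function name failing_conditions is hypothetical and the sample data is copied from the describe output above.

```python
def failing_conditions(conditions):
    """Return (type, reason, message) for every condition whose status is not "True"."""
    return [
        (c.get("type"), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Sample data taken from the describe output in this post.
conditions = [
    {"type": "Ready", "status": "False", "reason": "RollingUpdateInProgress",
     "message": "Rolling 1 replicas with outdated spec (2 replicas up to date)"},
    {"type": "AddonsReady", "status": "True"},
    {"type": "ControlPlaneReady", "status": "False", "reason": "RollingUpdateInProgress",
     "message": "Rolling 1 replicas with outdated spec (2 replicas up to date)"},
    {"type": "NodePoolsReady", "status": "True"},
    {"type": "NodesHealthy", "status": "True",
     "message": "3/3 Control Plane Node(s) healthy. 3/3 Worker Node(s) healthy"},
]

for ctype, reason, message in failing_conditions(conditions):
    print(f"{ctype}: {reason} - {message}")
```

Here only Ready and ControlPlaneReady are reported, both pointing at the control plane rolling update.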

Next, I checked the vmop logs.

vmware-system-vmop/vmware-system-vmop-controller-manager-85d8986b94-xzd9h[manager]: E0103 08:43:51.449422       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.31.14.6:6443: connect: connection refused" "vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1

The readiness probe to 172.31.14.6:6443 is being refused, which indicates that something is wrong with the CP node gc-control-plane-2rbsb.
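
If you have many such log lines, the affected node can be pulled out with a small parser. This is only a sketch: it assumes the fields appear as "key"="value" pairs exactly as in the line above, and the helper name parse_log_fields is my own.

```python
import re


def parse_log_fields(line):
    """Extract "key"="value" pairs (quoted values only) from a log line."""
    return dict(re.findall(r'"([^"]+)"="([^"]*)"', line))


# Log line copied from the vmop controller output in this post.
log_line = (
    'readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" '
    '"error"="dial tcp 172.31.14.6:6443: connect: connection refused" '
    '"vmName"="jtimothy-napp01/gc-control-plane-2rbsb" "result"=-1'
)

fields = parse_log_fields(log_line)
# vmName has the form <namespace>/<machine name>.
namespace, _, node = fields["vmName"].partition("/")
print(namespace, node, fields["error"])
```

Unquoted values such as "result"=-1 are skipped by this pattern on purpose.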

❯ gcc kg po -A -o wide | grep etcd
kube-system                    etcd-gc-control-plane-2rbsb                         0/1     Running            811        410d    172.31.14.6       gc-control-plane-2rbsb              <none>           <none>
kube-system                    etcd-gc-control-plane-5zjn4                         1/1     Running            1          124d    172.31.14.7       gc-control-plane-5zjn4              <none>           <none>
kube-system                    etcd-gc-control-plane-9t97w                         1/1     Running            1          123d    172.31.14.8       gc-control-plane-9t97w              <none>           <none>
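
The READY and RESTARTS columns are the interesting part of this output. As a rough sketch, the snippet below flags pods whose ready count is lower than the total. It assumes the column layout shown above (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE, IP, NODE) with a plain integer in RESTARTS, and the function name unready_pods is hypothetical.

```python
def unready_pods(lines):
    """Return (pod, node, restarts) for pods whose READY count is below the total."""
    result = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue
        _, pod, ready, _, restarts, _, _, node = fields[:8]
        ready_count, total = (int(part) for part in ready.split("/"))
        if ready_count < total:
            result.append((pod, node, int(restarts)))
    return result


# Rows copied from the command output in this post.
etcd_rows = [
    "kube-system  etcd-gc-control-plane-2rbsb  0/1  Running  811  410d  172.31.14.6  gc-control-plane-2rbsb  <none>  <none>",
    "kube-system  etcd-gc-control-plane-5zjn4  1/1  Running  1    124d  172.31.14.7  gc-control-plane-5zjn4  <none>  <none>",
    "kube-system  etcd-gc-control-plane-9t97w  1/1  Running  1    123d  172.31.14.8  gc-control-plane-9t97w  <none>  <none>",
]

for pod, node, restarts in unready_pods(etcd_rows):
    print(f"{pod} on {node} is not ready ({restarts} restarts)")
```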

You can see that the etcd pod on the first control plane node is not ready (0/1) and keeps getting restarted (811 restarts). So let's try to drain the CP node gc-control-plane-2rbsb.

❯ gcc k drain gc-control-plane-2rbsb
node/gc-control-plane-2rbsb cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "gc-control-plane-2rbsb", aborting command...

There are pending nodes to be drained:
 gc-control-plane-2rbsb
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
cannot delete Pods with local storage (use --delete-emptydir-data to override): vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn

❯ gcc k drain gc-control-plane-2rbsb --ignore-daemonsets --delete-emptydir-data
node/gc-control-plane-2rbsb already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-bdjp7, kube-system/kube-proxy-v9cqf, vmware-system-auth/guest-cluster-auth-svc-n4h2k, vmware-system-csi/vsphere-csi-node-djhpv
evicting pod vmware-system-csi/vsphere-csi-controller-b4fd6878d-zw5hn
pod/vsphere-csi-controller-b4fd6878d-zw5hn evicted
node/gc-control-plane-2rbsb evicted

❯ gcc kg no
NAME                                STATUS                     ROLES                  AGE    VERSION
gc-control-plane-2rbsb              Ready,SchedulingDisabled   control-plane,master   410d   v1.20.9+vmware.1
gc-control-plane-5zjn4              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready                      control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready                      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready                      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready                      <none>                 458d   v1.20.9+vmware.1

Now let's delete its corresponding machine object.

❯ k delete machine.cluster.x-k8s.io/gc-control-plane-2rbsb -n jtimothy-napp01
machine.cluster.x-k8s.io "gc-control-plane-2rbsb" deleted
❯
❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1
❯

After a few minutes, you can see that a new machine and the corresponding node got provisioned, and the TKC changed from the updating phase to the running phase.

❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE          AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running        124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running        123d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc                                                                                             Provisioning   13s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running        458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running        456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running        458d   v1.20.9+vmware.1



❯ kg machine -n jtimothy-napp01
NAME                                CLUSTER   NODENAME                            PROVIDERID                                       PHASE     AGE    VERSION
gc-control-plane-5zjn4              gc        gc-control-plane-5zjn4              vsphere://42015c9c-feed-5eda-6fbe-f0da5d1434ea   Running   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              gc        gc-control-plane-9t97w              vsphere://4201377e-0f46-40b6-e222-9c723c6adb19   Running   124d   v1.20.9+vmware.1
gc-control-plane-dnr66              gc        gc-control-plane-dnr66              vsphere://42011228-b156-3338-752a-e7233c9258dd   Running   2m2s   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   gc        gc-workers-ztr5c-6f4b555879-2v8pl   vsphere://420139b4-83f1-824f-7bd2-ed073a5dcf37   Running   458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   gc        gc-workers-ztr5c-6f4b555879-8qs4p   vsphere://4201d8ac-9cc2-07ac-c352-9f7e812b4367   Running   456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   gc        gc-workers-ztr5c-6f4b555879-r29d5   vsphere://42017666-8cb4-2767-5d0b-1d3dc9219db3   Running   458d   v1.20.9+vmware.1
❯
❯ gcc kg no
NAME                                STATUS     ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready      control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready      control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              NotReady   control-plane,master   35s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready      <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready      <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready      <none>                 458d   v1.20.9+vmware.1


❯ gcc kg no
NAME                                STATUS   ROLES                  AGE    VERSION
gc-control-plane-5zjn4              Ready    control-plane,master   124d   v1.20.9+vmware.1
gc-control-plane-9t97w              Ready    control-plane,master   123d   v1.20.9+vmware.1
gc-control-plane-dnr66              Ready    control-plane,master   53s    v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-2v8pl   Ready    <none>                 458d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-8qs4p   Ready    <none>                 456d   v1.20.9+vmware.1
gc-workers-ztr5c-6f4b555879-r29d5   Ready    <none>                 458d   v1.20.9+vmware.1

❯ kgtkca | grep jtimothy-napp01
jtimothy-napp01     gc     running      2021-07-29T16:59:34Z   v1.20.9+vmware.1-tkg.1.a4cee5b     3     3
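
Instead of re-running the command by hand while waiting, the check can be wrapped in a simple polling loop. The following is a minimal sketch, not what I actually ran: wait_for_phase is a hypothetical helper, and get_phase stands for any callable you supply that returns the current TKC phase as a string.

```python
import time


def wait_for_phase(get_phase, wanted="running", timeout=600, interval=15,
                   sleep=time.sleep):
    """Poll get_phase() until it returns `wanted`.

    Returns True on success and False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_phase() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Simulated phases so that the sketch runs without a cluster.
phases = iter(["updating", "updating", "running"])
reached = wait_for_phase(lambda: next(phases), sleep=lambda seconds: None)
print("TKC reached running phase:", reached)
```

The sleep function is a parameter only so that the loop can be exercised quickly without real waiting.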

Hope it was useful. Cheers!

Sunday, October 9, 2022

Working with Kubernetes using Python - Part 06 - Create namespace

The following code snippet uses the Python client for the Kubernetes API to create a namespace. You need to specify the kubeconfig file and the context to use for creating the namespace. This is useful when you work with multiple kubeconfig files, each of which may contain multiple K8s clusters.

from kubernetes import client, config
import argparse


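def load_kubeconfig(kubeconfig_file, context_name):
    """Load the given context from the kubeconfig file and return a CoreV1Api client."""
    try:
        config.load_kube_config(config_file=kubeconfig_file, context=context_name)
    except config.ConfigException as err:
        print(err)
        # Chain the original error so that the root cause stays in the traceback.
        raise Exception("Could not configure kubernetes python client!") from err
    v1 = client.CoreV1Api()
    return v1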


def create_ns(v1, ns_name):
    print("Creating namespace")
    namespace = client.V1Namespace(metadata={"name": ns_name})
    ret = v1.create_namespace(namespace)
    print(ret)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    parser.add_argument("-f", "--file", required=True, help="Kubeconfig file")
    args = parser.parse_args()

    context = args.context

    v1 = load_kubeconfig(args.file, context)

    ns_name = input("Enter namespace name: ")
    create_ns(v1, ns_name)


if __name__ == "__main__":
    main()

Following is sample output:

❯ python3 create_namespace.py -c tkc-admin@tkc -f /Users/vineethac/testing/ccs/tkc.kubeconfig
Enter namespace name: vineethac-test11
Creating namespace
{'api_version': 'v1',
 'kind': 'Namespace',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': datetime.datetime(2022, 12, 7, 11, 57, 17, tzinfo=tzutc()),
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': {'kubernetes.io/metadata.name': 'vineethac-test11'},
              'managed_fields': [{'api_version': 'v1',
                                  'fields_type': 'FieldsV1',
                                  'fields_v1': {'f:metadata': {'f:labels': {'.': {},
                                                                            'f:kubernetes.io/metadata.name': {}}}},
                                  'manager': 'OpenAPI-Generator',
                                  'operation': 'Update',
                                  'time': datetime.datetime(2022, 12, 7, 11, 57, 17, tzinfo=tzutc())}],
              'name': 'vineethac-test11',
              'namespace': None,
              'owner_references': None,
              'resource_version': '5518430',
              'self_link': None,
              'uid': '0e9f1211-e09f-4d2d-b475-8995bb0c0907'},
 'spec': {'finalizers': ['kubernetes']},
 'status': {'conditions': None, 'phase': 'Active'}}
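
Since the namespace name comes from user input, it can be worth validating it before calling the API. Kubernetes namespace names must be valid RFC 1123 DNS labels: at most 63 characters, lowercase alphanumerics or '-', starting and ending with an alphanumeric character. The helper below is a hypothetical addition and not part of the script above.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics or '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_namespace_name(name):
    """Return True if `name` is a valid RFC 1123 label of at most 63 characters."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


for candidate in ("vineethac-test11", "Test_NS", "-leading-dash"):
    print(candidate, is_valid_namespace_name(candidate))
```

A check like this could run right after the input() prompt, so that an invalid name is rejected before the create request is sent.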

Sunday, September 25, 2022

Working with Kubernetes using Python - Part 05 - Get pods

The following code snippet uses the Python client for the Kubernetes API to get all pods, or only the pods under a specific namespace, for a given context:

from kubernetes import client, config
import argparse


def load_kubeconfig(context_name):
    config.load_kube_config(context=f"{context_name}")
    v1 = client.CoreV1Api()
    return v1


def get_all_pods(v1):
    print("Listing all pods:")
    ret = v1.list_pod_for_all_namespaces(watch=False)
    for i in ret.items:
        print(i.metadata.namespace, i.metadata.name, i.status.phase)


def get_namespaced_pods(v1, ns):
    print(f"Listing all pods under namespace {ns}:")
    ret = v1.list_namespaced_pod(f"{ns}")
    for i in ret.items:
        print(i.metadata.namespace, i.metadata.name, i.status.phase)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    parser.add_argument("-n", "--namespace", required=False, help="K8s namespace")
    args = parser.parse_args()

    context = args.context
    v1 = load_kubeconfig(context)

    if not args.namespace:
        get_all_pods(v1)
    else:
        get_namespaced_pods(v1, args.namespace)


if __name__ == "__main__":
    main()
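
On a busy cluster the full pod listing gets long, so a per-phase summary can be handy. The sketch below is an optional extra and not part of the script above. It assumes each item exposes status.phase, as the pod objects printed above do, and uses stand-in objects so that it runs without a cluster. The name count_by_phase is hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_phase(pods):
    """Count pods per phase; each pod must expose `status.phase`."""
    return Counter(pod.status.phase for pod in pods)


# Stand-ins for the items of a pod list (only status.phase is populated).
fake_pods = [
    SimpleNamespace(status=SimpleNamespace(phase=phase))
    for phase in ("Running", "Running", "Pending")
]

for phase, count in sorted(count_by_phase(fake_pods).items()):
    print(phase, count)
```

With the real client, you would pass ret.items from either of the two list calls.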

Saturday, June 11, 2022

Working with Kubernetes using Python - Part 04 - Get namespaces

The following code snippet uses the Python client for the Kubernetes API to get namespace details from a given context:

from kubernetes import client, config
import argparse


def load_kubeconfig(context_name):
    config.load_kube_config(context=f"{context_name}")
    v1 = client.CoreV1Api()
    return v1


def get_all_namespace(v1):
    print("Listing namespaces with their creation timestamp, and status:")
    ret = v1.list_namespace()
    for i in ret.items:
        print(i.metadata.name, i.metadata.creation_timestamp, i.status.phase)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--context", required=True, help="K8s context")
    args = parser.parse_args()

    context = args.context
    v1 = load_kubeconfig(context)
    get_all_namespace(v1)


if __name__ == "__main__":
    main()

Friday, April 8, 2022

Working with Kubernetes using Python - Part 03 - Get nodes

The following code snippet uses the kubeconfig Python module to switch context and the Python client for the Kubernetes API to get cluster node details. It takes the default kubeconfig file, switches to the required context, and gets the node info of the respective cluster.

kubectl commands:

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>
kubectl get nodes -o json
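
The JSON from the last command is easy to post-process in Python. The following is a minimal sketch that works on the kubectl JSON output rather than on the kubeconfig module or the Kubernetes Python client. The function name node_summary and the two sample nodes are illustrative assumptions.

```python
import json


def node_summary(nodes):
    """Return (name, ready, kubelet_version) per node from parsed
    `kubectl get nodes -o json` output."""
    rows = []
    for item in nodes.get("items", []):
        status = item.get("status", {})
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        version = status.get("nodeInfo", {}).get("kubeletVersion", "")
        rows.append((item["metadata"]["name"], ready, version))
    return rows


# Trimmed, made-up sample; real output carries many more fields.
sample_nodes = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}],
              "nodeInfo": {"kubeletVersion": "v1.20.9+vmware.1"}}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}],
              "nodeInfo": {"kubeletVersion": "v1.20.9+vmware.1"}}}
]}
""")

for name, ready, version in node_summary(sample_nodes):
    print(name, "Ready" if ready else "NotReady", version)
```

In practice, the JSON text could come from running kubectl get nodes -o json through subprocess.run and passing its stdout to json.loads.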

Code:

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/
https://github.com/kubernetes-client/python

Hope it was useful. Cheers!

Saturday, March 19, 2022

Working with Kubernetes using Python - Part 02 - Switch context

The following code snippet uses the kubeconfig Python module. It takes the default kubeconfig file and switches to a new context.

kubectl commands:

kubectl config get-contexts
kubectl config current-context
kubectl config use-context <context_name>

Code:

Note: If you want to use a specific kubeconfig file, use conf = KubeConfig('path-to-your-kubeconfig') instead of conf = KubeConfig().

Reference:

https://kubeconfig-python.readthedocs.io/en/latest/

Hope it was useful. Cheers!

Friday, March 4, 2022

Working with Kubernetes using Python - Part 01 - List contexts

The following code snippet uses the Python client for the Kubernetes API. It takes the default kubeconfig file and lists the contexts and the active context.

kubectl commands:

kubectl config get-contexts
kubectl config current-context

Code:

Note: If you want to use a specific kubeconfig file, instead of

contexts, active_context = config.list_kube_config_contexts()

you can use

contexts, active_context = config.list_kube_config_contexts(config_file="path-to-kubeconfig")
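
The two values returned by this call can be formatted similarly to kubectl config get-contexts. This is a rough sketch, assuming each context is a dictionary with a name key and a nested context dictionary, and that active_context has the same shape. The sample data and the helper name format_contexts are made up for illustration.

```python
def format_contexts(contexts, active_context):
    """Return one line per context, marking the active one with '*'."""
    active_name = active_context["name"] if active_context else None
    lines = []
    for ctx in contexts:
        marker = "*" if ctx["name"] == active_name else " "
        cluster = ctx.get("context", {}).get("cluster", "")
        lines.append(f"{marker} {ctx['name']} (cluster: {cluster})")
    return lines


# Made-up data in the assumed shape of (contexts, active_context).
sample_contexts = [
    {"name": "dev-admin@dev", "context": {"cluster": "dev", "user": "dev-admin"}},
    {"name": "tkc-admin@tkc", "context": {"cluster": "tkc", "user": "tkc-admin"}},
]
sample_active = sample_contexts[1]

print("\n".join(format_contexts(sample_contexts, sample_active)))
```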

Reference:

https://github.com/kubernetes-client/python

Hope it was useful. Cheers!

Sunday, January 30, 2022

vSphere with Tanzu using NSX-T - Part14 - Testing TKC storage using kubestr

In the previous posts we discussed the following:

Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster
Part11 - Troubleshooting TKC
Part12 - Deploy application on TKC and access it
Part13 - Export WCP admin kubeconfig

This article is about using kubestr to test the storage options of a Tanzu Kubernetes Cluster (TKC). Following are the steps to install kubestr on macOS:

wget https://github.com/kastenhq/kubestr/releases/download/v0.4.31/kubestr_0.4.31_MacOS_amd64.tar.gz
tar -xvf kubestr_0.4.31_MacOS_amd64.tar.gz
chmod +x kubestr
mv kubestr /usr/local/bin

Now, let's run kubestr help.

% kubestr help
kubestr is a tool that will scan your k8s cluster
       and validate that the storage systems in place as well as run
       performance tests.

Usage:
kubestr [flags]
kubestr [command]

Available Commands:
browse      Browse the contents of a CSI PVC via file browser
csicheck    Runs the CSI snapshot restore check
fio         Runs an fio test
help        Help about any command

Flags:
-h, --help             help for kubestr
-e, --outfile string   The file where test results will be written
-o, --output string    Options(json)

Use "kubestr [command] --help" for more information about a command.

I am going to use the following TKC for testing.

% KUBECONFIG=gc.kubeconfig kubectl get nodes
NAME                               STATUS   ROLES                  AGE    VERSION
gc-control-plane-pwngg             Ready    control-plane,master   103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs   Ready    <none>                 103d   v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n   Ready    <none>                 103d   v1.20.9+vmware.1

Let's run kubestr against the cluster now.

% KUBECONFIG=gc.kubeconfig kubestr

**************************************
_ ___   _ ___ ___ ___ _____ ___
| |/ / | | | _ ) __/ __|_   _| _ \
| ' <| |_| | _ \ _|\__ \ | | |   /
|_|\_\\___/|___/___|___/ |_| |_|_\

Explore your Kubernetes storage options
**************************************
Kubernetes Version Check:
Valid kubernetes version (v1.20.9+vmware.1) - OK

RBAC Check:
Kubernetes RBAC is enabled - OK

Aggregated Layer Check:
The Kubernetes Aggregated Layer is enabled - OK

W0130 14:17:16.937556   87541 warnings.go:70] storage.k8s.io/v1beta1 CSIDriver is deprecated in v1.19+, unavailable in v1.22+; use storage.k8s.io/v1 CSIDriver
Available Storage Provisioners:

csi.vsphere.xxxx.com:
    Can't find the CSI snapshot group api version.
    This is a CSI driver!
    (The following info may not be up to date. Please check with the provider for more information.)
    Provider:            vSphere
    Website:             https://github.com/kubernetes-sigs/vsphere-csi-driver
    Description:         A Container Storage Interface (CSI) Driver for VMware vSphere
    Additional Features: Raw Block,<br/><br/>Expansion (Block Volume),<br/><br/>Topology Aware (Block Volume)

    Storage Classes:
      * sc2-01-vc16c01-wcp-mgmt

    To perform a FIO test, run-
      ./kubestr fio -s <storage class>

You can run storage tests using kubestr, which uses FIO to generate the IO. For example, this is how you can run a basic storage test.

% KUBECONFIG=gc.kubeconfig kubestr fio -s sc2-01-vc16c01-wcp-mgmt -z 10G
PVC created kubestr-fio-pvc-zvdhr
Pod created kubestr-fio-pod-kdbs5
Running FIO test (default-fio) on StorageClass (sc2-01-vc16c01-wcp-mgmt) with a PVC of Size (10G)
Elapsed time- 29.290421119s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=3987.150391 BW(KiB/s)=15965
iops: min=3680 max=4274 avg=3992.034424
bw(KiB/s): min=14720 max=17096 avg=15968.827148

JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=3562.628906 BW(KiB/s)=14267
iops: min=3237 max=3750 avg=3565.896484
bw(KiB/s): min=12950 max=15000 avg=14264.862305

JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=2988.549316 BW(KiB/s)=383071
iops: min=2756 max=3252 avg=2992.344727
bw(KiB/s): min=352830 max=416256 avg=383056.187500

JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=2754.796143 BW(KiB/s)=353151
iops: min=2480 max=2992 avg=2759.586182
bw(KiB/s): min=317440 max=382976 avg=353242.781250

Disk stats (read/write):
sdd: ios=117160/105647 merge=0/1210 ticks=2100090/2039676 in_queue=4139076, util=99.608589%
- OK

As you can see, a 10G PVC and an FIO pod are created and used for the FIO test. Once the test is complete, the PVC and the FIO pod are deleted automatically.
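
To compare runs across storage classes, the headline numbers can be extracted from the text output. The following is a minimal sketch, assuming the output format shown above (a JobName: line followed later by an IOPS=... BW(KiB/s)=... line). The function name parse_fio_summary is hypothetical and the sample is a trimmed copy of the results above.

```python
import re


def parse_fio_summary(text):
    """Map each JobName to its headline IOPS and bandwidth in KiB/s."""
    results = {}
    job = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("JobName:"):
            job = line.split(":", 1)[1].strip()
            continue
        match = re.match(r"IOPS=([\d.]+)\s+BW\(KiB/s\)=(\d+)", line)
        if match and job:
            results[job] = {
                "iops": float(match.group(1)),
                "bw_kib_s": int(match.group(2)),
            }
    return results


# Trimmed copy of the FIO results shown in this post.
fio_output = """\
JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=3987.150391 BW(KiB/s)=15965
iops: min=3680 max=4274 avg=3992.034424

JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=2988.549316 BW(KiB/s)=383071
"""

for job, stats in parse_fio_summary(fio_output).items():
    print(f"{job}: {stats['iops']:.0f} IOPS, {stats['bw_kib_s'] / 1024:.1f} MiB/s")
```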

I hope it was useful. Cheers!
