vineethac.blogspot.com: tanzu

Showing posts with label tanzu. Show all posts

Thursday, July 4, 2024

vSphere with Tanzu using NSX-T - Part35 - Monitoring supervisor cluster health with Python and vCenter APIs

vSphere with Tanzu Supervisor cluster is a Kubernetes platform that simplifies the deployment, management, and scaling of Kubernetes clusters. Monitoring the health of your WCP/ Supervisor clusters is crucial to ensure the smooth running of your Tanzu Kubernetes Clusters (TKCs) and applications. In this blog post, we'll explore how to use Python and vCenter APIs to verify the health of your Supervisor clusters.

You can access the Python script from my GitHub repository: https://github.com/vineethac/VMware/tree/main/vSphere_with_Tanzu/wcp_cluster_health

This script connects to the vCenter server, retrieves the cluster summary, and checks the Tanzu Supervisor cluster configuration info and prints the status of the cluster. By using this Python script, you can easily monitor the health of your Tanzu Supervisor clusters through vCenter APIs.

Hope it was useful. Cheers!

Monday, July 1, 2024

vSphere with Tanzu using NSX-T - Part34 - CPU and Memory utilization of a supervisor cluster

vSphere with Tanzu is a Kubernetes-based platform for deploying and managing containerized applications. As with any cloud-native platform, it's essential to monitor the performance and utilization of the underlying infrastructure to ensure optimal resource allocation and avoid any potential issues. In this blog post, we'll explore a Python script that can be used to check the CPU and memory allocation/ usage of a WCP Supervisor cluster.

You can access the Python script from my GitHub repository: https://github.com/vineethac/VMware/tree/main/vSphere_with_Tanzu/wcp_cluster_util

Sample screenshot of the output

The script uses the Kubernetes Python client library (kubernetes) to connect to the Supervisor cluster using the admin kubeconfig and retrieve information about the nodes and their resource utilization. The script then calculates the average CPU and memory utilization across all nodes and prints the results to the console.

Note: In my case instead of running it as a script every time, I made it an executable plugin and copied it to the system executable path. I placed it in $HOME/.krew/bin in my laptop.

Hope it was useful. Cheers!

Thursday, June 27, 2024

vSphere with Tanzu using NSX-T - Part33 - Troubleshooting intermittent connection timeouts to apiserver and workloads

In the realm of managing Tanzu Kubernetes clusters (TKCs), we have encountered several challenges that hindered the smooth functioning of our applications. In this blog post, we will discuss three such cases and the workarounds we employed to resolve them.

Case 1: TKC Control Plane Node Connectivity Issues

Symptoms:

TKC apiserver connection timeouts when attempting to connect using the kubeconfig.
Traffic was not flowing to two of the control plane nodes.
NSX-T web UI LB VS stats indicated this issue.

Case 2: TKC Worker Node Connectivity Issues

Symptoms:

Workload (example: PostgreSQL cluster) connection timeouts.
Traffic was not flowing to two of the worker nodes in the TKC.
NSX-T web UI LB VS stats indicated this issue.

Case 3: Load Balancer Connectivity Issues

Symptoms:

Connection timeouts when attempting to connect to a PostgreSQL workload through the load balancer VS IP.
This issue was observed only when creating new services of type LoadBalancer in the TKC.
We noticed datapath mempool usage for the edge nodes was above the threshold value.

Resolution/ work around

Find the T1 router that is attached to the TKC which has connectivity issues.
In an Active - Standby HA configuration, you will see that there will be one Edge node that will be Active and another one in Standby status.
First place the Standby Edge node in NSX MM, reboot it, and then exit it from NSX MM.
Now, place the Active Edge node in NSX MM, there will be a slight network disruption during this failover, once it is in NSX MM, reboot it, and then exit NSX MM.
This should resolved the issue.

In conclusion, these cases illustrate the importance of verifying NSX-T components when managing Tanzu Kubernetes clusters. By identifying the root cause of the issues and employing effective workarounds, we were able to restore functionality and maintain the health of our applications. Stay tuned for more insights and best practices in managing Kubernetes clusters.

Hope it was useful. Cheers!

Wednesday, June 26, 2024

vSphere with Tanzu using NSX-T - Part32 - Troubleshooting BGP related issues

This article provides basic guidance on troubleshooting BGP related issues.

Sample diagram showing connectivity between Edge Nodes and TOR switches

Verify Tier-0 Gateway status on NSX-T

Status of T0 should be Success.

Check the interfaces of T0 to identify which all edge nodes are part of it.

Check the status of Edge Transport Nodes.

As you can see from the T0 interfaces, Edge01/02/03/04 are part of it and in those edge nodes you should be able to see the SR_TIER0 component. Next step is to login to those Edge nodes that are part of T0 and verify BGP summary.

Verify BGP on all Edge nodes that are part of T0 Gateway

SSH into the edge node as admin user.

get logical-router

Look for SERVICE_ROUTER_TIER0.

sc2-01-nsxt04-r08edge02> get logical-router
Logical Router
UUID                                   VRF    LR-ID  Name                              Type                        Ports   Neighbors
736a80e3-23f6-5a2d-81d6-bbefb2786666   0      0                                        TUNNEL                      4       22/5000
e6d02207-c51e-4cf8-81a6-44afec5ad277   2      84653  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1    5       9/50000
a590f1da-2d79-4749-8153-7b174d23b069   32     85271  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1    5       5/50000
758d9736-6781-4b3a-906f-3d1b03f0924d   33     88016  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1    4       1/50000
5e7bfe98-0b5e-4620-90b1-204634e99127   37     3      SR-sc2-01-nsxt04-tr               SERVICE_ROUTER_TIER0        6       5/50000

vrf <SERVICE_ROUTER_TIER0 VRF>

get bgp neighbor summary

Note: If everything is working fine State should show Estab.

sc2-01-nsxt04-r08edge02> vrf 37
sc2-01-nsxt04-r08edge02(tier0_sr[37])> get bgp neighbor summary
BFD States: NC - Not configured, DC - Disconnected
            AD - Admin down, DW - Down, IN - Init, UP - Up
BGP summary information for VRF default for address-family: ipv4Unicast
Router ID: 10.184.248.2  Local AS: 4259971071

Neighbor                            AS          State Up/DownTime  BFD InMsgs  OutMsgs InPfx  OutPfx

10.184.248.239                      4259970544  Estab 05w1d22h     NC  12641393 12610093 2      568
10.184.248.240                      4259970544  Estab 05w1d23h     NC  12640337 11580431 2      566

You should be able to ping to the BGP neighbor IP. If you are unable to ping to neighbor IPs, then there is an issue.

sc2-01-nsxt04-r08edge02(tier0_sr[37])> ping 10.184.248.239
PING 10.184.248.239 (10.184.248.239): 56 data bytes
64 bytes from 10.184.248.239: icmp_seq=0 ttl=255 time=1.788 ms
^C
--- 10.184.248.239 ping statistics ---
2 packets transmitted, 1 packets received, 50.0% packet loss
round-trip min/avg/max/stddev = 1.788/1.788/1.788/0.000 ms

sc2-01-nsxt04-r08edge02(tier0_sr[37])> ping 10.184.248.240
PING 10.184.248.240 (10.184.248.240): 56 data bytes
64 bytes from 10.184.248.240: icmp_seq=0 ttl=255 time=1.925 ms
64 bytes from 10.184.248.240: icmp_seq=1 ttl=255 time=1.251 ms
^C
--- 10.184.248.240 ping statistics ---
3 packets transmitted, 2 packets received, 33.3% packet loss
round-trip min/avg/max/stddev = 1.251/1.588/1.925/0.337 ms

Get interfaces | more

sc2-01-nsxt04-r08edge02> vrf 37
sc2-01-nsxt04-r08edge02(tier0_sr[37])> get interfaces | more
Fri Aug 19 2022 UTC 11:07:18.042
Logical Router
UUID                                   VRF    LR-ID  Name                              Type
5e7bfe98-0b5e-4620-90b1-204634e99127   37     3      SR-sc2-01-nsxt04-tr               SERVICE_ROUTER_TIER0
Interfaces (IPv6 DAD Status A-DAD_Success, F-DAD_Duplicate, T-DAD_Tentative, U-DAD_Unavailable)
    Interface     : dd83554d-47c0-5a4e-9fbe-3abb1239a071
    Ifuid         : 335
    Mode          : cpu
    Port-type     : cpu
    Enable-mcast  : false

    Interface     : 008b2b15-17d1-4cc8-9d94-d9c4c2d0eb3a
    Ifuid         : 1000
    Name          : tr-interconnect-edge02
    Fwd-mode      : IPV4_AND_IPV6
    Internal name : uplink-1000
    Mode          : lif
    Port-type     : uplink
    IP/Mask       : 10.184.248.2/24
    MAC           : 02:00:70:51:9d:79
    VLAN          : 1611

Verify BGP on Cisco TOR switches

SSH to TOR switch.

show ip bgp summary

❯ ssh -o PubkeyAuthentication=no netadmin@sc2-01-r08lswa.xxxxxxxx.com
User Access Verification
(netadmin@sc2-01-r08lswa.xxxxxxxx.com) Password:

Cisco Nexus Operating System (NX-OS) Software

sc2-01-r08lswa# show ip bgp summary
BGP summary information for VRF default, address family IPv4 Unicast
BGP router identifier 10.184.17.248, local AS number 65001.65008
BGP table version is 520374, IPv4 Unicast config peers 10, capable peers 8
5150 network entries and 11372 paths using 2003240 bytes of memory
BGP attribute entries [110/18920], BGP AS path entries [69/1430]
BGP community entries [0/0], BGP clusterlist entries [0/0]
11356 received paths for inbound soft reconfiguration
11356 identical, 0 modified, 0 filtered received paths using 0 bytes

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.184.10.14    4 65011.65000
                        47979514 10570342   520374    0    0     5w1d 4541
10.184.10.78    4 65011.65000
                        47814555 10601750   520374    0    0     5w1d 4541
10.184.248.1    4 65001.65535
                          80831   79447   520374    0    0 02:41:51 566
10.184.248.2    4 65001.65535
                        3215614 3269391   520374    0    0     5w1d 566
10.184.248.3    4 65001.65535
                        3215776 3269344   520374    0    0     1w3d 566
10.184.248.4    4 65001.65535
                        3215676 3269383   520374    0    0 13:51:45 566
10.184.248.5    4 65001.65535
                        3200531 3269384   520374    0    0     5w1d 5
10.184.248.6    4 65001.65535
                        3197752 3266700   520374    0    0     5w1d 5

show ip arp

sc2-01-r08lswa# show ip arp 10.184.248.2

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       CP - Added via L2RIB, Control plane Adjacencies
       PS - Added via L2RIB, Peer Sync
       RO - Re-Originated Peer Sync Entry
       D - Static Adjacencies attached to down interface

IP ARP Table
Total number of entries: 1
Address         Age       MAC Address     Interface       Flags
10.184.248.2    00:06:12  0200.7051.9d79  Vlan1611

If you compare this IP and MAC, you can see that its the same of your T0 SR uplink of your edge02 node.

IP/Mask       : 10.184.248.2/24
MAC           : 02:00:70:51:9d:79

For further troubleshooting you can do packet capture from the edge nodes and ESXi server and analyze them using Wireshark.

Packet capture from Edge node

Capture packets from the T0 SR uplink interface.

sc2-01-nsxt04-r08edge01(tier0_sr[5])> get interfaces | more
Wed Aug 17 2022 UTC 13:52:48.203
Logical Router
UUID                                   VRF    LR-ID  Name                              Type
fb1ad846-8757-4fdf-9cbb-5c22ba772b52   5      2      SR-sc2-01-nsxt04-tr               SERVICE_ROUTER_TIER0
Interfaces (IPv6 DAD Status A-DAD_Success, F-DAD_Duplicate, T-DAD_Tentative, U-DAD_Unavailable)
    Interface     : c8b80ba1-93fc-5c82-a44f-4f4863b6413c
    Ifuid         : 286
    Mode          : cpu
    Port-type     : cpu
    Enable-mcast  : false

    Interface     : 4915d978-9c9a-58bc-84e2-cafe5442cba4
    Ifuid         : 287
    Mode          : blackhole
    Port-type     : blackhole

    Interface     : 899bcf30-83e2-46bb-9be2-8889ec52b354
    Ifuid         : 833
    Name          : tr-interconnect-edge01
    Fwd-mode      : IPV4_AND_IPV6
    Internal name : uplink-833
    Mode          : lif
    Port-type     : uplink
    IP/Mask       : 10.184.248.1/24
    MAC           : 02:00:70:d1:92:b1
    VLAN          : 1611
    Access-VLAN   : untagged
    LS port       : 15b971e9-7caa-43b7-86c1-96ff50453402
    Urpf-mode     : STRICT_MODE
    DAD-mode      : LOOSE
    RA-mode       : SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
    Admin         : up
    Op_state      : up
    Enable-mcast  : False
    MTU           : 9000
    arp_proxy     :

Start a continuous ping from the TOR switches to the edge uplink IP (in this case ping 10.184.248.1 from TOR switches) before starting packet capture.

sc2-01-nsxt04-r08edge01> start capture interface 899bcf30-83e2-46bb-9be2-8889ec52b354 file uplink.pcap

Note:
Find the location of uplink.pcap file on TOR switches and SCP it locally to analyze using Wireshark.

Packet capture from ESXi

In this example, we are capturing packets of sc2-01-nsxt04-r08edge01 VM from the switchports where its interfaces are connected. sc2-01-nsxt04-r08edge01 VM is running on ESXi node sc2-01-r08esx10.

[root@sc2-01-r08esx10:~] esxcli network vm list | grep edge
18790721  sc2-01-nsxt04-r08edge05                                                 3  , ,
18977245  sc2-01-nsxt04-r08edge01                                                 3  , ,

[root@sc2-01-r08esx10:/tmp] esxcli network vm port list -w 18977245
   Port ID: 67109446
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: b60a80c0-ecd6-40bd-8d2b-fbd1f06bb172
   MAC Address: 02:00:70:33:a9:67
   IP Address: 0.0.0.0
   Team Uplink: vmnic1
   Uplink Port ID: 2214592517
   Active Filters:

   Port ID: 67109447
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: 6e3d8057-fc23-4180-b0ba-bed90381f0bf
   MAC Address: 02:00:70:d1:92:b1
   IP Address: 0.0.0.0
   Team Uplink: vmnic1
   Uplink Port ID: 2214592517
   Active Filters:

   Port ID: 67109448
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: c531df19-294d-4079-b39c-89a3b58e30ad
   MAC Address: 02:00:70:30:c7:01
   IP Address: 0.0.0.0
   Team Uplink: vmnic0
   Uplink Port ID: 2214592519
   Active Filters:

Start a continuous ping from the TOR switches to the edge uplink IP (in this case ping 10.184.248.1 from TOR switches) before starting packet capture.

[root@sc2-01-r08esx10:/tmp] pktcap-uw --switchport 67109446 --dir 2 -o /tmp/67109446-02:00:70:33:a9:67.pcap --count 1000 & pktcap-uw --switchport 67109447 --dir 2 -o /tmp/67109447-02:00:70:d1:92:b1.pcap --count 1000 & pktcap-uw --switchport 67109448 --dir 2 -o /tmp/67109448-02:00:70:30:c7:01.pcap --count 1000

Note:
SCP the pcap files to laptop and use Wireshark to analyse them.
You can also do packet capture from physical uplinks (vmnic) of the ESXi node if required.

Hope it was useful. Cheers!

Sunday, June 23, 2024

vSphere with Tanzu using NSX-T - Part31 - Troubleshooting inaccessible TKC with expired control plane certs

In the course of managing multiple Tanzu Kubernetes Clusters (TKC), I encountered an unexpected issue: the control plane certificates had expired, preventing us from accessing the cluster using the kubeconfig file. To make matters worse, we were unable to SSH into the TKC control plane Virtual Machines (VMs) due to the vmware-system-user password expiring in accordance with STIG Hardening.

The recommended workaround for updating the vmware-system-user password expiry involves applying a specific daemonset on Guest Clusters. However, this approach requires access to the TKC using its admin kubeconfig file, which was unavailable due to the expired certificates.

Warning: In case of critical production issues that affect the accessibility of your Tanzu Kubernetes Cluster (TKC), it is strongly advised to submit a product support request to our team for assistance. This will ensure that you receive expert guidance and a timely resolution to help minimize the impact on your environment.

To resolve this issue, I followed an alternative workaround: I reset the root password of the TKC control plane VMs through the vCenter VM console, as outlined in this knowledge base article. Once the root password was reset, I was able to log directly into the TKC control plane VM using the VM console.

After gaining access to the TKC control plane VM, I proceeded to renew the control plane certificates using kubeadm, as detailed in this blog post. It's essential to apply this process to all control plane nodes in your cluster to ensure proper functionality.

root [ /etc/kubernetes ]# kubeadm certs check-expiration

root [ /etc/kubernetes ]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Although this workaround required some additional steps, it ultimately allowed us to regain access to our Tanzu Kubernetes Cluster and maintain its security and functionality.

Hope it was useful. Cheers!

Saturday, May 25, 2024

vSphere with Tanzu using NSX-T - Part30 - Troubleshooting inaccessible TKC with server pool members missing in the LB VS

Encountering issues with connectivity to your TKC apiserver/ control plane can be frustrating. One common problem we've seen is the kubeconfig failing to connect, often due to missing server pool members in the load balancer's virtual server (LB VS).

The Issue

The LB VS, which operates on port 6443, should have the control plane VMs listed as its member servers. When these members are missing, connectivity problems arise, disrupting your access to the TKC apiserver.

Troubleshooting steps

Access the TKC: Use the kubeconfig to access the TKC.

❯ KUBECONFIG=tkc.kubeconfig kubectl get node
Unable to connect to the server: dial tcp 10.191.88.4:6443: i/o timeout
❯

Check the Load Balancer: In NSX-T, verify the status of the corresponding load balancer (LB). It may display a green status indicating success.
Inspect Virtual Servers: Check the virtual servers in the LB, particularly on port 6443. They might show as down.
Examine Server Pool Members: Look into the server pool members of the virtual server. You may find it empty.
SSH to Control Plane Nodes: Attempt to SSH into the TKC control plane nodes.

Run Diagnostic Commands: Execute diagnostic commands inside the control plane nodes to verify their status. The issue could be that the control plane VMs are in a hung state, and the container runtime is not running.

vmware-system-user@tkc-infra-r68zc-jmq4j [ ~ ]$ sudo su
root [ /home/vmware-system-user ]# crictl ps
FATA[0002] failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl is-active containerd
Failed to retrieve unit state: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root [ /home/vmware-system-user ]#
root [ /home/vmware-system-user ]# systemctl status containerd
WARNING: terminal is not fully functional
-  (press RETURN)Failed to get properties: Failed to activate service 'org.freedesktop.systemd1'>
lines 1-1/1 (END)lines 1-1/1 (END)

Check VM Console: From vCenter, check the console of the control plane VMs. You might see specific errors indicating issues.

EXT4-fs (sda3): Delayed block allocation failed for inode 266704 at logical offset 10515 with max blocks 2 with error 5
EXT4-fs (sda3): This should not happen!! Data will be lost
EXT4-fs error (device sda3) in ext4_writepages:2905: IO failure
EXT4-fs error (device sda3) in ext4_reserve_inode_write:5947: Journal has aborted
EXT4-fs error (device sda3) xxxxxx-xxx-xxxx: unable to read itable block
EXT4-fs error (device sda3) in ext4_journal_check_start:61: Detected aborted journal
systemd[1]: Caught <BUS>, dumped core as pid 24777.
systemd[1]: Freezing execution.

Restart Control Plane VMs: Restart the control plane VMs. Note that sometimes your admin credentials or administrator@vsphere.local credentials may not allow you to restart the TKC VMs. In such cases, decode the username and password from the relevant secret and use these credentials to connect to vCenter and restart the hung TKC VMs.

❯ kubectx wdc-01-vc17
Switched to context "wdc-01-vc17".
❯
❯ kg secret -A | grep wcp
kube-system                                 wcp-authproxy-client-secret                                               kubernetes.io/tls                                  3      291d
kube-system                                 wcp-authproxy-root-ca-secret                                              kubernetes.io/tls                                  3      291d
kube-system                                 wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-nsop                          wcp-nsop-sa-vc-auth                                                       Opaque                                             2      291d
vmware-system-nsx                           wcp-cluster-credentials                                                   Opaque                                             2      291d
vmware-system-vmop                          wcp-vmop-sa-vc-auth                                                       Opaque                                             2      291d
❯
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth
NAME                  TYPE     DATA   AGE
wcp-vmop-sa-vc-auth   Opaque   2      291d
❯ kg secrets -n vmware-system-vmop wcp-vmop-sa-vc-auth -oyaml
apiVersion: v1
data:
  password: aWAmbHUwPCpKe1Uxxxxxxxxxxxx=
  username: d2NwLXZtb3AtdXNlci1kb21haW4tYzEwMDYtMxxxxxxxxxxxxxxxxxxxxxxxxQHZzcGhlcmUubG9jYWw=
kind: Secret
metadata:
  creationTimestamp: "2022-10-24T08:32:26Z"
  name: wcp-vmop-sa-vc-auth
  namespace: vmware-system-vmop
  resourceVersion: "336557268"
  uid: dcbdac1b-18bb-438c-ba11-76ed4d6bef63
type: Opaque
❯

***Decrypt the username and password from the secret and use it to connect to the vCenter.
***Following is an example using PowerCLI:

PS /Users/vineetha> get-vm gc-control-plane-f266h

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VMGuest
Restart-VMGuest: 08/04/2023 22:20:20	Restart-VMGuest		Operation "Restart VM guest" failed for VM "gc-control-plane-f266h" for the following reason: A general system error occurred: Invalid fault
PS /Users/vineetha>
PS /Users/vineetha> get-vm gc-control-plane-f266h | Restart-VM

Confirm
Are you sure you want to perform this action?
Performing the operation "Restart-VM" on target "VM 'gc-control-plane-f266h'".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): Y

Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
gc-control-plane-f2… PoweredOn  2        4.000

PS /Users/vineetha>

Verify System Pods and Connectivity: Once the control plane VMs are restarted, the system pods inside them will start, and the apiserver will become accessible using the kubeconfig. You should also see the previously missing server pool members reappear in the corresponding LB virtual server, and the virtual server on port 6443 will be up and show a success status.

Following these steps should help you resolve the connectivity issues with your TKC apiserver/control plane effectively.Ensuring that your load balancer's virtual server is correctly configured with the appropriate member servers is crucial for maintaining seamless access. This runbook aims to guide you through the process, helping you get your TKC apiserver back online swiftly.

Note: If required for critical production issues related to TKC accessibility I strongly recommend to raise a product support request.

Hope it was useful. Cheers!

Saturday, April 20, 2024

Hugging Face - Part5 - Deploy your LLM app on Kubernetes

In our previous blog post, we explored the process of containerizing the Large Language Model (LLM) from Hugging Face using FastAPI and Docker. The next step is deploying this containerized application on a Kubernetes cluster. Additionally, I'll share my observations and insights gathered during this exercise.

You can access the deployment yaml spec and detailed instructions in my GitHub repo:

https://github.com/vineethac/huggingface/tree/main/6-deploy-on-k8s

Requirements

I am using a Tanzu Kubernetes Cluster (TKC).

Each node is of size best-effort-2xlarge which has 8 vCPU and 64Gi of memory.

❯ KUBECONFIG=gckubeconfig k get node
NAME                                             STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-49jx4                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-control-plane-m8wmt                        Ready    control-plane,master   105d   v1.23.8+vmware.3
tkc01-control-plane-z6gxx                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-8gjn8   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-c9nfq   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-cngff   Ready    <none>                 21d    v1.23.8+vmware.3
❯

I've attached 256Gi storage volumes to the worker nodes that is mounted at /var/lib/containerd. The worker nodes on which these llm pods are running should have enough storage space. Otherwise you may notice these pods getting stuck/ restarting/ unknownstatus. If the worker nodes run out of the storage disk space, you will see pods getting evicted with warnings The node was low on resource: ephemeral-storage. TKC spec is available in the above mentioned Git repo.

Deployment

The deployment and service yaml spec are given in fastapi-llm-app-deploy-cpu.yaml.

This works on a CPU powered Kubernetes cluster. Additional configurations might be required if you want to run this on a GPU powered cluster.

We have already instrumented the Readiness and Liveness functionality in the LLM app itself.

The readiness probe invokes the /healthz endpoint exposed by the FastAPI app. This will make sure the FastAPI itself is healthy/ responding to the API calls.

The liveness probe invokes liveness.py script within the app. The script invokes the /ask endpoint which interacts with the LLM and returns the response. This will make sure the LLM is responding to the user queries. For some reason if the llm is not responding/ hangs, the liveness probe will fail and eventually it will restart the container.

You can apply the deployment yaml spec as follows:

❯ KUBECONFIG=gckubeconfig k apply -f fastapi-llm-app-deploy-cpu.yaml

Validation

❯ KUBECONFIG=gckubeconfig k get deploy fastapi-llm-app
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
fastapi-llm-app   2/2     2            2           21d
❯
❯ KUBECONFIG=gckubeconfig k get pods | grep fastapi-llm-app
fastapi-llm-app-758c7c58f7-79gmq                               1/1     Running   1 (71m ago)    13d
fastapi-llm-app-758c7c58f7-gqdc6                               1/1     Running   1 (99m ago)    13d
❯
❯ KUBECONFIG=gckubeconfig k get svc fastapi-llm-app
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
fastapi-llm-app   LoadBalancer   10.110.228.33   10.216.24.104   5000:30590/TCP   5h24m
❯

Now you can just do a curl against the EXTERNAL-IP of the above mentioned fastapi-llm-app service.

❯ curl http://10.216.24.104:5000/ask -X POST -H "Content-Type: application/json" -d '{"text":"list comprehension examples in python"}'

In our next blog post, we'll try enhancing our FastAPI application with robust instrumentation. Specifically, we'll explore the process of integrating FastAPI metrics into our application, allowing us to gain valuable insights into its performance and usage metrics. Furthermore, we'll take a look at incorporating traces using OpenTelemetry, a powerful tool for distributed tracing and observability in modern applications. By leveraging OpenTelemetry, we'll be able to gain comprehensive visibility into the behavior of our application across distributed systems, enabling us to identify performance bottlenecks and optimize resource utilization.

Stay tuned for an insightful exploration of FastAPI metrics instrumentation and OpenTelemetry integration in our upcoming blog post!

Hope it was useful. Cheers!

Monday, January 15, 2024

Ollama - Part1 - Deploy Ollama on Kubernetes

Docker published GenAI stack around Oct 2023 which consists of large language models (LLMs) from Ollama, vector and graph databases from Neo4j, and the LangChain framework. These utilities can help developers with the resources they need to kick-start creating new applications using generative AI. Ollama can be used to deploy and run LLMs locally. In this exercise we will deploy Ollama to a Kubernetes cluster and prompt it.

In my case I am using a Tanzu Kubernetes Cluster (TKC) running on vSphere with Tanzu 7u3 platform powered by Dell PowerEdge R640 servers. The TKC nodes are using best-effort-2xlarge vmclass with 8 CPU and 64Gi Memory. Note that I am running it on a regular Kubernetes cluster without GPU. If you have GPU, additional configuration steps might be required.

Full project in my GitHub

https://github.com/vineethac/Ollama/tree/main/ollama_on_kubernetes

Hope it was useful. Cheers!

Saturday, December 30, 2023

GitOps using Argo CD - Part2 - Mini project

In the previous blog post, we discussed deploying Argo CD on a Kubernetes cluster and explored the fundamentals of application management. This time, we'll leverage Argo CD to deploy the applications from our Kubernetes mini project.

Full project in my GitHub

https://github.com/vineethac/Kubernetes/tree/main/gitops-argocd

Following are the different components of the project that will get deployed on to a Kubernetes cluster using the Argo CD application resource:

Ingress controller
Prometheus stack
FastAPI web app
FastAPI service monitor
Loki stack

Deploy each of these components by applying the corresponding YAML manifest, following the outlined steps in the GitHub repository mentioned above. After the successful deployment of all components, you can observe them in the Argo CD web UI, as illustrated below.

Hope it was useful. Cheers!

Sunday, December 3, 2023

Kubernetes mini project

In this mini project, we are going to learn the following:

Deploy a simple Python based web application on a Kubernetes cluster.
We will use Helm to deploy this app.
This web app uses FastAPI and exposes some metrics using the Prometheus Python client.
To store and visualize these metrics we will deploy Prometheus and Grafana in the K8s cluster.
We will also deploy and use an ingress controller for exposing the web app, Prometheus, and Grafana to external users.
For logging we will deploy and use Grafana Loki stack.

Full project in my GitHub

https://github.com/vineethac/Kubernetes/tree/main/mini-project

High-level steps to complete this project

Step1: Write the Python app.

Step2: Create the Dockerfile for the app.

Step3: Create the container image.

Step4: Push the container image to an image registry like Docker Hub.

Step5: Get access to a K8s cluster.

Step6: Deploy an ingress controller.

Step7: Create the Helm chart for your app and deploy it to the K8s cluster.

Step8: Deploy Prometheus stack on the K8s cluster using Helm.

Step9: Create a servicemonitor resource which defines the target to be monitored by Prometheus.

Step10: Verify targets and service discovery in Prometheus.

Step11: Configure Grafana dashboard and verify.

Step12. Deploy Grafana Loki stack using Helm.

Hope it was useful. Cheers!

Saturday, November 18, 2023

vSphere with Tanzu using NSX-T - Part29 - Logging using Loki stack

Grafana Loki is a log aggregation system that we can use for Kubernetes. In this post we will deploy Loki stack on a Tanzu Kubernetes cluster.

❯ KUBECONFIG=gc.kubeconfig kg no
NAME                                            STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-k8fzb                       Ready    control-plane,master   144m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-4n5kh   Ready    <none>                 132m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-8pcc6   Ready    <none>                 128m   v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-rx7jf   Ready    <none>                 134m   v1.23.8+vmware.3
❯

❯ helm repo add grafana https://grafana.github.io/helm-charts
❯ helm repo update
❯ helm repo list
❯ helm search repo loki

I saved the values file using helm show values grafana/loki-stack and made necessary modifications as mentioned below.

I enabled Grafana by setting enabled: true. This will create a new Grafana instance.
I also added a section under grafana.ingress in the loki-stack/values.yaml, that will create an ingress resource for this new Grafana instance.

Here is the values.yaml file.

test_pod:
  enabled: true
  image: bats/bats:1.8.2
  pullPolicy: IfNotPresent

loki:
  enabled: true
  isDefault: true
  url: http://{{(include "loki.serviceName" .)}}:{{ .Values.loki.service.port }}
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  datasource:
    jsonData: "{}"
    uid: ""


promtail:
  enabled: true
  config:
    logLevel: info
    serverPort: 3101
    clients:
      - url: http://{{ .Release.Name }}:3100/loki/api/v1/push

fluent-bit:
  enabled: false

grafana:
  enabled: true
  sidecar:
    datasources:
      label: ""
      labelValue: ""
      enabled: true
      maxLines: 1000
  image:
    tag: 8.3.5
  ingress:
    ## If true, Grafana Ingress will be created
    ##
    enabled: true

    ## IngressClassName for Grafana Ingress.
    ## Should be provided if Ingress is enable.
    ##
    ingressClassName: nginx

    ## Annotations for Grafana Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    ## Labels to be added to the Ingress
    ##
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enable.
    ##
    # hosts:
    #   - grafana.domain.com
    hosts:
      - grafana-loki-vineethac-poc.test.com

    ## Path for grafana ingress
    path: /

    ## TLS configuration for grafana Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
    # - secretName: grafana-general-tls
    #   hosts:
    #   - grafana.example.com

prometheus:
  enabled: false
  isDefault: false
  url: http://{{ include "prometheus.fullname" .}}:{{ .Values.prometheus.server.service.servicePort }}{{ .Values.prometheus.server.prefixURL }}
  datasource:
    jsonData: "{}"

filebeat:
  enabled: false
  filebeatConfig:
    filebeat.yml: |
      # logging.level: debug
      filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
      output.logstash:
        hosts: ["logstash-loki:5044"]

logstash:
  enabled: false
  image: grafana/logstash-output-loki
  imageTag: 1.0.1
  filters:
    main: |-
      filter {
        if [kubernetes] {
          mutate {
            add_field => {
              "container_name" => "%{[kubernetes][container][name]}"
              "namespace" => "%{[kubernetes][namespace]}"
              "pod" => "%{[kubernetes][pod][name]}"
            }
            replace => { "host" => "%{[kubernetes][node][name]}"}
          }
        }
        mutate {
          remove_field => ["tags"]
        }
      }
  outputs:
    main: |-
      output {
        loki {
          url => "http://loki:3100/loki/api/v1/push"
          #username => "test"
          #password => "test"
        }
        # stdout { codec => rubydebug }
      }

# proxy is currently only used by loki test pod
# Note: If http_proxy/https_proxy are set, then no_proxy should include the
# loki service name, so that tests are able to communicate with the loki
# service.
proxy:
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""

Deploy using Helm

❯ helm upgrade --install --atomic loki-stack grafana/loki-stack --values values.yaml --kubeconfig=gc.kubeconfig --create-namespace --namespace=loki-stack
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: gc.kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: gc.kubeconfig
Release "loki-stack" does not exist. Installing it now.
W1203 13:36:48.286498   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:48.592349   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.840670   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.849356   31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: loki-stack
LAST DEPLOYED: Sun Dec  3 13:36:45 2023
NAMESPACE: loki-stack
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a datasource in Grafana.

See http://docs.grafana.org/features/datasources/loki/ for more detail.

Verify

❯ KUBECONFIG=gc.kubeconfig kg all -n loki-stack
NAME                                     READY   STATUS    RESTARTS   AGE
pod/loki-stack-0                         1/1     Running   0          89s
pod/loki-stack-grafana-dff58c989-jdq2l   2/2     Running   0          89s
pod/loki-stack-promtail-5xmrj            1/1     Running   0          89s
pod/loki-stack-promtail-cts5j            1/1     Running   0          89s
pod/loki-stack-promtail-frwvw            1/1     Running   0          89s
pod/loki-stack-promtail-wn4dw            1/1     Running   0          89s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/loki-stack              ClusterIP   10.110.208.35    <none>        3100/TCP   90s
service/loki-stack-grafana      ClusterIP   10.104.222.214   <none>        80/TCP     90s
service/loki-stack-headless     ClusterIP   None             <none>        3100/TCP   90s
service/loki-stack-memberlist   ClusterIP   None             <none>        7946/TCP   90s

NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-stack-promtail   4         4         4       4            4           <none>          90s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-stack-grafana   1/1     1            1           90s

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-stack-grafana-dff58c989   1         1         1       90s

NAME                          READY   AGE
statefulset.apps/loki-stack   1/1     91s

❯ KUBECONFIG=gc.kubeconfig kg ing -n loki-stack
NAME                 CLASS   HOSTS                                 ADDRESS        PORTS   AGE
loki-stack-grafana   nginx   grafana-loki-vineethac-poc.test.com   10.216.24.45   80      7m16s
❯

Now in my case I've an ingress controller and dns resolution in place. If you don't have those configured, you can just port forward the loki-stack-grafana service to view the Grafana dashboard.

To get the username and password you should decode the following secret:

❯ KUBECONFIG=gc.kubeconfig kg secrets -n loki-stack loki-stack-grafana -oyaml

Login to the Grafana instance and verify the Data Sources section, and it must be already configured. Now click on explore option and use the log browser to query logs.

Hope it was useful. Cheers!

Saturday, August 5, 2023

vSphere with Tanzu using NSX-T - Part28 - Create a custom VM Class

A VM class is a template that defines CPU, memory, and reservations for VMs. If you want to create a custom vmclass you can use dcli or vSphere UI.

Following is an example using dcli:

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id best-effort-16xlarge --cpu-count 64 --memory-mb 131072

This will create a vmclass with 64 vCPUs and 128GB memory with no reservations.

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id guaranteed-16xlarge --cpu-count 64 --memory-mb 131072 --cpu-reservation 100 --memory-reservation 100

This will create a vmclass with 64 vCPUs and 128GB memory with 100% reservations.

Note: You will need to attach this newly created vmclass to a supervisor namespace to use it.

Here is the documentation reference: https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-with-tanzu-services-workloads/GUID-18C7B2E3-BCF5-488C-9C50-937E29BB0C48.html

Hope it was useful. Cheers!

Sunday, July 9, 2023

vSphere with Tanzu using NSX-T - Part27 - nullfinalizer kubectl plugin

I have seen many cases where the supervisor namespace gets stuck at Terminating phase waiting on finalization on some of its child resources. This plugin can be used for setting finalizer to null for all objects of a specified api resource under a supervisor namespace. It will be helpful in cleaning up supervisor namespaces stuck terminating phase and can be also used to clean up stale resources under a supervisor namespace.

kubectl-nullfinalizer

#!/bin/bash

Help()
{
   # Display Help
   echo "This plugin sets finalizer to null for specified resource in a namespace."
   echo "Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME"
   echo "Example: kubectl nullfinalizer vineetha-svns01 pvc"
}

# Get the options
while getopts ":h" option; do
   case $option in
      h) # display Help
         Help
         exit;;
     \?) # incorrect option
         echo "Error: Invalid option"
         exit;;
   esac
done

kubectl get -n $1 $2 --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n $1 $2 {} -p '{"metadata":{"finalizers": null}}' --type=merge

Usage

Place the plugin in the system executable path.
I placed it in $HOME/.krew/bin in my laptop.
Once you copied the plugin to the proper path, you can make it executable by: chmod 755 kubectl-nullfinalizer .
After that you should be able to run the plugin as: kubectl nullfinalizer SUPERVISORNAMESPACE RESOURCENAME .

Example

Following is an exmaple of a supervisor namespace stuck at Terminating phase. While describe you can see that it is waiting on finalization.

❯ k config current-context
wdc-08-vc07
❯ kg ns svc-sct-bot-dogfooding
NAME                     STATUS        AGE
svc-sct-bot-dogfooding   Terminating   584d
❯
❯ kg ns svc-sct-bot-dogfooding -oyaml

status:
  conditions:
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: 'Some resources are remaining: clusters.cluster.x-k8s.io has 1 resource
      instances, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io has 1 resource
      instances, machines.cluster.x-k8s.io has 4 resource instances, persistentvolumeclaims.
      has 9 resource instances, projects.registryagent.vmware.com has 1 resource instances,
      tanzukubernetesclusters.run.tanzu.vmware.com has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2023-09-26T04:45:21Z"
    message: 'Some content in the namespace has finalizers remaining: cluster.cluster.x-k8s.io
      in 1 resource instances, cns.vmware.com/pvc-protection in 9 resource instances,
      controller-finalizer in 1 resource instances, kubeadm.controlplane.cluster.x-k8s.io
      in 1 resource instances, machine.cluster.x-k8s.io in 4 resource instances, tanzukubernetescluster.run.tanzu.vmware.com
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

❯ kg pvc -n svc-sct-bot-dogfooding
NAME                                 STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
gc1-workers-r9jvb-4sfjc-containerd   Terminating   pvc-0d9f4a38-86ad-41d8-ab11-08707780fd85   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc1-workers-r9jvb-szg9r-containerd   Terminating   pvc-ca6b6ec4-85fa-464c-abc6-683358994f3f   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc1-workers-r9jvb-zbdt8-containerd   Terminating   pvc-8f2b0683-ebba-46cb-a691-f79a0e94d0e2   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc2-workers-vpzl2-ffkgx-containerd   Terminating   pvc-69e64099-42c8-44b5-bef2-2737eca49c36   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc2-workers-vpzl2-hww5v-containerd   Terminating   pvc-5a909482-4c95-42c7-b55a-57372f72e75f   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc2-workers-vpzl2-stsnh-containerd   Terminating   pvc-ed7de540-72f4-4832-8439-da471bf4c892   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc3-workers-2qr4c-64xpz-containerd   Terminating   pvc-38478f19-8180-4b9b-b5a9-8c06f17d0fbc   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   510d
gc3-workers-2qr4c-dpng5-containerd   Terminating   pvc-a8b12657-10bd-4993-b08e-51b7e9b259f9   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
gc3-workers-2qr4c-wfvvd-containerd   Terminating   pvc-01c6b224-9dc0-4e03-b87e-641d4a4d0d95   70Gi       RWO            wdc-08-vc07c01-wcp-mgmt   538d
❯
❯ k nullfinalizer -h
This plugin sets finalizer to null for specified resource in a namespace.
Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME
Example: kubectl nullfinalizer vineetha-svns01 pvc
❯
❯
❯ k nullfinalizer svc-sct-bot-dogfooding pvc
persistentvolumeclaim/gc1-workers-r9jvb-4sfjc-containerd patched
persistentvolumeclaim/gc1-workers-r9jvb-szg9r-containerd patched
persistentvolumeclaim/gc1-workers-r9jvb-zbdt8-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-ffkgx-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-hww5v-containerd patched
persistentvolumeclaim/gc2-workers-vpzl2-stsnh-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-64xpz-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-dpng5-containerd patched
persistentvolumeclaim/gc3-workers-2qr4c-wfvvd-containerd patched
❯
❯
❯ kg projects.registryagent.vmware.com -n svc-sct-bot-dogfooding
NAME                     AGE
svc-sct-bot-dogfooding   584d
❯
❯ k nullfinalizer -h
This plugin sets finalizer to null for specified resource in a namespace.
Usage: kubectl nullfinalizer SVNAMESPACE RESOURCENAME
Example: kubectl nullfinalizer vineetha-svns01 pvc
❯
❯ k nullfinalizer svc-sct-bot-dogfooding projects.registryagent.vmware.com
project.registryagent.vmware.com/svc-sct-bot-dogfooding patched
❯
❯
❯ kg ns svc-sct-bot-dogfooding
Error from server (NotFound): namespaces "svc-sct-bot-dogfooding" not found
❯

Hope it was useful. Cheers!

Pages

Thursday, July 4, 2024

Monday, July 1, 2024

Thursday, June 27, 2024

Case 1: TKC Control Plane Node Connectivity Issues

Case 2: TKC Worker Node Connectivity Issues

Case 3: Load Balancer Connectivity Issues

Resolution/ work around

Wednesday, June 26, 2024

Verify Tier-0 Gateway status on NSX-T

Verify BGP on all Edge nodes that are part of T0 Gateway

Verify BGP on Cisco TOR switches

Packet capture from Edge node

Packet capture from ESXi

Sunday, June 23, 2024

Saturday, May 25, 2024

The Issue

Troubleshooting steps

Saturday, April 20, 2024

Requirements

Deployment

Validation

Monday, January 15, 2024

Full project in my GitHub

Saturday, December 30, 2023

Full project in my GitHub

Sunday, December 3, 2023

Full project in my GitHub

High-level steps to complete this project

Saturday, November 18, 2023

Deploy using Helm

Verify

Saturday, August 5, 2023

Sunday, July 9, 2023

kubectl-nullfinalizer

Usage

Example