A blog on the evolving infrastructure stack - virtualization, Kubernetes, and GPUs.
Wednesday, June 26, 2024
vSphere with Tanzu using NSX-T - Part32 - Troubleshooting BGP related issues
This article provides basic guidance on troubleshooting BGP related issues.
*Sample diagram showing connectivity between Edge Nodes and TOR switches*
Verify Tier-0 Gateway status on NSX-T
- Status of T0 should be Success.
- Check the interfaces of the T0 to identify which edge nodes are part of it.
- Check the status of Edge Transport Nodes.
- As you can see from the T0 interfaces, Edge01/02/03/04 are part of it, and on those edge nodes you should be able to see the SR_TIER0 component. The next step is to log in to the edge nodes that are part of the T0 and verify the BGP summary.
Verify BGP on all Edge nodes that are part of T0 Gateway
- SSH into the edge node as admin user.
- get logical-router
- Look for SERVICE_ROUTER_TIER0.
sc2-01-nsxt04-r08edge02> get logical-router
Logical Router
UUID                                   VRF    LR-ID  Name                              Type                      Ports   Neighbors
736a80e3-23f6-5a2d-81d6-bbefb2786666   0      0                                        TUNNEL                    4       22/5000
e6d02207-c51e-4cf8-81a6-44afec5ad277   2      84653  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1  5       9/50000
a590f1da-2d79-4749-8153-7b174d23b069   32     85271  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1  5       5/50000
758d9736-6781-4b3a-906f-3d1b03f0924d   33     88016  DR-t1-domain-c1034:1de3adfa-0ee   DISTRIBUTED_ROUTER_TIER1  4       1/50000
5e7bfe98-0b5e-4620-90b1-204634e99127   37     3      SR-sc2-01-nsxt04-tr               SERVICE_ROUTER_TIER0      6       5/50000
- vrf <SERVICE_ROUTER_TIER0 VRF>
- get bgp neighbor summary
- Note: if everything is working fine, the State should show Estab.
sc2-01-nsxt04-r08edge02> vrf 37
sc2-01-nsxt04-r08edge02(tier0_sr[37])> get bgp neighbor summary
BFD States: NC - Not configured, DC - Disconnected
            AD - Admin down, DW - Down, IN - Init, UP - Up
BGP summary information for VRF default for address-family: ipv4Unicast
Router ID: 10.184.248.2
Local AS: 4259971071

Neighbor        AS          State  Up/DownTime  BFD  InMsgs    OutMsgs   InPfx  OutPfx
10.184.248.239  4259970544  Estab  05w1d22h     NC   12641393  12610093  2      568
10.184.248.240  4259970544  Estab  05w1d23h     NC   12640337  11580431  2      566
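The State check can be scripted as well. Here is a minimal sketch assuming the column layout above; the sample rows are inlined (and the non-Estab state is fabricated for illustration) since the real command only runs on an edge node:

```shell
# Report any BGP neighbor whose State column (3rd field) is not Estab.
check_estab() {
  awk 'NF >= 9 && $3 != "Estab" { print $1, "is in state", $3 }'
}

# sample rows modeled on the 'get bgp neighbor summary' output above;
# the Activ state here is made up for illustration
check_estab <<'EOF'
10.184.248.239 4259970544 Estab 05w1d22h NC 12641393 12610093 2 568
10.184.248.240 4259970544 Activ 05w1d23h NC 12640337 11580431 2 566
EOF
```

Anything printed by this helper is a session worth investigating.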
- You should be able to ping the BGP neighbor IPs. If you are unable to ping the neighbor IPs, then there is an issue.
sc2-01-nsxt04-r08edge02(tier0_sr[37])> ping 10.184.248.239
PING 10.184.248.239 (10.184.248.239): 56 data bytes
64 bytes from 10.184.248.239: icmp_seq=0 ttl=255 time=1.788 ms
^C
--- 10.184.248.239 ping statistics ---
2 packets transmitted, 1 packets received, 50.0% packet loss
round-trip min/avg/max/stddev = 1.788/1.788/1.788/0.000 ms

sc2-01-nsxt04-r08edge02(tier0_sr[37])> ping 10.184.248.240
PING 10.184.248.240 (10.184.248.240): 56 data bytes
64 bytes from 10.184.248.240: icmp_seq=0 ttl=255 time=1.925 ms
64 bytes from 10.184.248.240: icmp_seq=1 ttl=255 time=1.251 ms
^C
--- 10.184.248.240 ping statistics ---
3 packets transmitted, 2 packets received, 33.3% packet loss
round-trip min/avg/max/stddev = 1.251/1.588/1.925/0.337 ms
- get interfaces | more
sc2-01-nsxt04-r08edge02> vrf 37
sc2-01-nsxt04-r08edge02(tier0_sr[37])> get interfaces | more
Fri Aug 19 2022 UTC 11:07:18.042
Logical Router
UUID                                   VRF    LR-ID  Name                 Type
5e7bfe98-0b5e-4620-90b1-204634e99127   37     3      SR-sc2-01-nsxt04-tr  SERVICE_ROUTER_TIER0
Interfaces (IPv6 DAD Status A-DAD_Success, F-DAD_Duplicate, T-DAD_Tentative, U-DAD_Unavailable)
    Interface     : dd83554d-47c0-5a4e-9fbe-3abb1239a071
    Ifuid         : 335
    Mode          : cpu
    Port-type     : cpu
    Enable-mcast  : false

    Interface     : 008b2b15-17d1-4cc8-9d94-d9c4c2d0eb3a
    Ifuid         : 1000
    Name          : tr-interconnect-edge02
    Fwd-mode      : IPV4_AND_IPV6
    Internal name : uplink-1000
    Mode          : lif
    Port-type     : uplink
    IP/Mask       : 10.184.248.2/24
    MAC           : 02:00:70:51:9d:79
    VLAN          : 1611
Verify BGP on Cisco TOR switches
- SSH to TOR switch.
- show ip bgp summary
❯ ssh -o PubkeyAuthentication=no netadmin@sc2-01-r08lswa.xxxxxxxx.com
User Access Verification
(netadmin@sc2-01-r08lswa.xxxxxxxx.com) Password:
Cisco Nexus Operating System (NX-OS) Software

sc2-01-r08lswa# show ip bgp summary
BGP summary information for VRF default, address family IPv4 Unicast
BGP router identifier 10.184.17.248, local AS number 65001.65008
BGP table version is 520374, IPv4 Unicast config peers 10, capable peers 8
5150 network entries and 11372 paths using 2003240 bytes of memory
BGP attribute entries [110/18920], BGP AS path entries [69/1430]
BGP community entries [0/0], BGP clusterlist entries [0/0]
11356 received paths for inbound soft reconfiguration
11356 identical, 0 modified, 0 filtered received paths using 0 bytes

Neighbor      V    AS           MsgRcvd   MsgSent  TblVer  InQ OutQ Up/Down  State/PfxRcd
10.184.10.14  4    65011.65000  47979514  10570342 520374  0   0   5w1d     4541
10.184.10.78  4    65011.65000  47814555  10601750 520374  0   0   5w1d     4541
10.184.248.1  4    65001.65535  80831     79447    520374  0   0   02:41:51 566
10.184.248.2  4    65001.65535  3215614   3269391  520374  0   0   5w1d     566
10.184.248.3  4    65001.65535  3215776   3269344  520374  0   0   1w3d     566
10.184.248.4  4    65001.65535  3215676   3269383  520374  0   0   13:51:45 566
10.184.248.5  4    65001.65535  3200531   3269384  520374  0   0   5w1d     5
10.184.248.6  4    65001.65535  3197752   3266700  520374  0   0   5w1d     5
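On NX-OS, the State/PfxRcd column shows a prefix count when the session is established and a state name (Idle, Active, etc.) when it is not, which makes down peers easy to spot mechanically. A small sketch with inlined sample rows (the Idle row is fabricated for illustration):

```shell
# Print NX-OS BGP peers whose last column is not a numeric prefix count,
# i.e. sessions that are not Established.
down_peers() {
  awk '$1 ~ /^[0-9]+\./ && $NF !~ /^[0-9]+$/ { print $1, "state:", $NF }'
}

down_peers <<'EOF'
10.184.248.2 4 65001.65535 3215614 3269391 520374 0 0 5w1d 566
10.184.248.5 4 65001.65535 3200531 3269384 520374 0 0 5w1d Idle
EOF
```

An empty result means every configured peer is exchanging prefixes.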
- show ip arp
sc2-01-r08lswa# show ip arp 10.184.248.2
Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       CP - Added via L2RIB, Control plane Adjacencies
       PS - Added via L2RIB, Peer Sync
       RO - Re-Originated Peer Sync Entry
       D - Static Adjacencies attached to down interface

IP ARP Table
Total number of entries: 1
Address         Age       MAC Address     Interface  Flags
10.184.248.2    00:06:12  0200.7051.9d79  Vlan1611
- If you compare this IP and MAC, you can see that it's the same as the T0 SR uplink of your edge02 node:

IP/Mask : 10.184.248.2/24
MAC     : 02:00:70:51:9d:79
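Note that NX-OS prints the MAC in dotted notation while NSX uses colons. Normalizing both to bare hex makes the comparison mechanical; a minimal sketch:

```shell
# Normalize a MAC to bare lowercase hex so the NX-OS dotted format
# (0200.7051.9d79) and the NSX colon format (02:00:70:51:9d:79) compare equal.
norm_mac() { printf '%s' "$1" | tr -d ':.' | tr 'A-F' 'a-f'; }

if [ "$(norm_mac '02:00:70:51:9d:79')" = "$(norm_mac '0200.7051.9d79')" ]; then
  echo "match: ARP entry belongs to the edge02 T0 SR uplink"
fi
```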
For further troubleshooting, you can capture packets from the edge nodes and the ESXi server and analyze them using Wireshark.
Packet capture from Edge node
- Capture packets from the T0 SR uplink interface.
sc2-01-nsxt04-r08edge01(tier0_sr[5])> get interfaces | more
Wed Aug 17 2022 UTC 13:52:48.203
Logical Router
UUID                                   VRF    LR-ID  Name                 Type
fb1ad846-8757-4fdf-9cbb-5c22ba772b52   5      2      SR-sc2-01-nsxt04-tr  SERVICE_ROUTER_TIER0
Interfaces (IPv6 DAD Status A-DAD_Success, F-DAD_Duplicate, T-DAD_Tentative, U-DAD_Unavailable)
    Interface     : c8b80ba1-93fc-5c82-a44f-4f4863b6413c
    Ifuid         : 286
    Mode          : cpu
    Port-type     : cpu
    Enable-mcast  : false

    Interface     : 4915d978-9c9a-58bc-84e2-cafe5442cba4
    Ifuid         : 287
    Mode          : blackhole
    Port-type     : blackhole

    Interface     : 899bcf30-83e2-46bb-9be2-8889ec52b354
    Ifuid         : 833
    Name          : tr-interconnect-edge01
    Fwd-mode      : IPV4_AND_IPV6
    Internal name : uplink-833
    Mode          : lif
    Port-type     : uplink
    IP/Mask       : 10.184.248.1/24
    MAC           : 02:00:70:d1:92:b1
    VLAN          : 1611
    Access-VLAN   : untagged
    LS port       : 15b971e9-7caa-43b7-86c1-96ff50453402
    Urpf-mode     : STRICT_MODE
    DAD-mode      : LOOSE
    RA-mode       : SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
    Admin         : up
    Op_state      : up
    Enable-mcast  : False
    MTU           : 9000
    arp_proxy     :
- Start a continuous ping from the TOR switches to the edge uplink IP (in this case ping 10.184.248.1 from TOR switches) before starting packet capture.
sc2-01-nsxt04-r08edge01> start capture interface 899bcf30-83e2-46bb-9be2-8889ec52b354 file uplink.pcap
Note:
Find the location of the uplink.pcap file on the edge node and SCP it locally to analyze using Wireshark.
Packet capture from ESXi
- In this example, we are capturing packets of the sc2-01-nsxt04-r08edge01 VM from the switchports where its interfaces are connected. The sc2-01-nsxt04-r08edge01 VM is running on the ESXi node sc2-01-r08esx10.
[root@sc2-01-r08esx10:~] esxcli network vm list | grep edge
18790721  sc2-01-nsxt04-r08edge05  3  ,  ,
18977245  sc2-01-nsxt04-r08edge01  3  ,  ,

[root@sc2-01-r08esx10:/tmp] esxcli network vm port list -w 18977245
   Port ID: 67109446
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: b60a80c0-ecd6-40bd-8d2b-fbd1f06bb172
   MAC Address: 02:00:70:33:a9:67
   IP Address: 0.0.0.0
   Team Uplink: vmnic1
   Uplink Port ID: 2214592517
   Active Filters:

   Port ID: 67109447
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: 6e3d8057-fc23-4180-b0ba-bed90381f0bf
   MAC Address: 02:00:70:d1:92:b1
   IP Address: 0.0.0.0
   Team Uplink: vmnic1
   Uplink Port ID: 2214592517
   Active Filters:

   Port ID: 67109448
   vSwitch: sc2-01-vc16-dvs
   Portgroup:
   DVPort ID: c531df19-294d-4079-b39c-89a3b58e30ad
   MAC Address: 02:00:70:30:c7:01
   IP Address: 0.0.0.0
   Team Uplink: vmnic0
   Uplink Port ID: 2214592519
   Active Filters:
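With several vNICs, it helps to pair each Port ID with its MAC address before deciding which switchports to capture on. A hedged awk sketch over the output format above (sample lines inlined; the real command runs on the ESXi host):

```shell
# Pair each 'Port ID' with the 'MAC Address' that follows it in
# 'esxcli network vm port list' output. The anchored regex deliberately
# skips the 'Uplink Port ID' lines.
port_macs() {
  awk -F': ' '/^ *Port ID:/ {id=$2} /^ *MAC Address:/ {print id, $2}'
}

port_macs <<'EOF'
Port ID: 67109447
vSwitch: sc2-01-vc16-dvs
MAC Address: 02:00:70:d1:92:b1
Uplink Port ID: 2214592517
EOF
```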
- Start a continuous ping from the TOR switches to the edge uplink IP (in this case ping 10.184.248.1 from TOR switches) before starting packet capture.
[root@sc2-01-r08esx10:/tmp] pktcap-uw --switchport 67109446 --dir 2 -o /tmp/67109446-02:00:70:33:a9:67.pcap --count 1000 &
pktcap-uw --switchport 67109447 --dir 2 -o /tmp/67109447-02:00:70:d1:92:b1.pcap --count 1000 &
pktcap-uw --switchport 67109448 --dir 2 -o /tmp/67109448-02:00:70:30:c7:01.pcap --count 1000
Note:
SCP the pcap files to your laptop and use Wireshark to analyze them.
You can also capture packets from the physical uplinks (vmnic) of the ESXi node if required.
Hope it was useful. Cheers!
Saturday, November 18, 2023
vSphere with Tanzu using NSX-T - Part29 - Logging using Loki stack
Grafana Loki is a log aggregation system that we can use for Kubernetes. In this post we will deploy Loki stack on a Tanzu Kubernetes cluster.
❯ KUBECONFIG=gc.kubeconfig kg no
NAME STATUS ROLES AGE VERSION
tkc01-control-plane-k8fzb Ready control-plane,master 144m v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-4n5kh Ready <none> 132m v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-8pcc6 Ready <none> 128m v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-76d555c9-rx7jf Ready <none> 134m v1.23.8+vmware.3
❯ helm repo add grafana https://grafana.github.io/helm-charts
❯ helm repo update
❯ helm repo list
❯ helm search repo loki

I saved the values file using helm show values grafana/loki-stack and made the necessary modifications as mentioned below.
- I enabled Grafana by setting enabled: true. This will create a new Grafana instance.
- I also added a section under grafana.ingress in the loki-stack/values.yaml, which will create an ingress resource for this new Grafana instance.
Here is the values.yaml file.
test_pod:
enabled: true
image: bats/bats:1.8.2
pullPolicy: IfNotPresent
loki:
enabled: true
isDefault: true
url: http://{{(include "loki.serviceName" .)}}:{{ .Values.loki.service.port }}
readinessProbe:
httpGet:
path: /ready
port: http-metrics
initialDelaySeconds: 45
livenessProbe:
httpGet:
path: /ready
port: http-metrics
initialDelaySeconds: 45
datasource:
jsonData: "{}"
uid: ""
promtail:
enabled: true
config:
logLevel: info
serverPort: 3101
clients:
- url: http://{{ .Release.Name }}:3100/loki/api/v1/push
fluent-bit:
enabled: false
grafana:
enabled: true
sidecar:
datasources:
label: ""
labelValue: ""
enabled: true
maxLines: 1000
image:
tag: 8.3.5
ingress:
## If true, Grafana Ingress will be created
##
enabled: true
## IngressClassName for Grafana Ingress.
## Should be provided if Ingress is enable.
##
ingressClassName: nginx
## Annotations for Grafana Ingress
##
annotations: {}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: "true"
## Labels to be added to the Ingress
##
labels: {}
## Hostnames.
## Must be provided if Ingress is enable.
##
# hosts:
# - grafana.domain.com
hosts:
- grafana-loki-vineethac-poc.test.com
## Path for grafana ingress
path: /
## TLS configuration for grafana Ingress
## Secret must be manually created in the namespace
##
tls: []
# - secretName: grafana-general-tls
# hosts:
# - grafana.example.com
prometheus:
enabled: false
isDefault: false
url: http://{{ include "prometheus.fullname" .}}:{{ .Values.prometheus.server.service.servicePort }}{{ .Values.prometheus.server.prefixURL }}
datasource:
jsonData: "{}"
filebeat:
enabled: false
filebeatConfig:
filebeat.yml: |
# logging.level: debug
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
output.logstash:
hosts: ["logstash-loki:5044"]
logstash:
enabled: false
image: grafana/logstash-output-loki
imageTag: 1.0.1
filters:
main: |-
filter {
if [kubernetes] {
mutate {
add_field => {
"container_name" => "%{[kubernetes][container][name]}"
"namespace" => "%{[kubernetes][namespace]}"
"pod" => "%{[kubernetes][pod][name]}"
}
replace => { "host" => "%{[kubernetes][node][name]}"}
}
}
mutate {
remove_field => ["tags"]
}
}
outputs:
main: |-
output {
loki {
url => "http://loki:3100/loki/api/v1/push"
#username => "test"
#password => "test"
}
# stdout { codec => rubydebug }
}
# proxy is currently only used by loki test pod
# Note: If http_proxy/https_proxy are set, then no_proxy should include the
# loki service name, so that tests are able to communicate with the loki
# service.
proxy:
http_proxy: ""
https_proxy: ""
no_proxy: ""Deploy using Helm
❯ helm upgrade --install --atomic loki-stack grafana/loki-stack --values values.yaml --kubeconfig=gc.kubeconfig --create-namespace --namespace=loki-stack
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: gc.kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: gc.kubeconfig
Release "loki-stack" does not exist. Installing it now.
W1203 13:36:48.286498 31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:48.592349 31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.840670 31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1203 13:36:55.849356 31990 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: loki-stack
LAST DEPLOYED: Sun Dec 3 13:36:45 2023
NAMESPACE: loki-stack
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a datasource in Grafana.
See http://docs.grafana.org/features/datasources/loki/ for more detail.
Verify
❯ KUBECONFIG=gc.kubeconfig kg all -n loki-stack
NAME READY STATUS RESTARTS AGE
pod/loki-stack-0 1/1 Running 0 89s
pod/loki-stack-grafana-dff58c989-jdq2l 2/2 Running 0 89s
pod/loki-stack-promtail-5xmrj 1/1 Running 0 89s
pod/loki-stack-promtail-cts5j 1/1 Running 0 89s
pod/loki-stack-promtail-frwvw 1/1 Running 0 89s
pod/loki-stack-promtail-wn4dw 1/1 Running 0 89s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/loki-stack ClusterIP 10.110.208.35 <none> 3100/TCP 90s
service/loki-stack-grafana ClusterIP 10.104.222.214 <none> 80/TCP 90s
service/loki-stack-headless ClusterIP None <none> 3100/TCP 90s
service/loki-stack-memberlist ClusterIP None <none> 7946/TCP 90s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/loki-stack-promtail 4 4 4 4 4 <none> 90s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/loki-stack-grafana 1/1 1 1 90s
NAME DESIRED CURRENT READY AGE
replicaset.apps/loki-stack-grafana-dff58c989 1 1 1 90s
NAME READY AGE
statefulset.apps/loki-stack 1/1 91s
❯ KUBECONFIG=gc.kubeconfig kg ing -n loki-stack
NAME CLASS HOSTS ADDRESS PORTS AGE
loki-stack-grafana nginx grafana-loki-vineethac-poc.test.com 10.216.24.45 80 7m16s
❯
In my case, I have an ingress controller and DNS resolution in place. If you don't have those configured, you can just port forward the loki-stack-grafana service to view the Grafana dashboard.
To get the username and password, decode the following secret:
❯ KUBECONFIG=gc.kubeconfig kg secrets -n loki-stack loki-stack-grafana -oyaml

Log in to the Grafana instance and verify the Data Sources section; it should already be configured. Now click on the Explore option and use the log browser to query logs.
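The secret's data fields are base64-encoded. A minimal sketch for decoding them, assuming the admin-user/admin-password keys that the Grafana chart uses by default:

```shell
# Kubernetes secret values are base64-encoded; this helper decodes one field.
decode_secret_field() { base64 -d; echo; }

# Hypothetical usage against the loki-stack-grafana secret (keys assumed from
# the Grafana chart defaults):
#   kubectl --kubeconfig=gc.kubeconfig -n loki-stack get secret loki-stack-grafana \
#     -o jsonpath='{.data.admin-user}' | decode_secret_field

printf 'YWRtaW4=' | decode_secret_field   # demo with a sample encoded value
```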
Hope it was useful. Cheers!
Sunday, July 17, 2022
vSphere with Tanzu using NSX-T - Part16 - Troubleshooting content library related issues
In this article, we will take a look at troubleshooting some of the content library related issues that you may encounter while managing/ administering vSphere with Tanzu clusters.
Case 1:
Following is a sample log for this issue from the vmop-controller-manager:
Warning CreateFailure 5m29s (x26 over 50m) vmware-system-vmop/vmware-system-vmop-controller-manager-85484c67b7-9jncl/virtualmachine-controller deploy from content library failed for image "ob-19344082-tkgs-ova-ubuntu-2004-v1.21.6---vmware.1-tkg.1": POST https://sc2-01-vcxx.xx.xxxx.com:443/rest/com/vmware/vcenter/ovf/library-item/id:8b34e422-cc30-4d44-9d78-367528df0622?~action=deploy: 500 Internal Server Error
Case 2:
❯ kubectl get tkr
No resources found
This could happen if there are duplicate content libraries with the same subscription URL present in the vCenter. If you find duplicate CLs, try removing them. If there are CLs that are not being used, consider deleting them. Also, try synchronizing the CL.
If this doesn't resolve the issue, try to delete and recreate the CL, and make sure you select the newly created CL under Cluster > Configure > Supervisor Cluster > General > Tanzu Kubernetes Grid Service > Content Library.
You may also check the vmware-system-vmop-controller-manager and capw-controller-manager pod logs. Verify whether those pods are running or getting continuously restarted. If required, you may restart those pods.
Case 3:
E0803 18:51:30.638787 1 vmprovider.go:155] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "vmName"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.638821 1 virtualmachine_controller.go:660] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.638851 1 virtualmachine_controller.go:358] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "name"="rkatz-testmigrationvm5/gc-lab-control-plane-kxwn2"
E0803 18:51:30.639301 1 controller.go:246] controller "msg"="Reconciler error" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": deploy error: The operation failed due to An error occurred during host configuration." "controller"="virtualmachine" "name"="gc-lab-control-plane-kxwn2" "namespace"="rkatz-testmigrationvm5" "reconcilerGroup"="vmoperator.xxxx.com" "reconcilerKind"="VirtualMachine"
get service cm-inventory
restart service cm-inventory

Case 4:
> Get-ContentLibrary | select Name,Id | fl
Name : wdc-01-vc18c01-wcp
Id : 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d
> kg contentsources
NAME AGE
0f00d3fa-de54-4630-bc99-aa13ccbe93db 173d
17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d 321d
451ce3f3-49d7-47d3-9a04-2839c5e5c662 242d
75e0668c-0cdc-421e-965d-fd736187cc57 173d
818c8700-efa4-416b-b78f-5f22e9555952 173d
9abbd108-aeb3-4b50-b074-9e6c00473b02 173d
a6cd1685-49bf-455f-a316-65bcdefac7cf 173d
acff9a91-0966-4793-9c3a-eb5272b802bd 242d
fcc08a43-1555-4794-a1ae-551753af9c03 173d

In the above sample case, you can see multiple contentsource objects but only one content library. So you can delete all the contentsource objects except 17209f4b-3f7f-4bcb-aeaf-fd0b53b66d0d.
Hope it was useful. Cheers!
Saturday, September 25, 2021
vSphere with Tanzu using NSX-T - Part11 - Troubleshooting Tanzu Kubernetes Clusters
In the previous posts we discussed the following:
- Part1 - Prerequisites
- Part2 - Configure NSX
- Part3 - Edge Cluster
- Part4 - Tier-0 Gateway and BGP peering
- Part5 - Tier-1 Gateway and Segments
- Part6 - Create tags, storage policy, and content library
- Part7 - Enable workload management
- Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
- Part9 - Monitoring
- Part10 - Upgrade Tanzu Kubernetes Cluster
In this article, we will go through some basic kubectl commands that may help you in troubleshooting Tanzu Kubernetes clusters. I have noticed cases where guest TKCs get stuck in the creating or updating phase.
List all TKCs that are stuck at creating/ updating:
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep creating
kubectl get tanzukubernetescluster --all-namespaces --sort-by="metadata.creationTimestamp" | grep updating
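The two greps above can be combined into one pass with an alternation; a small sketch, with sample lines standing in for real kubectl output:

```shell
# Filter TKCs stuck in either phase in one pass.
stuck() { grep -E 'creating|updating'; }

# sample lines standing in for 'kubectl get tanzukubernetescluster -A' output
stuck <<'EOF'
ns-a   gc1   running
ns-b   gc2   updating
ns-c   gc3   creating
EOF
```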
On newer versions of WCP, you may not see the TKC phase (creating/updating/running) in the kubectl output. I am using the following custom alias for it.

alias kgtkc='kubectl get tkc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:status.phase,CREATIONTIME:metadata.creationTimestamp,VERSION:spec.distribution.fullVersion,CP:spec.topology.controlPlane.replicas,WORKER:status.totalWorkerReplicas --sort-by="metadata.creationTimestamp"'
You can add it to your ~/.zshrc file and relaunch the terminal. Example usage:
% kgtkc | grep updating
c1nsxtest1-sla gc updating 2021-01-21T08:23:37Z v1.19.7+vmware.1-tkg.2.f52f85a 3 3
w2cei-sep20 gc updating 2021-09-16T17:48:07Z v1.20.9+vmware.1-tkg.1.a4cee5b 1 4
For TKCs that are stuck in the creating phase, some of the most common reasons are a lack of sufficient resources to provision the nodes, or waiting for IP allocation, etc. For TKCs stuck in the updating phase, it may be due to reconciliation issues: newly provisioned nodes may be waiting for an IP address, old nodes may be stuck in the drain phase, nodes may be in NotReady state, the specific OVA version may not be available in the content library, etc. You can try the following kubectl commands to get more insight into what's happening:
See events in a namespace:
kubectl get events -n <namespace>
See all events:
kubectl get events -A
Watch events in a namespace:
kubectl get events -n <namespace> -w
List the Cluster API resources supporting the clusters in the current namespace:
kubectl get cluster-api -n <namespace>
Describe TKC:
kubectl describe tkc <tkc_name> -n <namespace>
List TKC virtual machines in a namespace:
kubectl get vm -n <namespace>
List TKC virtual machines in a namespace with its IP:
kubectl get vm -n <namespace> -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
List all nodes of a cluster:
kubectl get nodes -o wide
List all pods that are not running:
kubectl get pods -A | grep -vi running
List health status of different cluster components:
kubectl get --raw '/healthz?verbose'
% kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
List all CRDs installed in your cluster and their API versions:
kubectl api-resources -o wide --sort-by="name"
List available Tanzu Kubernetes releases:
kubectl get tanzukubernetesreleases
List available virtual machine images:
kubectl get virtualmachineimages
List terminating namespaces:
kubectl get ns --field-selector status.phase=Terminating
You can ssh to the Tanzu Kubernetes cluster nodes as the system user following this:
https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-587E2181-199A-422A-ABBC-0A9456A70074.html
Here is an example where I have a TKC under namespace: vineetha-test05-deploy
% kubectl get tkc -n vineetha-test05-deploy
NAME CONTROL PLANE WORKER TKR NAME AGE READY TKR COMPATIBLE UPDATES AVAILABLE
gc 1 3 v1.20.9---vmware.1-tkg.1.a4cee5b 4d5h True True [1.21.2+vmware.1-tkg.1.ee25d55]
% kubectl get vm -n vineetha-test05-deploy -o json | jq -r '[.items[] | {namespace:.metadata.namespace, name:.metadata.name, internalIP: .status.vmIp}]'
[
{
"namespace": "vineetha-test05-deploy",
"name": "gc-control-plane-ttkmt",
"internalIP": "172.29.4.194"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-d286z",
"internalIP": "172.29.4.195"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-hwr8b",
"internalIP": "172.29.4.197"
},
{
"namespace": "vineetha-test05-deploy",
"name": "gc-workers-7fcql-6f984fdd59-r99x7",
"internalIP": "172.29.4.196"
}
]
Following is a sample yaml for a jumpbox pod (the gc-ssh secret holds the TKC's SSH private key):

apiVersion: v1
kind: Pod
metadata:
name: jumpbox
namespace: vineetha-test05-deploy #REPLACE
spec:
containers:
- image: "photon:3.0"
name: jumpbox
command: [ "/bin/bash", "-c", "--" ]
args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
volumeMounts:
- mountPath: "/root/ssh"
name: ssh-key
readOnly: true
resources:
requests:
memory: 2Gi
volumes:
- name: ssh-key
secret:
secretName: gc-ssh #REPLACE
Once you apply the above yaml, you can see the jumpbox pod.
% kubectl get pod -n vineetha-test05-deploy
NAME READY STATUS RESTARTS AGE
jumpbox 1/1 Running 0 22m
Now, you can connect to the TKC node with its internal IP.
% kubectl -n vineetha-test05-deploy exec -it jumpbox -- /usr/bin/ssh vmware-system-user@172.29.4.194
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Last login: Mon Nov 22 16:36:40 2021 from 172.29.4.34
16:50:34 up 4 days, 5:49, 0 users, load average: 2.14, 0.97, 0.65
26 Security notice(s)
Run 'tdnf updateinfo info' to see the details.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ hostname
gc-control-plane-ttkmt
You can check the status of control plane pods using crictl ps.
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
bde228417c55a 9000c334d9197 4 days ago Running guest-cluster-auth-service 0 d7abf3db8670d
bc4b8c1bf0e33 a294c1cf07bd6 4 days ago Running metrics-server 0 2665876cf939e
46a94dcf02f3e 92cb72974660c 4 days ago Running coredns 0 7497cdf3269ab
f7d32016d6fb7 f48f23686df21 4 days ago Running csi-resizer 0 b887d394d4f80
ef80f62f3ed65 2cba51b244f27 4 days ago Running csi-provisioner 0 b887d394d4f80
64b570add2859 4d2e937854849 4 days ago Running liveness-probe 0 b887d394d4f80
c0c1db3aac161 d032188289eb5 4 days ago Running vsphere-syncer 0 b887d394d4f80
e4df023ada129 e75228f70c0d6 4 days ago Running vsphere-csi-controller 0 b887d394d4f80
e79b3cfdb4143 8a857a48ee57f 4 days ago Running csi-attacher 0 b887d394d4f80
96e4af8792cd0 b8bffc9e5af52 4 days ago Running calico-kube-controllers 0 b5e467a43b34a
23791d5648ebb 92cb72974660c 4 days ago Running coredns 0 9bde50bbfb914
0f47d11dc211b ab1e2f4eb3589 4 days ago Running guest-cluster-cloud-provider 0 fde68175c5d95
5ddfd46647e80 4d2e937854849 4 days ago Running liveness-probe 0 1a88f26173762
578ddeeef5bdd e75228f70c0d6 4 days ago Running vsphere-csi-node 0 1a88f26173762
3fcb8a287ea48 9a3d9174ac1e7 4 days ago Running node-driver-registrar 0 1a88f26173762
91b490c14d085 dc02a60cdbe40 4 days ago Running calico-node 0 35cf458eb80f8
68dbbdb779484 f7ad2965f3ac0 4 days ago Running kube-proxy 0 79f129c96e6e1
ef423f4aeb128 75bfe47a404bb 4 days ago Running docker-registry 0 752724fbbcd6a
26dd8e1f521f5 9358496e81774 4 days ago Running kube-apiserver 0 814e5d2be5eab
62745db4234e2 ab8fb8e444396 4 days ago Running kube-controller-manager 0 94543f93f7563
f2fc30c2854bd 9aa6da547b7eb 4 days ago Running etcd 0 f0a756a4cdc09
b8038e9f90e15 212d4c357a28e 4 days ago Running kube-scheduler 0 533a44c70e86c
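To quickly spot control-plane containers that have exited or are crash-looping, you can list all containers and filter out the running ones. A sketch (sample rows below stand in for real `sudo crictl ps -a` output):

```shell
# Print containers that are not in the Running state; the header row (NR==1)
# is skipped.
not_running() { awk 'NR > 1 && $0 !~ /Running/'; }

# sample rows standing in for 'sudo crictl ps -a' output; the Exited row
# is fabricated for illustration
not_running <<'EOF'
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
aaa img1 4 days ago Running kube-apiserver 0 podA
bbb img2 4 days ago Exited etcd 3 podB
EOF
```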
You can check the status of kubelet and containerd services:
sudo systemctl status kubelet.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset:>
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2021-11-18 11:01:54 UTC; 4 days ago
Docs: http://kubernetes.io/docs/
Main PID: 2234 (kubelet)
Tasks: 16 (limit: 4728)
Memory: 88.6M
CGroup: /system.slice/kubelet.service
└─2234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boots>
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.065785 >
Nov 22 16:32:06 gc-control-plane-ttkmt kubelet[2234]: W1122 16:32:06.067045 >
sudo systemctl status containerd.service
vmware-system-user@gc-control-plane-ttkmt [ ~ ]$ sudo systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor pres>
Active: active (running) since Thu 2021-11-18 11:01:23 UTC; 4 days ago
Docs: https://containerd.io
Main PID: 1783 (containerd)
Tasks: 386 (limit: 4728)
Memory: 639.3M
CGroup: /system.slice/containerd.service
├─ 1783 /usr/local/bin/containerd
├─ 1938 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
├─ 1939 containerd-shim -namespace k8s.io -workdir /var/lib/containe>
If you have issues related to the provisioning/ deployment of TKC, you can check the logs present in the CP node:
vmware-system-user@gc-control-plane-ttkmt [ /var/log ]$ ls
audit devicelist sa vmware-vgauthsvc.log.0
auth.log journal sgidlist vmware-vmsvc-root.log
btmp kubernetes stigreport.log vmware-vmtoolsd-root.log
cloud-init.log lastlog suidlist wtmp
cloud-init-output.log pods tallylog
containers private vmware-imc
cron rpmcheck vmware-network.log
Following is a great VMware blog series/ videos covering the different resources involved in the deployment process and troubleshooting aspects of TKCs that are provisioned using the TKG service running on the supervisor cluster.
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-1
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-2
https://core.vmware.com/blog/tanzu-kubernetes-grid-service-troubleshooting-deep-dive-part-3
Hope it was useful. Cheers!