This blog series captures practical learnings from working with GPUs in real‑world environments, with a focus on operations, reliability, and scale. Each post deep‑dives into specific aspects of GPU systems based on hands‑on experience, incidents, and operational challenges. Together, these articles aim to share actionable insights, highlight common pitfalls, and help teams build more robust and predictable GPU operations.
A blog on the evolving infrastructure stack - virtualization, Kubernetes, and GPUs.
Saturday, March 7, 2026
Thursday, July 4, 2024
vSphere with Tanzu using NSX-T - Part35 - Monitoring supervisor cluster health with Python and vCenter APIs
You can access the Python script from my GitHub repository: https://github.com/vineethac/VMware/tree/main/vSphere_with_Tanzu/wcp_cluster_health
This script connects to the vCenter server, retrieves the cluster summary, checks the Tanzu Supervisor cluster configuration info, and prints the status of each cluster. With it, you can monitor the health of your Tanzu Supervisor clusters through the vCenter APIs.
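To make the flow concrete, here is a minimal sketch of the same idea using only the Python standard library. It is not the script from the repository. It assumes the vCenter REST endpoints `POST /api/session` and `GET /api/vcenter/namespace-management/clusters`, and that each cluster summary carries `cluster_name`, `config_status`, and `kubernetes_status` fields; verify these against the API reference for your vCenter version.

```python
import json
import ssl
import urllib.request
from base64 import b64encode


def create_session(vcenter: str, user: str, password: str, ctx: ssl.SSLContext) -> str:
    """Log in with basic auth and return the session token (assumed endpoint)."""
    req = urllib.request.Request(f"https://{vcenter}/api/session", method="POST")
    credentials = b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)  # the response body is a JSON string holding the token


def list_supervisor_clusters(vcenter: str, session_id: str, ctx: ssl.SSLContext) -> list:
    """Fetch the Supervisor cluster summaries (assumed endpoint)."""
    url = f"https://{vcenter}/api/vcenter/namespace-management/clusters"
    req = urllib.request.Request(url, headers={"vmware-api-session-id": session_id})
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)


def unhealthy_clusters(summaries: list) -> list:
    """Return the names of clusters that are not RUNNING and READY."""
    return [
        s.get("cluster_name", "unknown")
        for s in summaries
        if s.get("config_status") != "RUNNING" or s.get("kubernetes_status") != "READY"
    ]


# Offline demonstration with made-up summaries, so no vCenter is needed to try it.
sample_summaries = [
    {"cluster_name": "cluster-a", "config_status": "RUNNING", "kubernetes_status": "READY"},
    {"cluster_name": "cluster-b", "config_status": "ERROR", "kubernetes_status": "READY"},
]
print(unhealthy_clusters(sample_summaries))
```

Against a real vCenter, you would call `create_session`, pass the token to `list_supervisor_clusters`, and feed the result to `unhealthy_clusters`. Keeping the health check as a pure function makes it easy to test without a live environment.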
Sunday, December 3, 2023
Kubernetes mini project
In this mini project, we are going to learn the following:
- Deploy a simple Python based web application on a Kubernetes cluster.
- We will use Helm to deploy this app.
- This web app uses FastAPI and exposes some metrics using the Prometheus Python client.
- To store and visualize these metrics we will deploy Prometheus and Grafana in the K8s cluster.
- We will also deploy and use an ingress controller for exposing the web app, Prometheus, and Grafana to external users.
- For logging we will deploy and use Grafana Loki stack.
Full project in my GitHub
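To give a feel for what Prometheus scrapes from the app, here is a dependency-free sketch of the text exposition format that a `/metrics` endpoint serves. The real project uses FastAPI and the Prometheus Python client to produce this output; the metric name `app_requests_total` and the `path` label below are hypothetical and only for illustration.

```python
from collections import Counter

# Hypothetical metric name; the real app defines its own metrics.
METRIC = "app_requests_total"
request_counts = Counter()


def record_request(path: str) -> None:
    """Count one request for the given path."""
    request_counts[path] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Total HTTP requests by path.",
        f"# TYPE {METRIC} counter",
    ]
    for path, count in sorted(request_counts.items()):
        lines.append(f'{METRIC}{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


# Simulate a few requests and print what a scrape would return.
record_request("/")
record_request("/")
record_request("/health")
print(render_metrics())
```

In the actual app, the client library keeps the counters and renders this format for you, so you only declare the metric and increment it in your request handlers.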
High-level steps to complete this project
Step1: Write the Python app.
Step2: Create the Dockerfile for the app.
Step3: Create the container image.
Step4: Push the container image to an image registry like Docker Hub.
Step5: Get access to a K8s cluster.
Step6: Deploy an ingress controller.
Step7: Create the Helm chart for your app and deploy it to the K8s cluster.
Step8: Deploy Prometheus stack on the K8s cluster using Helm.
Step9: Create a servicemonitor resource which defines the target to be monitored by Prometheus.
Step10: Verify targets and service discovery in Prometheus.
Step11: Configure Grafana dashboard and verify.
Step12: Deploy Grafana Loki stack using Helm.
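For Step9, a ServiceMonitor can be sketched as follows. This small Python helper builds the manifest and prints it as JSON, which `kubectl apply -f -` accepts just like YAML. The resource name, the `app` label, the port name `http`, and the `release: prometheus` label are assumptions for illustration; they must match your own Service and the selector used by your Prometheus installation.

```python
import json


def service_monitor(name: str, app_label: str, port_name: str, release: str) -> dict:
    """Build a ServiceMonitor manifest for the Prometheus Operator."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        # The release label is assumed to be what Prometheus selects on.
        "metadata": {"name": name, "labels": {"release": release}},
        "spec": {
            # Selects the Service(s) carrying this label.
            "selector": {"matchLabels": {"app": app_label}},
            # 'port' refers to the named port on the Service.
            "endpoints": [{"port": port_name, "path": "/metrics", "interval": "30s"}],
        },
    }


# All values below are hypothetical placeholders.
manifest = service_monitor("fastapi-app", "fastapi-app", "http", "prometheus")
print(json.dumps(manifest, indent=2))
```

If the target does not show up in Step10, the usual suspects are a label mismatch between the ServiceMonitor selector and the Service, or a ServiceMonitor label that Prometheus is not configured to select.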
Hope it was useful. Cheers!
Friday, September 22, 2023
Configure syslog forwarding in vCenter servers using Python
You can access the Python script from my GitHub repository:
https://github.com/vineethac/VMware/tree/main/vCenter/syslog_forwarding
In this blog, we've demonstrated how to get, test, and set syslog forwarding configuration in vCenter servers using Python. By following these steps, you can ensure that your vCenter servers are properly configured to collect and forward system logs to a central location for monitoring and analysis. Remember to replace the placeholders in the config file with your actual vCenter server names, syslog server IP address or hostname, port, and protocol.
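As a rough illustration of the "set" step, the sketch below validates the config values and builds a request body. It is not the script from the repository. It assumes the appliance API endpoint `PUT /api/appliance/logging/forwarding` with a `cfg_list` body and the protocol values `TLS`, `UDP`, `TCP`, and `RELP`; check these against the API reference for your vCenter version. The hostname is a placeholder.

```python
import json

# Protocol values assumed to be accepted by the appliance logging API.
ALLOWED_PROTOCOLS = {"TLS", "UDP", "TCP", "RELP"}


def forwarding_payload(servers: list) -> dict:
    """Validate syslog server entries and build the request body."""
    cfg_list = []
    for server in servers:
        protocol = str(server["protocol"]).upper()
        if protocol not in ALLOWED_PROTOCOLS:
            raise ValueError(f"unsupported protocol: {server['protocol']}")
        port = int(server["port"])
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
        cfg_list.append(
            {"hostname": server["hostname"], "port": port, "protocol": protocol}
        )
    return {"cfg_list": cfg_list}


# Placeholder values, as they might appear in a config file.
payload = forwarding_payload(
    [{"hostname": "syslog.example.com", "port": "514", "protocol": "udp"}]
)
print(json.dumps(payload, indent=2))
```

Validating before sending means a typo in the config file fails fast on your side instead of producing an API error for each vCenter server in the list.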
Hope it was useful. Cheers!
Sunday, July 23, 2023
Kubernetes 101 - Part11 - Find Kubernetes nodes with DiskPressure
jq:
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[].reason=="KubeletHasDiskPressure") | .metadata.name'
jsonpath:
kubectl get nodes -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.status.conditions[?(@.type=="DiskPressure")].status} {" "} {"\n"}'
❯ kubectl get no
NAME STATUS ROLES AGE VERSION
tkc-btvsm-72hz2 Ready control-plane,master 124d v1.23.8+vmware.3
tkc-btvsm-79xtn Ready control-plane,master 124d v1.23.8+vmware.3
tkc-btvsm-klmjz Ready control-plane,master 124d v1.23.8+vmware.3
tkc-workers-2cmvm-5bfcc5c9cd-gmv6m Ready <none> 5d17h v1.23.8+vmware.3
tkc-workers-2cmvm-5bfcc5c9cd-m44sq Ready <none> 5d17h v1.23.8+vmware.3
tkc-workers-2cmvm-5bfcc5c9cd-mjjlk Ready <none> 5d17h v1.23.8+vmware.3
tkc-workers-2cmvm-5bfcc5c9cd-wflrl Ready <none> 5d17h v1.23.8+vmware.3
tkc-workers-2cmvm-5bfcc5c9cd-xnqvk Ready <none> 5d17h v1.23.8+vmware.3
❯
❯
❯ kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[].reason=="KubeletHasDiskPressure") | .metadata.name'
tkc-workers-2cmvm-5bfcc5c9cd-m44sq
tkc-workers-2cmvm-5bfcc5c9cd-wflrl
❯
❯ kubectl get nodes -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.status.conditions[?(@.type=="DiskPressure")].status} {" "} {"\n"}'
tkc-btvsm-72hz2 False
tkc-btvsm-79xtn False
tkc-btvsm-klmjz False
tkc-workers-2cmvm-5bfcc5c9cd-gmv6m False
tkc-workers-2cmvm-5bfcc5c9cd-m44sq True
tkc-workers-2cmvm-5bfcc5c9cd-mjjlk False
tkc-workers-2cmvm-5bfcc5c9cd-wflrl True
tkc-workers-2cmvm-5bfcc5c9cd-xnqvk False
%
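If you prefer to do the same filtering in Python, for example inside a larger health-check script, here is a minimal sketch. It works on the JSON that `kubectl get nodes -o json` returns and matches on the condition type and status rather than the reason string. The sample data is trimmed down and reuses two node names from the output above.

```python
import json


def nodes_with_disk_pressure(node_list: dict) -> list:
    """Return names of nodes whose DiskPressure condition is True."""
    names = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                names.append(node["metadata"]["name"])
                break  # one match per node is enough
    return names


# Trimmed-down sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "tkc-workers-2cmvm-5bfcc5c9cd-gmv6m"},
            "status": {"conditions": [{"type": "DiskPressure", "status": "False"}]},
        },
        {
            "metadata": {"name": "tkc-workers-2cmvm-5bfcc5c9cd-m44sq"},
            "status": {"conditions": [{"type": "DiskPressure", "status": "True"}]},
        },
    ]
}
print(nodes_with_disk_pressure(sample))
```

To run it against a live cluster, replace `sample` with `json.load(sys.stdin)` (after `import sys`) and pipe the `kubectl get nodes -o json` output into the script.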
Hope it was useful. Cheers!
Sunday, June 27, 2021
vSphere with Tanzu using NSX-T - Part9 - Monitoring
In the previous posts we discussed the following:
Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
In this article, I will explain some of the popular tools used for monitoring Kubernetes clusters. These tools provide insight into the different objects in K8s, their status, metrics, logs, and so on.
- Lens
- Octant
- Prometheus and Grafana
- vROps and Kubernetes Management Pack
- Kubebox
-Lens-
Download the Lens binary file from: https://k8slens.dev/
I am installing it on a Windows server. Once the installation is complete, the first thing you have to do is to provide the Kube config file details so that Lens can connect to the Kubernetes cluster and start monitoring it.
Add Cluster
Click File - Add Cluster
You can either browse and select the Kube config file or you can paste the content of your Kube config file as text. I am just pasting it as text.
Once you have pasted your Kube config file contents, make sure to select the context, and then click Add cluster.
-Octant-
https://vineethac.blogspot.com/2020/08/visualize-your-kubernetes-clusters-and.html
-Prometheus and Grafana-
-vROps and Kubernetes Management Pack-
https://rudimartinsen.com/2021/03/07/vrops-kubernetes-mgmt-pack/
https://www.brockpeterson.com/post/vrops-management-pack-for-kubernetes
-Kubebox-
curl -Lo kubebox https://github.com/astefanutti/kubebox/releases/download/v0.9.0/kubebox-linux && chmod +x kubebox
This will show the selected pod metrics and logs.
Saturday, January 23, 2021
Benchmarking Kubernetes infrastructure using K-Bench
K-Bench is a framework to benchmark the control and data plane aspects of a Kubernetes cluster. More details are available at https://github.com/vmware-tanzu/k-bench. In my case, I am going to conduct this benchmarking study on a Tanzu Kubernetes cluster which is provisioned using Tanzu Kubernetes Grid service on a vSphere 7 U1 cluster.
Step 1: Clone the K-Bench repo
git clone https://github.com/vmware-tanzu/k-bench.git
Step 2: Install K-Bench
cd k-bench
./install.sh
Once the installation is done, it will say "Completed k-bench installation."
Step 3: Run the benchmark
./run.sh
Usage: ./run.sh -r <run-tag> [-t <comma-separated-tests> -o <output-dir>]
Related posts
Storage performance benchmarking of Tanzu Kubernetes clusters
Monitoring Tanzu Kubernetes cluster using Prometheus and Grafana
Friday, January 1, 2021
Dell EMC PowerFlex MP for vROps 8.x - Part7 - Create custom reports
In March 2020, I published a blog on how to create custom views and reports in vROps 8.x. This article explains how to create a custom storage report for Dell EMC PowerFlex using the PowerFlex Management Pack for vROps 8.x.
A sample PowerFlex Storage Report PDF and its template are available for download in my GitHub repo. You can use them as a starting point and modify them as required.
To create a new view: Dashboards - Views - Add.
Provide a name and description for the new view. Here, for example, I will create a view that shows PowerFlex Protection Domain Info.
You can also select and change the units and transformation as per requirements. Once it is done, click Save.
- Provide a name and description for the new report template.
- From the views and dashboards, find the PowerFlex Protection Domain Info view that we created earlier, and double-click it or drag and drop it to the right pane. You can add multiple views to this report template.
- Select PDF and CSV.
- Select any layout options you want and click Save.
- Now the custom report template is created. You can select it and click Run.
Tuesday, December 29, 2020
Dell EMC PowerFlex MP for vROps 8.x Blog Series
Image source: infohub.delltechnologies.com/section-assets/powerflex-vrops-infographics
Consolidating all my blogs on Dell EMC PowerFlex Management Pack for vROps 8.x.
Part1: Install
Part2: Configure
Part3: Dashboards
Part4: Resource kinds and relationships
Part5: Collection interval
Part6: Create custom alerts
Part7: Create custom reports
Tuesday, December 15, 2020
Dell EMC PowerFlex MP for vROps 8.x - Part6 - Create custom alerts
In this post, we will take a look at creating custom alerts for PowerFlex by adding symptom definitions and alert definitions. Refer to my previous blog post to understand more about the alerting aspects in vROps. Here we will take an example scenario and see how we can create custom symptom definitions and alert definitions.
Scenario
Step1: Add Symptom Definitions
- Select the metric User Data SDC Read Latency (ms): double click on it twice so that you can define both warning and critical symptoms.
- Select the metric User Data SDC Write Latency (ms): double click on it twice so that you can define both warning and critical symptoms.
Step2: Add Alert Definitions
- Provide alert name, select the base object type and advanced settings and click Next.
- Filter and search the symptoms that we created earlier. Drag and drop the two volume read latency related symptoms and select Any. Click Next.
- If you want to provide any recommendations, you can add them in this step and click Next.
- Select vSphere Solution's Default Policy and click Next and click Create.
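To picture how the symptom and alert definitions above fit together, here is a small Python sketch of the logic. It only models the idea and is not how vROps is configured or queried. The threshold values and the metric key `read_latency_ms` are hypothetical; use the warning and critical values that suit your environment.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    name: str
    metric: str
    threshold_ms: float
    severity: str


# Hypothetical thresholds, for illustration only.
SYMPTOMS = [
    Symptom("Volume read latency warning", "read_latency_ms", 5.0, "warning"),
    Symptom("Volume read latency critical", "read_latency_ms", 10.0, "critical"),
]


def triggered(symptoms: list, sample: dict) -> list:
    """Return the symptoms whose threshold is exceeded by the sample."""
    return [s for s in symptoms if sample.get(s.metric, 0.0) > s.threshold_ms]


def alert_fires(symptoms: list, sample: dict) -> bool:
    """'Any' semantics: a single matching symptom is enough to raise the alert."""
    return bool(triggered(symptoms, sample))


sample = {"read_latency_ms": 7.0}
print([s.severity for s in triggered(SYMPTOMS, sample)], alert_fires(SYMPTOMS, sample))
```

With Any selected, the alert is raised as soon as the warning symptom is true, and the critical symptom adds to it when latency climbs further. Selecting All instead would require both symptoms to be true at the same time.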
Now we are all done. Let's test the alerts! I am using FIO to generate IO load on one of the PowerFlex volumes.