This blog series captures practical learnings from working with GPUs in real‑world environments, with a focus on operations, reliability, and scale. Each post deep‑dives into specific aspects of GPU systems based on hands‑on experience, incidents, and operational challenges. Together, these articles aim to share actionable insights, highlight common pitfalls, and help teams build more robust and predictable GPU operations.
A blog on the evolving infrastructure stack - virtualization, Kubernetes, and GPUs.
Sunday, January 25, 2026
Working with GPUs - Part3 - Using dcgmi
The NVIDIA Data Center GPU Manager (DCGM) is a lightweight agent that handles GPU behavior monitoring, health checks and diagnostics, policy management, and more. DCGM is the underlying framework; when you install it, it runs a service called the Host Engine, which collects data, monitors health, and manages GPUs. dcgmi is simply the CLI tool (the interface) used to talk to the engine: if you want to know what the engine is seeing, or you want to tell the engine to do something, you use dcgmi.
Install DCGM
- In my case, I am installing it on an Ubuntu 22.04.2 host.
- You can download the required binaries from this Nvidia repository.
- If there are previous versions of the package installed, you can follow this documentation and remove them.
- For example, I am installing version 4.5.2, which is compatible with CUDA 13.0.
- Download the following packages from the repository mentioned above:
- datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb
- Install them.
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb
# Enable the dcgm service
systemctl --now enable nvidia-dcgm
systemctl start nvidia-dcgm
# Check if the service is active
systemctl is-active --quiet nvidia-dcgm
- Verify the installed version using: dcgmi --version
# dcgmi --version
dcgmi version: 4.5.2
- List all GPUs discovered by the host engine: dcgmi discovery -l
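Once the host engine is up, you can also create GPU groups and enable background health watches. The following is a hedged sketch; flag names can vary slightly across DCGM versions, so confirm with dcgmi group --help and dcgmi health --help. The group name and the group-id placeholder are illustrative only.
# Create a group with all eight GPUs (IDs as shown by dcgmi discovery -l)
dcgmi group -c all-gpus -a 0,1,2,3,4,5,6,7
# List groups to note the group ID assigned above
dcgmi group -l
# Enable all background health watches for that group, then run a health check
dcgmi health -g <group-id> -s a
dcgmi health -g <group-id> -c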
DCGM Diagnostics
Diagnostics is a subsystem within DCGM designed to stress-test and validate the physical and logical integrity of the GPU. It is a suite of automated tests that push the GPU beyond its normal idle state to uncover hidden hardware defects, driver instabilities, or environmental issues (like poor cooling or failing power supplies). In production environments, this utility helps assess cluster readiness before workloads are deployed. It supports multiple run levels, as explained below.
- Level 1: a quick sanity check, typically run before starting a container or job to ensure the GPU is "alive."
- Level 2: used for analyzing failures and getting more context about them.
- Level 3/4: used for extensive hardware screening (e.g., thermal throttling checks, bandwidth checks, etc.), as shown in the example invocations below.
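For reference, the corresponding invocations are shown below. Run times grow with the level, and levels 3 and 4 on a multi-GPU node can take a significant amount of time. The comments reflect what the run-level descriptions above and the sample outputs further down show.
dcgmi diag -r 1   # quick sanity check
dcgmi diag -r 2   # adds more context around failures
dcgmi diag -r 3   # extensive hardware screening
dcgmi diag -r 4   # extended screening; adds tests such as memtest and pulse_test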
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.5.2 |
| Driver Version Detected | 580.105.08 |
| GPU Device IDs Detected | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|----- Deployment --------+------------------------------------------------|
| software | Fail |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU3: Fail |
| Warning: GPU3 | Page Retirement/Row Remap: GPU 3 had uncorrec |
| | table memory errors and row remapping failed. |
| | Run a field diagnostic on the GPU. |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
+----- Hardware ----------+------------------------------------------------+
| memory | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| diagnostic | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| nvbandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Integration -------+------------------------------------------------+
| pcie | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Stress ------------+------------------------------------------------+
| memory_bandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_stress | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_power | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+---------------------------+------------------------------------------------+
# dcgmi diag -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.5.2 |
| Driver Version Detected | 580.105.08 |
| GPU Device IDs Detected | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|----- Deployment --------+------------------------------------------------|
| software | Fail |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU3: Fail |
| Warning: GPU3 | Page Retirement/Row Remap: GPU 3 had uncorrec |
| | table memory errors and row remapping failed. |
| | Run a field diagnostic on the GPU. |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
+----- Hardware ----------+------------------------------------------------+
| memory | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| diagnostic | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| nvbandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| pulse_test | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Integration -------+------------------------------------------------+
| pcie | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Stress ------------+------------------------------------------------+
| memtest | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| memory_bandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_stress | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_power | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+---------------------------+------------------------------------------------+
Here you can see that GPU 3 had uncorrectable memory errors and row remapping failed; the remaining tests were skipped for that GPU.
The dcgmi diag utility consists of multiple plugins, as detailed below. Based on the selected run level, the respective plugins are used to conduct the tests, and individual plugins can also be run by name, as shown after the list.
- Deployment: verifies the compute environment is ready to run CUDA applications and is able to load the NVML library.
- Diagnostic: performs large matrix multiplications. This will stress the GPU by having it draw a large amount of power and provide a high-level of throughput for five minutes (by default). During this process, the GPU will be monitored for all standard errors like XIDs, temperature violations, uncorrectable memory errors, etc. as well as the correctness of data being written and read.
- PCIe - GPU bandwidth: the purpose of this plugin is to stress communication from the host to the GPUs as well as among the GPUs on the system. It will use NVLink to communicate between GPUs when possible; otherwise, communication between GPUs will occur over PCIe.
- GPU memory: performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues.
- Targeted power: This test’s core purpose is to sustain a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU has a large workload that is sustained throughout the test; the workload does not pulse.
- Targeted stress: maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.
- Memtest diagnostic: this is similar to memtest86, which will exercise GPU memory with various test patterns.
- Pulse test: this is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.
- NVbandwidth: performs bandwidth measurements on NVIDIA GPUs on a single host.
- Memory bandwidth: It measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. During this process, the GPU will be monitored for memory errors, CUDA errors, and performance thresholds.
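In addition to the numeric run levels, recent DCGM releases also let you pass individual test names to -r. This is a hedged example; check dcgmi diag --help on your version for the exact test names it accepts.
# Run only the PCIe and targeted power plugins
dcgmi diag -r pcie,targeted_power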
Note: It is highly recommended to run these diagnostic tests while the node is in maintenance mode or when no active workloads (such as training jobs or inference services) are running on the GPU. Attempting to run higher-level diagnostics (especially levels 3 and 4) on an active node is a recipe for trouble: the diagnostic tests will likely fail to get the resources they need, and the contention for compute engines and VRAM may cause your production workloads to crash.
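For example, on a Kubernetes-managed GPU node you might cordon and drain the node before running the diagnostics and uncordon it afterwards. The node name below is hypothetical.
kubectl cordon gpu-node-01
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data
# run dcgmi diag -r 3 or -r 4, then bring the node back
kubectl uncordon gpu-node-01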
Saturday, October 18, 2025
Working with GPUs - Part1 - Using nvidia-smi
GPUs are the backbone of modern AI and HPC clusters, and understanding their basic health and configuration is the first step toward reliable operations. In this first post of the series, we start with nvidia-smi - the primary tool for discovering Nvidia GPUs, validating drivers and CUDA versions, and performing essential health checks. These fundamentals form the baseline for monitoring, performance benchmarking, and troubleshooting GPU compute nodes at scale.
Verify version
nvidia-smi --version
nvidia-smi -L
Current state of every GPU
nvidia-smi
Following are the key observations from the above output:
- All 8 GPUs detected (NVIDIA H100 80GB HBM3). Confirms full hardware enumeration.
- HBM3 memory present (80GB per GPU). Validates expected SKU (H100 SXM vs PCIe). This is important because SXM GPUs behave differently in power, cooling, and NVLink bandwidth; troubleshooting playbooks differ by form factor.
- Driver version 550.90.07 with CUDA compatibility 12.4. This confirms a Hopper‑supported, production‑ready driver stack. Many issues (NCCL failures, DCGM errors, framework crashes) trace back to unsupported driver–CUDA combinations.
- Persistence Mode: On. This avoids GPU driver reinitialization delays and flaky behavior between jobs. Turning this off in clusters can cause intermittent job start failures or longer warm‑up times.
- Temperatures in 34–41 °C range at idle. This indicates healthy cooling and airflow. High idle temperatures point to heatsink issues, airflow obstructions, fan/BMC problems, or thermal paste degradation.
- Performance State: P0 (highest performance). This shows GPUs are not power or thermally‑throttled. If GPUs remain in lower P‑states under load, suspect thermal limits, power caps, or firmware misconfigurations.
- Power usage ~70–76 W with cap at 700 W. This confirms ample power headroom and no throttling. GPUs hitting the power cap during load may show reduced performance even when utilization appears high.
- GPU utilization at 0% and no running processes. This confirms the node is idle and clean. Useful to rule out “ghost” workloads, leaked CUDA contexts, or stuck processes when diagnosing performance drops.
- Memory usage ~1 MiB per GPU. Only driver bookkeeping allocations present. Any significant memory use at idle suggests leftover processes or failed container teardown.
- Volatile Uncorrected ECC errors: 0. Confirms memory integrity. Any non‑zero uncorrected ECC errors are serious and usually justify isolating the GPU and starting RMA/vendor diagnostics.
- MIG mode: Disabled. Ensures full GPU and NVLink bandwidth availability. MIG partitions can severely impact NCCL and large‑model training if enabled unintentionally.
- Compute mode: Default. Allows multiple processes (expected in shared clusters). Exclusive modes can cause unexpected job failures or scheduling issues.
- Fan: N/A (SXM platform). Normal for chassis‑controlled cooling. Fan values appearing unexpectedly may indicate incorrect sensor readings or platform misidentification.
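Most of these observations can also be pulled as a single scriptable query instead of parsing the default output. This is a hedged example; field names can differ slightly between driver versions, and nvidia-smi --help-query-gpu lists the ones your driver supports.
nvidia-smi --query-gpu=index,name,persistence_mode,pstate,temperature.gpu,power.draw,power.limit,utilization.gpu,memory.used,ecc.errors.uncorrected.volatile.total --format=csv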
Health metrics of all GPUs
nvidia-smi -q
- Serial Number
- VBIOS Version
- GPU Part Number
- Utilization
- ECC Errors
- Temperature, etc.
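If the full -q output is too verbose, you can restrict it to specific sections with the -d flag, for example:
nvidia-smi -q -d ECC,TEMPERATURE,POWER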
Query GPU health metrics
GPU temperature status: nvidia-smi --query-gpu=index,name,uuid,temperature.gpu,temperature.gpu.tlimit,temperature.memory --format=csv
GPU reset state: nvidia-smi --query-gpu=index,name,uuid,reset_status.reset_required,reset_status.drain_and_reset_recommended --format=csv
- reset_status.reset_required - indicates whether the GPU must be reset to return to a clean operational state.
- reset_status.drain_and_reset_recommended - A value of Yes indicates the GPU/node should be drained first and then reset; No indicates the reset can be performed immediately.
- Note: In production GPU clusters based on Kubernetes, the safest and recommended practice is to always drain the node before attempting GPU recovery. For H100 SXM systems, recovery is performed via node reboot, not individual GPU resets.
NVLink topology
nvidia-smi topo -m
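Beyond the topology matrix, the nvlink subcommand can report per-link state and error counters, which is useful when topo -m looks fine but NCCL performance is poor. This is a hedged example; flag support varies by driver version, so check nvidia-smi nvlink --help.
# Per-GPU NVLink state and per-link error counters
nvidia-smi nvlink --status
nvidia-smi nvlink --errorcounters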
Friday, August 15, 2025
Understanding NUMA: Its Impact on VM Performance in ESXi
VMware ESXi hosts use Non-Uniform Memory Access (NUMA) architecture to optimize CPU and memory locality. Each NUMA node consists of a subset of CPUs and memory. Accessing local memory within the same NUMA node is significantly faster than remote memory access. Misaligned NUMA configurations can lead to latency spikes, increased CPU Ready Time, and degraded VM performance.
Key symptoms
The common symptoms for Virtual Machines (VMs) on ESXi that have a misconfigured or misaligned Non-Uniform Memory Access (NUMA) configuration primarily manifest as performance degradation and latency. The main issue caused by NUMA misalignment is that the VM's vCPUs end up frequently having to access memory that belongs to a different physical NUMA node on the ESXi host (known as Remote Access), which is significantly slower than accessing local memory.
The resulting symptoms for the VM include:
- Overall Slowness and Unresponsiveness: Services and applications running inside the guest OS may respond slowly or intermittently. The entire VM can feel sluggish.
- High CPU Ready Time (%RDY): This is the most critical ESXi-level metric. CPU Ready Time represents the percentage of time a VM was ready to run but could not be scheduled on a physical CPU. High %RDY times (often above 5% or 10%) can indicate that the VM's vCPUs are struggling to get scheduled efficiently, which happens when they are spread across multiple NUMA nodes (NUMA spanning).
- Excessive Remote Memory Access: When a VM consumes more vCPUs or memory than is available on a single physical NUMA node, a portion of its memory traffic becomes "remote." You can check this using the esxtop utility on the ESXi host.
Common misconfigurations
- Over-Sized VM: Allocating more vCPUs than the physical cores available in a single physical NUMA node or allocating more memory than the physical memory on a single NUMA node.
- Hot-Add Features: Enabling CPU Hot-Add or Memory Hot-Add can disable vNUMA (Virtual NUMA) for the VM, preventing the VMkernel from presenting an optimized NUMA topology to the guest OS.
- Incorrect Cores per Socket Setting: While vSphere 6.5 and later are smarter about vNUMA, configuring the Cores per Socket value manually in a way that doesn't align with the host's physical NUMA topology can still lead to poor scheduling and memory placement, particularly when licensing dictates a low number of virtual sockets.
- Setting VM Limits: Setting a memory limit on a VM that is lower than its configured memory can force the VMkernel to allocate the remaining memory from a remote NUMA node.
Check NUMA assignments in ESXi
- SSH into the ESXi node.
- Issue the esxtop command and press m for the memory view, then press f to edit the displayed fields and enable G for the NUMA statistics.
- You should be able to view the NUMA related information like NRMEM, NLMEM, and N%L.
- NRMEM (MB): NUMA Remote MEMory
- This is the current amount of a VM's memory (in MB) that is physically located on a remote NUMA node relative to where the VM's vCPUs are currently running.
- High NRMEM indicates NUMA locality issues, meaning the vCPUs must cross the high-speed interconnect (like Intel's QPI/UPI or AMD's Infinity Fabric) to access some of their data, which results in slower performance.
- NLMEM (MB): NUMA Local MEMory
- This is the current amount of a VM's memory (in MB) that is physically located on the local NUMA node, meaning it's on the same physical node as the vCPUs accessing it.
- The ESXi NUMA scheduler's goal is to maximize NLMEM to ensure fast memory access.
- N%L: NUMA % Locality
- This is the percentage of the VM's total memory that resides on the local NUMA node.
- A value close to 100% is ideal, indicating excellent memory locality. If this value drops below 80%, the VM may experience poor NUMA locality and potential performance issues due to slower remote memory access.
- Issue the esxtop command and press v to see the virtual machine screen.
- From the virtual machine screen note down the GID of the VM under consideration, and press q to exit the screen.
- Now issue the sched-stats -t numa-clients command. This will list the NUMA details of the VMs. Check the groupID column for the entry matching the GID of the VM.
- For example, the GID of the VM I am looking at is 7886858. This is a 112 vCPU VM running on an 8-socket physical host.
- You can see the VM is spread across NUMA nodes 0, 1, 2, and 3.
- The remoteMem is 0 for each of these NUMA nodes, which means the vCPUs are accessing only local memory on their NUMA nodes.
- To view the physical NUMA details of the ESXi host, you can use the sched-stats -t numa-pnode command. You can see this server has 8 NUMA nodes.
- To view the NUMA latency, you can use the sched-stats -t numa-latency command.
Verify NUMA node details at guest OS
- For a Windows guest, the easiest way is to go to Task Manager - Performance - CPU
- Right click on the CPU utilization graph and select Change graph to - NUMA nodes
- If there is only one NUMA node, you may notice the option is greyed out.
- To get detailed info you can consider using the sysinternals utility coreinfo64.
- To view NUMA related details from the Linux guest OS layer, you can use the following commands:
lscpu | grep -i NUMA
dmesg | grep -i NUMA
Best practices
- Right-Size VMs: Keep the vCPU count within the physical cores of a single NUMA node.
- Evenly Divide Resources: For monster/ wide VMs, ensure the total vCPUs are configured such that they are evenly divisible by the number of physical NUMA nodes they span.
- Example: If a VM needs 16 vCPUs on a host with 12-core NUMA nodes, configure the vCPUs so they divide evenly across the NUMA nodes the VM spans (e.g., 2 sockets x 8 cores per socket to create 2 vNUMA nodes, aligning with 2 pNUMA nodes).
- Cores per Socket Setting (Important for older vSphere/Licensing): While vSphere 6.5 and later automatically present an optimal vNUMA topology, you should still configure the Cores per Socket setting on the VM to create a vNUMA structure that aligns with the physical NUMA boundaries of the host. This helps the guest OS make better scheduling decisions.
- Disable VM CPU/ Memory Hot-Add: Plan capacity upfront.
References
- 64 Cores per NUMA Node Limit in Microsoft SQL Server: Recommendation for Efficiently Allocating Logical CPUs to SQL Server VMs on VMware vSphere - VMware Cloud Foundation (VCF) Blog
- Architecting Microsoft SQL Server on VMware vSphere | VMware
- Virtual Machine vCPU and vNUMA Rightsizing - Guidelines - VROOM! Performance Blog
- Setting corespersocket can affect guest OS topologies
- Performance Best Practices for VMware vSphere 7.0, Update 3
- Home - Flings (broadcom.com) (Virtual Machine Compute Optimizer)
- vSphere 7 Cores per Socket and Virtual NUMA - frankdenneman.nl
- How many NUMA Nodes do I have? – SQLpassion
- Coreinfo - Sysinternals | Microsoft Learn
Friday, September 6, 2024
Revisiting Storage Performance Benchmarking
A few years ago, I had the opportunity to explore the intricacies of storage performance benchmarking using tools like FIO, DISKSPD, and Iometer. Those studies provided valuable insights into the performance characteristics of various storage solutions, shaping my understanding and approach to storage performance analysis. As I prepare for an upcoming project in this domain, I find it essential to revisit my previous work, reflect on the lessons learned, and share my experiences. This blog post aims to provide a comprehensive overview of my benchmarking journey and the evolving landscape of storage performance studies.
Recent advancements
The field of storage technology has seen significant advancements in recent years. The rise of NVMe and storage-class memory technologies has also redefined high-end storage performance, offering unprecedented speed and efficiency. These advancements highlight the dynamic nature of storage performance benchmarking and underscore the importance of staying updated with the latest tools and methodologies.
Challenges
Benchmarking storage performance is not without its challenges. One of the primary difficulties is ensuring a consistent and controlled testing environment, as variations in hardware, software, and network conditions can significantly impact results. Another challenge is the selection of appropriate benchmarks that accurately reflect real-world workloads, which requires a deep understanding of the specific use cases and performance metrics. Additionally, interpreting the results can be complex, as it involves analyzing multiple metrics such as IOPS, throughput, and latency, and understanding their interplay. These challenges necessitate meticulous planning and a thorough understanding of both the benchmarking tools and the storage systems being tested.
Prior works
Following are some of the articles on storage benchmarking that I’ve published in the past:
- Benchmarking vSphere environment using HCIBench
- vSAN performance benchmarking considerations
- Stress test your storage system using Iometer
- Storage performance benchmarking of Kubernetes using FIO StatefulSet
- Benchmarking Kubernetes using K-Bench
Custom storage benchmarking framework
While there are numerous storage benchmarking tools available, such as VMFleet and HCIBench, I wanted to highlight a custom framework I developed a few years ago. Here are some reasons why we created this custom tool:
- Great learning experience: It provided valuable insights into how things work.
- Customization: Being a custom framework, it allows you to add or remove features as needed.
- Flexibility: You can modify multiple parameters to suit your requirements.
- Custom test profiles: You can create tailored storage test profiles.
- No IP assignment needed: There’s no need for IP assignment or DHCP for the stress test VMs.
- Centralized log collection: It offers centralized log collection for detailed analysis.
You can access the scripts and readme on my GitHub repository:
https://github.com/vineethac/vsan_cluster_storage_benchmarking_with_diskspd
Here is an overview.
- Profile Manifest: All storage test profiles are listed in profile_manifest.psd1. You can define as many profiles as you want.
- VM Template: A Windows VM template should be present in the vCenter server.
- Benchmarking Manifest: Details of vCenter, cluster name, VM template, number of stress test VMs per host, etc., are provided in benchmarking_manifest.psd1.
- Deploy Test VMs: deploy_test_vms.ps1 will deploy all the test VMs with pre-configured parameters.
- Start Stress Test: start_stress_test.ps1 will initiate the storage stress test process for all the profiles mentioned in profile_manifest.psd1 one by one.
- Log Collection: All log files will be automatically copied to a central location on the host from where these scripts are running.
- Cleanup: Use delete_test_vms.ps1 to clean up the stress test VMs from the cluster.
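As a rough sketch, the end-to-end workflow boils down to running the scripts in order from a PowerShell session with the PowerCLI modules installed; see the repository README for the exact parameters and prerequisites.
# Deploy the stress test VMs defined in benchmarking_manifest.psd1
./deploy_test_vms.ps1
# Run all test profiles from profile_manifest.psd1 one by one
./start_stress_test.ps1
# Clean up the stress test VMs when done
./delete_test_vms.ps1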
Note: These scripts were created about five years ago, and I haven’t had the opportunity to refactor them according to current best practices and new PowerShell scripting standards. I plan to enhance them in the coming months!
This overview should provide you with a clear understanding of the overall process and workflow involved in the storage benchmarking process. I hope it was useful. Cheers!
Sunday, January 30, 2022
vSphere with Tanzu using NSX-T - Part14 - Testing TKC storage using kubestr
In the previous posts we discussed the following:
- Part1 - Prerequisites
- Part2 - Configure NSX
- Part3 - Edge Cluster
- Part4 - Tier-0 Gateway and BGP peering
- Part5 - Tier-1 Gateway and Segments
- Part6 - Create tags, storage policy, and content library
- Part7 - Enable workload management
- Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
- Part9 - Monitoring
- Part10 - Upgrade Tanzu Kubernetes Cluster
- Part11 - Troubleshooting TKC
- Part12 - Deploy application on TKC and access it
- Part13 - Export WCP admin kubeconfig
This article is about using kubestr to test storage options of Tanzu Kubernetes Cluster (TKC). Following are the steps to install kubestr on macOS:
- wget https://github.com/kastenhq/kubestr/releases/download/v0.4.31/kubestr_0.4.31_MacOS_amd64.tar.gz
- tar -xvf kubestr_0.4.31_MacOS_amd64.tar.gz
- chmod +x kubestr
- mv kubestr /usr/local/bin
Now, let's run kubestr help.
% kubestr help
kubestr is a tool that will scan your k8s cluster
and validate that the storage systems in place as well as run
performance tests.
Usage:
kubestr [flags]
kubestr [command]
Available Commands:
browse Browse the contents of a CSI PVC via file browser
csicheck Runs the CSI snapshot restore check
fio Runs an fio test
help Help about any command
Flags:
-h, --help help for kubestr
-e, --outfile string The file where test results will be written
-o, --output string Options(json)
Use "kubestr [command] --help" for more information about a command.
I am going to use the following TKC for testing.
% KUBECONFIG=gc.kubeconfig kubectl get nodes
NAME STATUS ROLES AGE VERSION
gc-control-plane-pwngg Ready control-plane,master 103d v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-cz766 Ready <none> 103d v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-f6zqs Ready <none> 103d v1.20.9+vmware.1
gc-workers-wrknn-f675446b6-rsf6n Ready <none> 103d v1.20.9+vmware.1
Let's run kubestr against the cluster now.
% KUBECONFIG=gc.kubeconfig kubestr
**************************************
_ ___ _ ___ ___ ___ _____ ___
| |/ / | | | _ ) __/ __|_ _| _ \
| ' <| |_| | _ \ _|\__ \ | | | /
|_|\_\\___/|___/___|___/ |_| |_|_\
Explore your Kubernetes storage options
**************************************
Kubernetes Version Check:
Valid kubernetes version (v1.20.9+vmware.1) - OK
RBAC Check:
Kubernetes RBAC is enabled - OK
Aggregated Layer Check:
The Kubernetes Aggregated Layer is enabled - OK
W0130 14:17:16.937556 87541 warnings.go:70] storage.k8s.io/v1beta1 CSIDriver is deprecated in v1.19+, unavailable in v1.22+; use storage.k8s.io/v1 CSIDriver
Available Storage Provisioners:
csi.vsphere.xxxx.com:
Can't find the CSI snapshot group api version.
This is a CSI driver!
(The following info may not be up to date. Please check with the provider for more information.)
Provider: vSphere
Website: https://github.com/kubernetes-sigs/vsphere-csi-driver
Description: A Container Storage Interface (CSI) Driver for VMware vSphere
Additional Features: Raw Block,<br/><br/>Expansion (Block Volume),<br/><br/>Topology Aware (Block Volume)
Storage Classes:
* sc2-01-vc16c01-wcp-mgmt
To perform a FIO test, run-
./kubestr fio -s <storage class>
You can run storage tests using kubestr; it uses FIO to generate I/O. For example, this is how you can run a basic storage test:
% KUBECONFIG=gc.kubeconfig kubestr fio -s sc2-01-vc16c01-wcp-mgmt -z 10G
PVC created kubestr-fio-pvc-zvdhr
Pod created kubestr-fio-pod-kdbs5
Running FIO test (default-fio) on StorageClass (sc2-01-vc16c01-wcp-mgmt) with a PVC of Size (10G)
Elapsed time- 29.290421119s
FIO test results:
FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1
JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=3987.150391 BW(KiB/s)=15965
iops: min=3680 max=4274 avg=3992.034424
bw(KiB/s): min=14720 max=17096 avg=15968.827148
JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=3562.628906 BW(KiB/s)=14267
iops: min=3237 max=3750 avg=3565.896484
bw(KiB/s): min=12950 max=15000 avg=14264.862305
JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=2988.549316 BW(KiB/s)=383071
iops: min=2756 max=3252 avg=2992.344727
bw(KiB/s): min=352830 max=416256 avg=383056.187500
JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=2754.796143 BW(KiB/s)=353151
iops: min=2480 max=2992 avg=2759.586182
bw(KiB/s): min=317440 max=382976 avg=353242.781250
Disk stats (read/write):
sdd: ios=117160/105647 merge=0/1210 ticks=2100090/2039676 in_queue=4139076, util=99.608589%
- OK
As you can see, a 10G PVC and an FIO pod are created, and these are used for the FIO test. Once the test is complete, the PVC and the FIO pod are deleted automatically.
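Beyond FIO runs, the help output above also lists a csicheck subcommand that validates CSI snapshot and restore behavior. This is a hedged example: the VolumeSnapshotClass name is a placeholder and must exist in your cluster, and you should confirm the flags with kubestr csicheck --help.
% KUBECONFIG=gc.kubeconfig kubestr csicheck -s sc2-01-vc16c01-wcp-mgmt -v <volume-snapshot-class>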
I hope it was useful. Cheers!


