vineethac.blogspot.com

A blog on the evolving infrastructure stack - Virtualization, Kubernetes, and GPUs.

Saturday, February 28, 2026

Working with GPUs - Part4 - Thermal issues

During routine diagnostics using the dcgmi diag -r 3 test suite, several GPU nodes began failing due to thermal throttling. This blog po...
Sunday, January 25, 2026

Working with GPUs - Part3 - Using dcgmi

The NVIDIA Data Center GPU Manager (DCGM) is a lightweight agent that performs several functions, such as GPU behavior monitoring, health and di...
Saturday, November 15, 2025

Working with GPUs - Part2 - Memory fault indicators

AI workloads rely heavily on GPU memory reliability, and memory faults can silently degrade performance long before a GPU outright fails. Un...
Saturday, October 18, 2025

Working with GPUs - Part1 - Using nvidia-smi

GPUs are the backbone of modern AI and HPC clusters, and understanding their basic health and configuration is the first step toward reliabl...
Saturday, August 16, 2025

Understanding NUMA: Its Impact on VM Performance in ESXi

VMware ESXi hosts use Non-Uniform Memory Access (NUMA) architecture to optimize CPU and memory locality. Each NUMA node consists of a subset...