vineethac.blogspot.com

Friday, November 14, 2025

Working with GPUs - Part2 - Memory fault indicators

AI workloads rely heavily on GPU memory reliability and, memory faults can silently degrade performance long before a GPU outright fails. Understanding which signals truly indicate GPU memory issues, and how to act on them is essential for stable operations at scale. This post focuses on authoritative memory fault indicators on Nvidia H100 GPUs, how they differ, and how to use them together to make correct operational decisions.

HBM3 (High Bandwidth Memory 3) memory on H100 delivers massive bandwidth, but it operates under extreme thermal, electrical, and utilization stress. When memory reliability starts degrading:

Model training can fail intermittently
NCCL performance may collapse unexpectedly
Silent data corruption risks increase
Faulty GPUs can impact entire multi‑GPU jobs

Early detection lets you drain, reset, isolate, or RMA a GPU before a customer‑visible incident occurs.

Primary indicators

ECC errors

ECC (Error Correcting Code) detects and reports bit‑level memory errors in HBM3.

ECC error types:

Correctable Errors (CE)

Single‑bit errors fixed automatically
Indicate beginning of memory stress or aging

Uncorrectable Errors (UE)

Multi‑bit errors not recoverable
High risk of data corruption
Immediate GPU isolation required

nvidia-smi -q -d ECC

nvidia-smi --query-gpu=index,name,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total --format=csv

nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total --format=csv

Notes

Rising CE counts is an early warning. Monitor closely.
Any UE count > 0, then drain workload, isolate GPU, and proceed to fix it.

Remapped Rows

Row Remapping is a hardware healing mechanism in H100 HBM3.

When the GPU identifies failing memory rows:

Faulty rows are permanently retired
Spare rows are mapped in their place

nvidia-smi -q -d ROW_REMAPPER

Notes

Remapped Rows Pending = Yes

GPU detected bad memory rows
Reset required to complete remap

Remapped Rows > 0

Hardware has already consumed spare memory
Strong early RMA signal

Row remapping is the earliest and strongest indicator of degrading HBM3 memory; often appearing before serious ECC failures.

Hope it was useful. Cheers!

Saturday, October 18, 2025

Working with GPUs - Part1 - Using nvidia-smi

GPUs are the backbone of modern AI and HPC clusters, and understanding their basic health and configuration is the first step toward reliable operations. In this first post of the series, we start with nvidia-smi - the primary tool for discovering Nvidia GPUs, validating drivers and CUDA versions, and performing essential health checks. These fundamentals form the baseline for monitoring, performance benchmarking, and troubleshooting GPU compute nodes at scale.

Verify version

nvidia-smi --version

List all GPUs

nvidia-smi -L

Current state of every GPU

nvidia-smi

Following are the key observations from the above output:

All 8 GPUs detected (NVIDIA H100 80GB HBM3). Confirms full hardware enumeration.
HBM3 memory present (80GB per GPU). Validates expected SKU (H100 SXM vs PCIe). This is important because SXM GPUs behave differently in power, cooling, and NVLink bandwidth; troubleshooting playbooks differ by form factor.
Driver version 550.90.07 with CUDA compatibility 12.4. This confirms a Hopper‑supported, production‑ready driver stack. Many issues (NCCL failures, DCGM errors, framework crashes) trace back to unsupported driver–CUDA combinations.
Persistence Mode: On. This avoids GPU driver reinitialization delays and flaky behavior between jobs. Turning this off in clusters can cause intermittent job start failures or longer warm‑up times.
Temperatures in 34–41 °C range at idle. This indicates healthy cooling and airflow. High idle temperatures point to heatsink issues, airflow obstructions, fan/BMC problems, or thermal paste degradation.
Performance State: P0 (highest performance). This shows GPUs are not power or thermally‑throttled. If GPUs remain in lower P‑states under load, suspect thermal limits, power caps, or firmware misconfigurations.
Power usage ~70–76 W with cap at 700 W. This confirms ample power headroom and no throttling. GPUs hitting the power cap during load may show reduced performance even when utilization appears high.
GPU utilization at 0% and no running processes. This confirms the node is idle and clean. Useful to rule out “ghost” workloads, leaked CUDA contexts, or stuck processes when diagnosing performance drops.
Memory usage ~1 MiB per GPU. Only driver bookkeeping allocations present. Any significant memory use at idle suggests leftover processes or failed container teardown.
Volatile Uncorrected ECC errors: 0. Confirms memory integrity. Any non‑zero uncorrected ECC errors are serious and usually justify isolating the GPU and starting RMA/vendor diagnostics.
MIG mode: Disabled. Ensures full GPU and NVLink bandwidth availability. MIG partitions can severely impact NCCL and large‑model training if enabled unintentionally.
Compute mode: Default. Allows multiple processes (expected in shared clusters). Exclusive modes can cause unexpected job failures or scheduling issues.
Fan: N/A (SXM platform). Normal for chassis‑controlled cooling. Fan values appearing unexpectedly may indicate incorrect sensor readings or platform misidentification.

Health metrics of all GPUs

nvidia-smi -q

This shows details like:

Serial Number
VBIOS Version
GPU Part Number
Utilization
ECC Errors
Temperature, etc.

Query GPU health metrics

Help: nvidia-smi --help-query-gpu

GPU memory usage and utilization: nvidia-smi --query-gpu=index,name,uuid,driver_version,memory.total,memory.used,utilization.gpu --format=csv

GPU temperature status: nvidia-smi --query-gpu=index,name,uuid,temperature.gpu,temperature.gpu.tlimit,temperature.memory --format=csv

GPU reset state: nvidia-smi --query-gpu=index,name,uuid,reset_status.reset_required,reset_status.drain_and_reset_recommended --format=csv

reset_status.reset_required - indicates whether the GPU must be reset to return to a clean operational state.
reset_status.drain_and_reset_recommended - Yes, indicates GPU/ node should be drained first, then reset. No, indicates reset can be done immediately.
Note: In production GPU clusters based on Kubernetes, the safest and recommended practice is to always drain the node before attempting GPU recovery. For H100 SXM systems, recovery is performed via node reboot, not individual GPU resets.

NVLink topology

nvidia-smi topo -m

Note: Any non‑NVLink GPU‑to‑GPU path on H100 SXM immediately explains poor NCCL performance and requires hardware correction.

nvidia-smi nvlink -s # shows per direction (Tx or Rx) bandwidth of all nvlinks of all GPUs

nvidia-smi nvlink -s -i 0 # shows per direction (Tx or Rx) bandwidth of all nvlinks of the GPU 0

Hope it was useful. Cheers!

Friday, August 15, 2025

Understanding NUMA: Its Impact on VM Performance in ESXi

VMware ESXi hosts use Non-Uniform Memory Access (NUMA) architecture to optimize CPU and memory locality. Each NUMA node consists of a subset of CPUs and memory. Accessing local memory within the same NUMA node is significantly faster than remote memory access. Misaligned NUMA configurations can lead to latency spikes, increased CPU Ready Time, and degraded VM performance.

Key symptoms

The common symptoms for Virtual Machines (VMs) on ESXi that have a misconfigured or misaligned Non-Uniform Memory Access (NUMA) configuration primarily manifest as performance degradation and latency. The main issue caused by NUMA misalignment is that the VM's vCPUs end up frequently having to access memory that belongs to a different physical NUMA node on the ESXi host (known as Remote Access), which is significantly slower than accessing local memory.

The resulting symptoms for the VM include:

Overall Slowness and Unresponsiveness: Services and applications running inside the guest OS may respond slowly or intermittently. The entire VM can feel sluggish.

High CPU Ready Time (%RDY): This is the most critical ESXi-level metric. CPU Ready Time represents the percentage of time a VM was ready to run but could not be scheduled on a physical CPU. High %RDY times (often above 5% or 10%) can indicate that the VM's vCPUs are struggling to get scheduled efficiently, which happens when they are spread across multiple NUMA nodes (NUMA spanning).

Excessive Remote Memory Access: When a VM consumes more vCPUs or memory than is available on a single physical NUMA node, a portion of its memory traffic becomes "remote." You can check this using the esxtop utility on the ESXi host.

Common misconfigurations

Misalignment often occurs when the VM's vCPU and memory settings exceed the resources of a single physical NUMA node on the host. Common causes include:

Over-Sized VM: Allocating more vCPUs than the physical cores available in a single physical NUMA node or allocating more memory than the physical memory on a single NUMA node.

Hot-Add Features: Enabling CPU Hot-Add or Memory Hot-Add can disable vNUMA (Virtual NUMA) for the VM, preventing the VMkernel from presenting an optimized NUMA topology to the guest OS.

Incorrect Cores per Socket Setting: While vSphere 6.5 and later are smarter about vNUMA, configuring the Cores per Socket value manually in a way that doesn't align with the host's physical NUMA topology can still lead to poor scheduling and memory placement, particularly when licensing dictates a low number of virtual sockets.

Setting VM Limits: Setting a memory limit on a VM that is lower than its configured memory can force the VMkernel to allocate the remaining memory from a remote NUMA node.

Check NUMA assignments in ESXi

SSH into the ESXi node.
Issue the esxtop command and press m for memory view, then press f to enable the fields, G to enable NUMA information.

You should be able to view the NUMA related information like NRMEM, NLMEM, and N%L.

NRMEM (MB): NUMA Remote MEMory

This is the current amount of a VM's memory (in MB) that is physically located on a remote NUMA node relative to where the VM's vCPUs are currently running.
High NRMEM indicates NUMA locality issues, meaning the vCPUs must cross the high-speed interconnect (like Intel's QPI/UPI or AMD's Infinity Fabric) to access some of their data, which results in slower performance.

NLMEM (MB): NUMA Local MEMory

This is the current amount of a VM's memory (in MB) that is physically located on the local NUMA node, meaning it's on the same physical node as the vCPUs accessing it.
The ESXi NUMA scheduler's goal is to maximize NLMEM to ensure fast memory access.

N%L: NUMA % Locality

This is the percentage of the VM's total memory that resides on the local NUMA node.
A value close to 100% is ideal, indicating excellent memory locality. If this value drops below 80%, the VM may experience poor NUMA locality and potential performance issues due to slower remote memory access.

Issue the esxtop command and press v to see the virtual machine screen.
From the virtual machine screen note down the GID of the VM under consideration, and press q to exit the screen.
Now issue the sched-stats -t numa-clients command. This will list down NUMA details of the VM. Check the groupID column to match the GID of the VM.
For example, the GID of the VM I am looking at is 7886858. This is a 112 CPU VM which is running on an 8-socket physical host.

You can see the VM is spread/ placed under NUMA nodes 0, 1, 2, and 3.
The remoteMem is 0, for each of these NUMA nodes, which means they are accessing all the local memory of the NUMA node.
To view physical NUMA details of the ESXI you can use sched-stats -t numa-pnode command. You can see this server has 8 NUMA nodes.

To view the NUMA latency, you can use the sched-stats -t numa-latency command.

Verify NUMA node details at guest OS

Windows

Easiest way is to go to Task Manager - Performance - CPU

Right click on the CPU utilization graph and select Change graph to - NUMA nodes
If there only one NUMA node, you may notice the option as greyed out.

To get detailed info you can consider using the sysinternals utility coreinfo64.

Linux

To view NUMA related details from the Linux guest OS layer, you can use the following commands:

lscpu | grep -i NUMA

dmesg | grep -i NUMA

Remediation

The most common remediation steps for fixing Non-Uniform Memory Access (NUMA) related performance issues in ESXi VMs revolve around right-sizing the VM to align its resources with the physical NUMA boundaries of the host.

The primary goal is to minimize Remote Memory Access (NRMEM) and maximize Local Memory Access (N%L). The vast majority of NUMA issues stem from a VM's resource allocation crossing a physical NUMA node boundary.

Right-Size VMs: Keep vCPU count within physical cores of a single NUMA node.
Evenly Divide Resources: For monster/ wide VMs, ensure the total vCPUs are configured such that they are evenly divisible by the number of physical NUMA nodes they span.

Example: If a VM needs 16 vCPUs on a host with 12-core NUMA nodes, configure the vCPUs to be a multiple of a NUMA node count (e.g., 2 sockets $\times$ 8 cores per socket to create 2 vNUMA nodes, aligning with 2 pNUMA nodes).

Cores per Socket Setting (Important for older vSphere/Licensing): While vSphere 6.5 and later automatically present an optimal vNUMA topology, you should still configure the Cores per Socket setting on the VM to create a vNUMA structure that aligns with the physical NUMA boundaries of the host. This helps the guest OS make better scheduling decisions.
Disable VM CPU/ Memory Hot-Add: Plan capacity upfront.

NUMA awareness is critical for troubleshooting and optimizing VM performance on ESXi. Misconfigured NUMA placements can severely impact latency-sensitive workloads like databases and analytics. Regular checks at both the hypervisor and guest OS layers ensure memory locality, reduce latency, and improve efficiency.

References

Hope it was useful. Cheers!

Saturday, July 12, 2025

Troubleshooting ESXi PSOD: A Quick Guide for SREs

When an ESXi host hits a Purple Screen of Death (PSOD), it’s more than just a crash - it’s a signal that something critical needs attention. Here’s how to handle it effectively.

What happens during a PSOD?

The ESXi server displays a purple diagnostic screen.
You’ll see alerts/ incidents for host connectivity, FC/ Ethernet link down, and related alarms.
The console screen confirms the purple screen.

Immediate actions

Capture screenshots of the PSOD from the console screen.
Check server hardware health via the out-of-band management interface like iDRAC/ RMC/ BMC.
Observe if the host is stuck or rebooting repeatedly.

If ESXi reloads successfully, immediately place the node in Maintenance Mode via vCenter.
If it keeps crashing, try to capture all PSOD instances.

Collecting logs

Generate a support bundle from vCenter once the host is online.
Collect server hardware logs.

Engage support

Broadcom/ VMware: Share PSOD screenshots and ESXi support bundle for RCA.
Hardware vendor: Attach server hardware logs, screenshots, and context for analysis.

Analyze crash dumps

Look for these keywords in the core dump logs: BlueScreen, Backtrace, Exception
In the ESXi support bundle you will find the crash dump logs under /var/core directory.
Analyzing the core dump files should help you find the root cause of the PSOD event. It could be due to some hardware issue, bugs in ESXi hypervisor, faults in device firmware or drivers, etc.
You may notice many vmkernel-zdump files, and to quickly filter out all the BlueScreen events, you can use the following PowerShell code snippet.

------------------------------------------------------------------

param(
    [string]$directoryPath,
    [string[]]$keywords
)

# Function to search for keywords in files
function Search-Files {
    param (
        [string]$path,
        [string[]]$keywords
    )

    # Get all files in the directory and subdirectories
    $files = Get-ChildItem -Path $path -Recurse -File

    # Loop through each file
    foreach ($file in $files) {
        # Read the content of the file
        $content = Get-Content -Path $file.FullName

        # Loop through each line in the file content
        foreach ($line in $content) {
            # Check if the line contains any of the keywords (case-insensitive)
            foreach ($keyword in $keywords) {
                if ($line -match "(?i)$keyword") {
                    # Print the file name and the matching line
                    Write-Output "File: $($file.FullName)"
                    Write-Output "Line: $line"
                    Add-Content -path out.txt -value $line
                    break
                }
            }
        }
    }
}

# Call the function with the provided parameters
Search-Files -path $directoryPath -keywords $keywords

------------------------------------------------------------------

Save the above code snippet to a .ps1 file (example: find.ps1) and you can run it as follows:

> .\find.ps1 -directoryPath "C:\esx-esxi1.xre.com-2025-04-18--13.05-2106842\var\core" -keywords "bluescreen"
> .\find.ps1 -directoryPath "C:\esx-esxi1.xre.com-2025-04-18--13.05-2106842\var\core" -keywords "bluescreen", "#PF Exception"

All the log lines that include the given keyword or keywords will be saved to out.txt file.
A sample output of the above-mentioned code snippet against an ESXi core dump is given below.

Once you identify the root cause of the PSOD event, you can start working towards the resolution which may involve replacing a faulty hardware component, updating firmware/ driver/ ESXi, etc.

References

Hope it was useful. Cheers!

Saturday, April 26, 2025

Azure AI Foundry - Part4 - Deploy and use a generative AI model

Azure AI Foundry supports deploying large language models (LLMs). In this article, we will see how to deploy a model and use it.

Azure AI Foundry Portal

Select your project - My assets - Models + endpoints - Deploy model
Click Deploy base model
Select the model you want to deploy (here I am selecting gpt-4.1) and click Confirm

You can see the deployment details like capacity (token per minute), resource location etc.
Click on Create resource and deploy

Now it will start creating the resource and this step may take a minute or so.
Once it is done, it will take you to the following page where you can see the mode details on the model you just deployed.

Click on Open in playground to test the model.
Once the chat playground is open, you will see your deployment, and under that you will see a section where you can give the model instructions and context. An example is given in the following screenshot. Once the model instructions and context are provided make sure to click Apply changes button.
Now you can click on Generate prompt, provide the query and click on Send.
You can also set values for limiting the maximum output token for the model response, temperature, frequency penalty etc. under the Parameters section.

A sample response is provided in the following screenshot.

To see the sample code, you can click on View code.

You can also see code samples and authentication using API key as shown below.

Metrics (total requests, token count, etc.) related to your LLM model deployment can be found on the following page.

Python

Sample code to interact with the model can be found in my GitHub repo.

Hope it was useful. Cheers!

Azure AI Foundry Blog Series

Azure AI Foundry is a comprehensive suite of tools and services designed to accelerate the development and deployment of AI solutions on the Azure platform. Throughout this blog series, we will cover various aspects of Azure AI Foundry.

Part1 - Create project
Part2 - Language translation using AI Services
Part3 - Abstractive text summarization
Part4 - Deploy and use a generative AI model

Tuesday, April 22, 2025

Azure AI Foundry - Part3 - Abstractive text summarization

In this article, I will show you how to use Azure Cognitive Services for text summarization.

Azure AI Foundry portal

AI Services - Language + Translator

Summarize Information - Summarize text

Select a connected AI service resource or create a new one.

Playgrounds - Summarize Information - Summarize text

Python

Sample code to summarize a PDF can be found in my GitHub repo. Following is an example of a resume summary:

Hope this was useful. Cheers!

Pages

Friday, November 14, 2025

Primary indicators

ECC errors

Remapped Rows

Saturday, October 18, 2025

Verify version

nvidia-smi --version

List all GPUs

Current state of every GPU

Health metrics of all GPUs

Query GPU health metrics

NVLink topology

Friday, August 15, 2025

Key symptoms

Common misconfigurations

Check NUMA assignments in ESXi

Verify NUMA node details at guest OS

Remediation

References

Saturday, July 12, 2025

What happens during a PSOD?

Immediate actions

Collecting logs

Engage support

Analyze crash dumps

References

Saturday, April 26, 2025

Azure AI Foundry Portal

Python

Tuesday, April 22, 2025

Azure AI Foundry portal

Python