vineethac.blogspot.com: nvidia-smi

Friday, November 14, 2025

Working with GPUs - Part2 - Memory fault indicators

AI workloads rely heavily on GPU memory reliability and, memory faults can silently degrade performance long before a GPU outright fails. Understanding which signals truly indicate GPU memory issues, and how to act on them is essential for stable operations at scale. This post focuses on authoritative memory fault indicators on Nvidia H100 GPUs, how they differ, and how to use them together to make correct operational decisions.

HBM3 (High Bandwidth Memory 3) memory on H100 delivers massive bandwidth, but it operates under extreme thermal, electrical, and utilization stress. When memory reliability starts degrading:

Model training can fail intermittently
NCCL performance may collapse unexpectedly
Silent data corruption risks increase
Faulty GPUs can impact entire multi‑GPU jobs

Early detection lets you drain, reset, isolate, or RMA a GPU before a customer‑visible incident occurs.

Primary indicators

ECC errors

ECC (Error Correcting Code) detects and reports bit‑level memory errors in HBM3.

ECC error types:

Correctable Errors (CE)

Single‑bit errors fixed automatically
Indicate beginning of memory stress or aging

Uncorrectable Errors (UE)

Multi‑bit errors not recoverable
High risk of data corruption
Immediate GPU isolation required

nvidia-smi -q -d ECC

nvidia-smi --query-gpu=index,name,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total --format=csv

nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total --format=csv

Notes

Rising CE counts is an early warning. Monitor closely.
Any UE count > 0, then drain workload, isolate GPU, and proceed to fix it.

Remapped Rows

Row Remapping is a hardware healing mechanism in H100 HBM3.

When the GPU identifies failing memory rows:

Faulty rows are permanently retired
Spare rows are mapped in their place

nvidia-smi -q -d ROW_REMAPPER

Notes

Remapped Rows Pending = Yes

GPU detected bad memory rows
Reset required to complete remap

Remapped Rows > 0

Hardware has already consumed spare memory
Strong early RMA signal

Row remapping is the earliest and strongest indicator of degrading HBM3 memory; often appearing before serious ECC failures.

Hope it was useful. Cheers!

Saturday, October 18, 2025

Working with GPUs - Part1 - Using nvidia-smi

GPUs are the backbone of modern AI and HPC clusters, and understanding their basic health and configuration is the first step toward reliable operations. In this first post of the series, we start with nvidia-smi - the primary tool for discovering Nvidia GPUs, validating drivers and CUDA versions, and performing essential health checks. These fundamentals form the baseline for monitoring, performance benchmarking, and troubleshooting GPU compute nodes at scale.

Verify version

nvidia-smi --version

List all GPUs

nvidia-smi -L

Current state of every GPU

nvidia-smi

Following are the key observations from the above output:

All 8 GPUs detected (NVIDIA H100 80GB HBM3). Confirms full hardware enumeration.
HBM3 memory present (80GB per GPU). Validates expected SKU (H100 SXM vs PCIe). This is important because SXM GPUs behave differently in power, cooling, and NVLink bandwidth; troubleshooting playbooks differ by form factor.
Driver version 550.90.07 with CUDA compatibility 12.4. This confirms a Hopper‑supported, production‑ready driver stack. Many issues (NCCL failures, DCGM errors, framework crashes) trace back to unsupported driver–CUDA combinations.
Persistence Mode: On. This avoids GPU driver reinitialization delays and flaky behavior between jobs. Turning this off in clusters can cause intermittent job start failures or longer warm‑up times.
Temperatures in 34–41 °C range at idle. This indicates healthy cooling and airflow. High idle temperatures point to heatsink issues, airflow obstructions, fan/BMC problems, or thermal paste degradation.
Performance State: P0 (highest performance). This shows GPUs are not power or thermally‑throttled. If GPUs remain in lower P‑states under load, suspect thermal limits, power caps, or firmware misconfigurations.
Power usage ~70–76 W with cap at 700 W. This confirms ample power headroom and no throttling. GPUs hitting the power cap during load may show reduced performance even when utilization appears high.
GPU utilization at 0% and no running processes. This confirms the node is idle and clean. Useful to rule out “ghost” workloads, leaked CUDA contexts, or stuck processes when diagnosing performance drops.
Memory usage ~1 MiB per GPU. Only driver bookkeeping allocations present. Any significant memory use at idle suggests leftover processes or failed container teardown.
Volatile Uncorrected ECC errors: 0. Confirms memory integrity. Any non‑zero uncorrected ECC errors are serious and usually justify isolating the GPU and starting RMA/vendor diagnostics.
MIG mode: Disabled. Ensures full GPU and NVLink bandwidth availability. MIG partitions can severely impact NCCL and large‑model training if enabled unintentionally.
Compute mode: Default. Allows multiple processes (expected in shared clusters). Exclusive modes can cause unexpected job failures or scheduling issues.
Fan: N/A (SXM platform). Normal for chassis‑controlled cooling. Fan values appearing unexpectedly may indicate incorrect sensor readings or platform misidentification.

Health metrics of all GPUs

nvidia-smi -q

This shows details like:

Serial Number
VBIOS Version
GPU Part Number
Utilization
ECC Errors
Temperature, etc.

Query GPU health metrics

Help: nvidia-smi --help-query-gpu

GPU memory usage and utilization: nvidia-smi --query-gpu=index,name,uuid,driver_version,memory.total,memory.used,utilization.gpu --format=csv

GPU temperature status: nvidia-smi --query-gpu=index,name,uuid,temperature.gpu,temperature.gpu.tlimit,temperature.memory --format=csv

GPU reset state: nvidia-smi --query-gpu=index,name,uuid,reset_status.reset_required,reset_status.drain_and_reset_recommended --format=csv

reset_status.reset_required - indicates whether the GPU must be reset to return to a clean operational state.
reset_status.drain_and_reset_recommended - Yes, indicates GPU/ node should be drained first, then reset. No, indicates reset can be done immediately.
Note: In production GPU clusters based on Kubernetes, the safest and recommended practice is to always drain the node before attempting GPU recovery. For H100 SXM systems, recovery is performed via node reboot, not individual GPU resets.

NVLink topology

nvidia-smi topo -m

Note: Any non‑NVLink GPU‑to‑GPU path on H100 SXM immediately explains poor NCCL performance and requires hardware correction.

nvidia-smi nvlink -s # shows per direction (Tx or Rx) bandwidth of all nvlinks of all GPUs

nvidia-smi nvlink -s -i 0 # shows per direction (Tx or Rx) bandwidth of all nvlinks of the GPU 0