vineethac.blogspot.com: Working with GPUs - Part2

AI workloads rely heavily on GPU memory reliability, and memory faults can silently degrade performance long before a GPU outright fails. Understanding which signals truly indicate GPU memory issues, and how to act on them, is essential for stable operations at scale. This post focuses on authoritative memory fault indicators on NVIDIA H100 GPUs, how they differ, and how to use them together to make correct operational decisions.

HBM3 (High Bandwidth Memory 3) memory on H100 delivers massive bandwidth, but it operates under extreme thermal, electrical, and utilization stress. When memory reliability starts degrading:

Model training can fail intermittently
NCCL performance may collapse unexpectedly
Silent data corruption risks increase
Faulty GPUs can impact entire multi‑GPU jobs

Early detection lets you drain, reset, isolate, or RMA a GPU before a customer‑visible incident occurs.

Primary indicators

ECC errors

ECC (Error Correcting Code) detects, corrects where possible, and reports bit‑level memory errors in DRAM and SRAM. An H100 SXM GPU has several memory types:

HBM3 (80 GB off-chip) - DRAM
Shared L2 cache (50 MB on the GPU die) - SRAM
L1 cache and shared memory within an SM - SRAM
Registers within an SM - SRAM

ECC error types:

Correctable Errors (CE)

Single‑bit errors fixed automatically
Indicate beginning of memory stress or aging

Uncorrectable Errors (UE)

Multi‑bit errors not recoverable
High risk of data corruption
Immediate GPU isolation required

nvidia-smi -q -d ECC

As per NVIDIA documentation, the RMA criteria for SRAM uncorrectable errors are met when the SRAM Threshold Exceeded flag is set. Either of the following events will trigger the flag:

More than 4 UCE Unique Count events within an address bank for parity protected SRAMs.
More than 2 UCE Unique Count events within an address bank for SECDED ECC protected SRAMs.
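The two thresholds above amount to a strict "greater than" check per address bank. The flag itself is what nvidia-smi reports; the Python sketch below only restates the rule. The function name and the per-bank dictionaries are hypothetical inputs for illustration, not something nvidia-smi prints in this form.

```python
# Limits taken from the two rules above. A bank trips the flag when its
# unique UCE count is strictly greater than the limit.
PARITY_LIMIT = 4   # parity protected SRAMs
SECDED_LIMIT = 2   # SECDED ECC protected SRAMs


def sram_threshold_exceeded(parity_uce_by_bank, secded_uce_by_bank):
    """Return True if any bank exceeds its unique-UCE limit.

    Both arguments are hypothetical dicts mapping a bank label to its
    unique UCE count. They are illustration inputs only.
    """
    parity_hit = any(c > PARITY_LIMIT for c in parity_uce_by_bank.values())
    secded_hit = any(c > SECDED_LIMIT for c in secded_uce_by_bank.values())
    return parity_hit or secded_hit
```

Note that a count equal to the limit (4 for parity, 2 for SECDED) does not trip the check; only "more than" does.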

nvidia-smi --query-gpu=index,name,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total --format=csv

nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total --format=csv
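For monitoring, the CSV output of the two queries above is easy to parse. Here is a minimal Python sketch, assuming the usual `--format=csv` layout of one header line followed by one comma-separated row per GPU. The sample text uses a subset of the columns with made-up counts, and the function names are hypothetical.

```python
import csv
import io

# Illustrative sample only: the counts are made up, not captured from a real GPU.
SAMPLE_CSV = """\
index, name, ecc.errors.corrected.volatile.dram, ecc.errors.corrected.volatile.sram, ecc.errors.corrected.volatile.total
0, NVIDIA H100 80GB HBM3, 0, 0, 0
1, NVIDIA H100 80GB HBM3, 12, 1, 13
2, NVIDIA H100 80GB HBM3, [N/A], [N/A], [N/A]
"""


def parse_ecc_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` text into a list of dicts.

    ECC counter columns become int, or None when the value is not numeric.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    header = [name.strip() for name in rows[0]]
    gpus = []
    for row in rows[1:]:
        record = {}
        for key, value in zip(header, row):
            value = value.strip()
            if key.startswith("ecc.errors."):
                record[key] = int(value) if value.isdigit() else None
            else:
                record[key] = value
        gpus.append(record)
    return gpus


def gpus_with_errors(gpus):
    """Return the index of every GPU with at least one non-zero ECC counter."""
    flagged = []
    for gpu in gpus:
        counters = [v for k, v in gpu.items() if k.startswith("ecc.errors.")]
        if any(isinstance(v, int) and v > 0 for v in counters):
            flagged.append(gpu["index"])
    return flagged
```

On a host with the driver installed, the input text could come from `subprocess.run(["nvidia-smi", "--query-gpu=...", "--format=csv"], capture_output=True, text=True, check=True).stdout` instead of the embedded sample.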

Notes

Rising CE counts are an early warning. Monitor closely.
If any UE count is greater than 0, drain the workload and reboot or isolate the GPU. If possible, use GPUs that already have a history of UCE events only for low-priority jobs.
The RMA criteria are met when the SRAM Threshold Exceeded flag is set to Yes.
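These notes can be read as a small decision rule. Below is a rough Python sketch of one possible reading. The function name, the returned labels, and the order of precedence (RMA flag first, then UE, then CE trend) are assumptions made for illustration, not an NVIDIA-defined policy.

```python
def ecc_action(ce_previous, ce_current, ue_count, sram_threshold_exceeded):
    """Map ECC readings for one GPU to a suggested action label.

    ce_previous / ce_current: correctable error totals from two samples.
    ue_count: current uncorrectable error total.
    sram_threshold_exceeded: True when the flag reads "Yes".
    All labels returned here are hypothetical names.
    """
    if sram_threshold_exceeded:
        return "rma"                # RMA criteria met
    if ue_count > 0:
        return "drain-and-isolate"  # any UE: drain workload, reboot or isolate
    if ce_current > ce_previous:
        return "monitor"            # rising CE count is an early warning
    return "ok"
```

Volatile counters reset when the driver reloads, so a current CE value lower than the previous one is treated here as "ok" rather than as an error.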

Remapped Rows

Row Remapping is a hardware healing mechanism in H100 HBM3.

When the GPU identifies failing memory rows:

Faulty rows are permanently retired
Spare rows are mapped in their place

nvidia-smi -q -d ROW_REMAPPER
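The text output of this command can be turned into a dictionary per GPU for alerting. The Python sketch below assumes the output uses indented `label : value` lines under a `GPU <bus id>` header, as in the embedded sample. The sample's labels and numbers are illustrative and may differ between driver versions, and the function name is hypothetical.

```python
import re

# Illustrative sample only: layout assumed, numbers made up.
SAMPLE = """\
GPU 00000000:1B:00.0
    Remapped Rows
        Correctable Error                 : 2
        Uncorrectable Error               : 1
        Pending                           : Yes
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 637 bank(s)
            High                          : 2 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
"""

# A "label : value" line; the spaces around the colon keep bus IDs such as
# 00000000:1B:00.0 from being split.
KEY_VALUE = re.compile(r"^\s*(.+?)\s+:\s+(.*)$")


def parse_row_remapper(text):
    """Return {bus_id: {label: value}}; values with a leading number become int."""
    gpus = {}
    current = None
    for line in text.splitlines():
        if line.startswith("GPU "):
            current = gpus.setdefault(line.split()[1], {})
            continue
        match = KEY_VALUE.match(line)
        if match is None or current is None:
            continue  # section titles and lines before the first GPU header
        key, value = match.group(1), match.group(2).strip()
        number = re.match(r"\d+", value)
        current[key] = int(number.group()) if number else value
    return gpus
```

With the sample above, `parse_row_remapper(SAMPLE)["00000000:1B:00.0"]["Pending"]` evaluates to `"Yes"`, and the histogram buckets come back as plain integers.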

Notes

Remapped Rows Pending = Yes

GPU detected bad memory rows
Reset required to complete remap

Remapped Rows > 0

Hardware has already consumed spare memory
Strong early RMA signal

The RMA criteria are met when the row-remapping failure flag is set:

Remapping Failure Occurred: Yes

The Bank Remap Availability Histogram provides more insight.

None > 0 means one or more banks have no spare rows left, so a further row remap attempt in those banks cannot be completed and will fail. This is an RMA indicator.

Ideally, push for RMA proactively when a GPU has memory banks in the Low bucket, rather than waiting until a row remap fails.

A rising remapped-row count, together with a rising number of banks in the Partial and Low buckets of the bank remap availability histogram, is the earliest and strongest indicator of degrading HBM3 memory; it often appears before serious ECC failures.
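The row-remapping signals above can be combined into a single triage step. Below is a minimal Python sketch of one possible ordering. The `RemapStatus` fields, the function name, and the action labels are hypothetical, and the precedence between signals is an assumption rather than a documented policy.

```python
from dataclasses import dataclass


@dataclass
class RemapStatus:
    """Row-remapping readings for one GPU (hypothetical field names)."""
    remapped_rows: int = 0   # total rows already remapped
    pending: bool = False    # Remapped Rows Pending = Yes
    failure: bool = False    # Remapping Failure Occurred = Yes
    partial_banks: int = 0   # histogram: Partial bucket
    low_banks: int = 0       # histogram: Low bucket
    none_banks: int = 0      # histogram: None bucket


def remap_action(status):
    """Map row-remapping readings to a suggested action label."""
    if status.failure or status.none_banks > 0:
        return "rma"             # failure flag set, or a bank has no spares left
    if status.pending:
        return "reset-required"  # GPU reset needed to complete the remap
    if status.low_banks > 0:
        return "push-for-rma"    # proactive RMA before a remap fails
    if status.remapped_rows > 0 or status.partial_banks > 0:
        return "watch"           # spare rows are being consumed
    return "ok"
```

For example, `remap_action(RemapStatus(remapped_rows=3, low_banks=1))` returns `"push-for-rma"`, matching the advice to act on the Low bucket before a remap actually fails.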

Hope it was useful. Cheers!

Saturday, November 15, 2025

Working with GPUs - Part2 - Memory fault indicators
