vineethac.blogspot.com: GPU

Showing posts with label GPU. Show all posts

Sunday, May 17, 2026

Working with GPUs - Part6 - H100 SXM5 architecture

Now that we have a foundational understanding of the core utilities used to monitor and manage GPUs, let's dive into the hardware architecture of the NVIDIA H100 SXM. To truly understand GPU computing, it is essential to visualize how data flows through the silicon. The following overview maps out the internal components of the H100, providing a clear frame of reference so you can easily correlate key architectural terms such as Streaming Multiprocessors (SMs), Tensor Cores, L2 Cache, High Bandwidth Memory (HBM3), etc.

Overview

H100 is released around Sep 2022
Based on the Hopper architecture
It comes in two form factors

PCIe (300W)
SXM (700W)

Has 80 GB HBM3 memory (3.35 TB/s)
132 SMs
528 Tensor cores (4 per SM)
80B transistors on a custom 4N process node

Architecture

Image ref: NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

HBM3 - High Bandwidth Memory

This is the off-chip 80 GB device memory.
Divided in 5 stacks and connected via 10 independent 512-bit memory controllers.
Data flow: SM - L1 - L2 - memory partition/ crossbar - memory controller - HBM stack
H100 SXM5 has 5 HBM3 stacks.
HBM3 stack is DRAM.

L2 cache

50 MB of L2 cache, divided into two 25 MB partitions.
L2 cache is SRAM.

Unified shared memory + L1 cache

256 KB size 33 TB/s bandwidth per SM divided into 32 banks, each 32 bits (4 bytes) wide.
These are SRAM.

Registers

Every thread gets a private set of on-chip registers.
They have very high bandwidth, and very low latency.
256 KB per SM.

Gigathread engine

This is the hardware that takes a kernel launch and hands out thread blocks to SMs.
It tracks which thread blocks are not yet started, running, and finished.
When an SM has capacity to run another thread block, the Gigathread engine assigns the next thread block to that SM.
This ensures intelligent work distribution for optimal GPU utilization.

SM - Streaming Multiprocessor

SMs are the fundamental execution unit of the GPU which executes thread blocks of a CUDA kernel.
H100 SXM5 GPU has 132 SMs.
Following are the components of SM:

FP32 CUDA cores, Int/FP64 units
4th gen Tensor cores
Shared memory/ L1 cache
L1 instruction cache
Warp scheduler
Dispatch units
Registers
L0 instruction cache

Each SM is divided into 4 identical sub-divisions called Quadrants or SMSPs (SM Sub Partitions).

TMA - Tensor Memory Accelerator

Each SM has a TMA unit.
Offloads the tensor copy operations from the SMs.

Tensor core

They are really fast units for performing MMA operations (Matrix Multiply Accumulate).
4 tensor cores per SM.

GPC - Graphics Processing Cluster

It is a group of 18 SMs.
There are 8 GPCs in a H100.
Each GPC is connected to its own chunk of L2 cache.
GPCs also enable the use of distributed shared memory between the SMs.

TPC - Texture Processing Unit

Single TPC holds 2 SMs.
Job of TPC is to have shared SM block, so that communication between the two SMs is really fast.

NVLink

4th gen NVLink.
18 NVLink 4.0 lanes which gives 900 GB/s of GPU-GPU bandwidth.

References

Hope it was useful. Cheers!

Sunday, April 12, 2026

Working with GPUs - Part5 - XID errors

If you are running large-scale AI training or LLM inference, you already know that managing a GPU cluster is less about "if" things break, and more about "when" and "why". In this post, we’ll demystify Nvidia XID errors, interpret them in the context of H100 NVLink systems, and outline a practical approach to triage and remediation.

XID (short for eXception ID) errors are diagnostic messages emitted by the Nvidia kernel driver (NVRM) when a GPU encounters an abnormal condition or fault. While some point to minor software glitches, others signal catastrophic hardware failures. With the H100 equipped with High Bandwidth Memory (HBM3) and NVLink interconnects - understanding these errors are critical to minimizing downtime.

How do we identify if any GPUs has XID errors

DCGMI diag

One way to identify XID errors is from DCGMI level 3 tests. If a critical or fatal hardware XID fires while the Level 3 tests are actively running (or if a sticky hardware error state was already present), the test will fail and output specific error strings. Here is an example:

# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 550.90.07                                      |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Found 56954234 faulty memory elements o  |
|                           | n GPU 0 Run a field diagnostic on the GPU.     |
| Info                      | GPU 0 Allocated space for 137 output matricie  |
|                           | s from 75937126809 bytes available., GPU 0 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 0 GPU 0 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Detected 43 xid_errors for GPU 0         | < xid_error
| Info                      | GPU 0 GPU 0 power average: 161 W               |
| Info                      | GPU 1 GPU 1 power average: 170 W               |
| Info                      | GPU 2 GPU 2 power average: 164 W               |
| Info                      | GPU 3 GPU 3 power average: 158 W               |
| Info                      | GPU 4 GPU 4 power average: 154 W               |
| Info                      | GPU 5 GPU 5 power average: 169 W               |
| Info                      | GPU 6 GPU 6 power average: 158 W               |
| Info                      | GPU 7 GPU 7 power average: 159 W               |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+

Note that while DCGMI is exceptionally good at flagging structural hardware faults, dcgmi diag will generally not flag application-level or user-space errors, even if they generate XIDs.

DCGMI dmon

Here is another way to look for XID errors using dcgmi dmon.

# dcgmi dmon -e 230 --count 1
#Entity   XIDER
ID
GPU 7     0
GPU 6     0
GPU 5     0
GPU 4     0
GPU 3     0
GPU 2     0
GPU 1     0
GPU 0     43

-e 230 is the filed id that shows the XID errors. The value shown under XIDER column is the specific XID error.
Note: If there are non‑zero values, that would mean one or more GPUs had logged Xid errors, and we need to cross‑reference the specific Xid codes in the kernel log and documentation to understand the nature of the fault.
Ref: Field Identifiers — NVIDIA DCGM Documentation latest documentation

OS logs

You will also find these XID errors in OS kernel logs, and syslog. Following are some examples:

# dmesg | grep -i xid
[  747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
[  747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
[  747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
[ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
[ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
[ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
[45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
[45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
[45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009


# journalctl -k | grep -i xid
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009


# grep -i xid /var/log/syslog
May  4 20:30:00 xx110-r113-node-02 kernel: [  747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May  4 20:30:00 xx110-r113-node-02 kernel: [  747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May  4 20:30:00 xx110-r113-node-02 kernel: [  747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May  4 20:38:23 xx110-r113-node-02 kernel: [ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May  4 20:38:23 xx110-r113-node-02 kernel: [ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May  4 20:38:23 xx110-r113-node-02 kernel: [ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May  4 20:40:44 xx110-r113-node-02 drpcli[4139]: Starting xid error detection test...
May  4 20:40:44 xx110-r113-node-02 drpcli[4139]: [MANDATORY] test_gpu_xid_errors: PASS, GPU XID error check passed. No errors found.
May  5 09:02:43 xx110-r113-node-02 kernel: [45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May  5 09:02:43 xx110-r113-node-02 kernel: [45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May  5 09:02:43 xx110-r113-node-02 kernel: [45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009

Common XID errors in H100 NVL GPUs

Application/ CUDA errors: XID 11/25/32/37/69/80 are often caused by application bugs. These are typically recoverable after the application restart.

Memory/ ECC errors: XID 48/64/94/95/140 are caused by GPU memory/ ECC/ remapping related errors or events. Immediate action is to reset the GPU, and if the problem persists, contact your hardware vendor.

NVLink fabric fault: XID 74 indicates a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error. This event may indicate a hardware failure with the link itself or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an XID 74 simply because the link went down as a result. The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links. If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support.

Here is the full list of XIDs, including their applicability across platforms (H100, B100, GB200, etc.): Analyzing Xid Errors with the Xid Catalog — XID Errors

References

Analyzing Xid Errors with the Xid Catalog — XID Errors

Hope this was useful. Cheers!

Saturday, March 7, 2026

Working with GPUs – A Practical Blog Series

This blog series captures practical learnings from working with GPUs in real‑world environments, with a focus on operations, reliability, and scale. Each post deep‑dives into specific aspects of GPU systems based on hands‑on experience, incidents, and operational challenges. Together, these articles aim to share actionable insights, highlight common pitfalls, and help teams build more robust and predictable GPU operations.

Part 01: Using nvidia-smi
Part 02: Memory fault indicators
Part 03: Using dcgmi
Part 04: Thermal issues
Part 05: XID errors
Part 06: H100 SXM5 architecture

Saturday, February 28, 2026

Working with GPUs - Part4 - Thermal issues

During routine diagnostics using dcgmi diag -r 3 test suite, several GPU nodes began failing due to thermal throttling issues. This blog post outlines how the issue was identified, investigated, and resolved.

The problem

Multiple GPU servers began consistently failing level 3 dcgmi stress tests following a routine BMC (Baseboard Management Controller) firmware upgrade on the 8 GPU Supermicro servers. The diagnostic output flagged clocks_throttle_reason_sw_thermal_slowdown errors and indicated some GPUs exceeded the user-specified maximum allowed temperature of 87. Importantly, no underlying GPU hardware faults were identified. The environment was running DCGM version 3.1.8 with GPU driver version 550.90.x.

When the GPU temperature goes above 87, you will see the following thermal diagnostic warnings in the dcgmi r3 results:

Initial investigation

The initial investigation confirmed that no persistent hardware issues, sensor failures, or abnormal readings existed outside of the R3 test execution. The datacenter partner also confirmed no temperature events in the facility, and physical walkthroughs of affected nodes showed conditions normal.

This pointed towards a cooling configuration problem triggered only under the extreme thermal load of the R3 stress test. As a targeted experiment, the server fan mode was manually switched from the default "Optimal Speed" to "Full Speed", followed by a power cycle. The result was definitive: with fans forced to maximum, all GPUs stayed within safe temperature limits and all R3 tests passed. This was validated across multiple nodes and was giving consistent pass results. The datacenter partner's engineering team separately confirmed the recommendation to operate these H100 GPUs is in "Optimal Speed" fan mode, ruling out Full Speed as a viable permanent fix.

While the dcgmi R3 tests are running, we continuously gathered GPU metrics and sensor values using nvidia-smi and ipmitool for further review.

nvidia-smi --query-gpu=index,name,temperature.gpu,temperature.memory --format=csv -l 1 >> gpu_temp.log

# cat gpu_temp.log | grep "HBM3, 88" <<< you can see GPU temperature > 87
2, NVIDIA H100 80GB HBM3, 88, 90
1, NVIDIA H100 80GB HBM3, 88, 86
7, NVIDIA H100 80GB HBM3, 88, 80
6, NVIDIA H100 80GB HBM3, 88, 89
6, NVIDIA H100 80GB HBM3, 88, 91
0, NVIDIA H100 80GB HBM3, 88, 85
1, NVIDIA H100 80GB HBM3, 88, 89
5, NVIDIA H100 80GB HBM3, 88, 91
5, NVIDIA H100 80GB HBM3, 88, 92

# cat gpu_temp.log | grep "HBM3, 89" <<< you can see GPU temperature > 87
3, NVIDIA H100 80GB HBM3, 89, 85
3, NVIDIA H100 80GB HBM3, 89, 81
5, NVIDIA H100 80GB HBM3, 89, 93
5, NVIDIA H100 80GB HBM3, 89, 91

#!/usr/bin/env bash

while true; do
  sudo ipmitool sensor list >> sensor_data.log
done

# cat sensor_data.log | grep Fail | grep GPU
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<< you can see GPU temperature > 87
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     88C/190F |    5C/41F |   87C/189F | <<<
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |
  Fail   | (1210) GPU Temp          |     87C/189F |    5C/41F |   87C/189F |

The above readings clearly indicate the GPUs are actually experiencing high temperature values above 87.

Root cause analysis

Working with the hardware vendor, the team uncovered the precise mechanism behind the cooling failure. The newer BMC firmware had introduced a fundamentally different fan control algorithm - a "T‑Limit"-based thermal management model - replacing the legacy temperature-threshold-based fan curve. However, the SDR (Sensor Data Record), which holds the fan curve data, is intentionally preserved (by default) across BMC firmware upgrades. As a result, after the firmware update the BMC continued operating with the outdated temperature-based fan curve parameters from the previous firmware version.

The change from temperature-based to T.Limit-based fan curve is a fundamental codebase change in the BMC thermal function. The SDR holds the fan curve data and must be cleared for the new curve to take effect. In practical terms, the stale fan curve meant that under the Optimal fan mode, fans did not ramp up aggressively enough to handle the rapid GPU thermal load generated during R3 tests, causing GPUs to exceed 87 °C.

Here is a brief explanation of the difference between temperature based and T.Limit based fan curve. In the temperature-based model, fan speeds are directly tied to specific temperature thresholds. As the temperature of a component (like a GPU) increases, the fan speed increases in predefined steps.

The BMC monitors the GPU temperature.
When the temperature crosses a threshold (e.g., 60°C, 70°C, 80°C), the fan speed increases accordingly.
The fan response is reactive - it only ramps up after the temperature rises.

Example:

GPU Temp (°C)	Fan Speed (%)
< 60	30%
60–70	50%
70–85	85%
> 85	100%

Note: During high-load scenarios (like DCGM R3 stress tests), the temperature can spike rapidly. The fan response may lag, allowing the GPU to overheat before the fans catch up.

T.Limit-Based Fan Curve is a newer model and uses a proactive approach. Instead of waiting for the temperature to rise, it uses the GPU’s thermal limits (T.Limit) and workload predictions to adjust fan speeds preemptively.

The BMC reads the GPU’s T.Limit (e.g., 87°C).
It monitors power draw, workload intensity, and thermal headroom.
Fan speed is adjusted dynamically to prevent the GPU from ever approaching the T.Limit.

Example:

GPU Temp (°C)	Distance from T.Limit (87°C)	Fan Speed (%)	Behavior Description
45	42°C below	20%	Idle state, minimal cooling needed
60	27°C below	50%	Moderate load detected, fans ramping up
70	17°C below	70%	High load, proactive cooling engaged
80	7°C below	100%	Nearing T.Limit, fans set to max speed
85	2°C below	100%	Critical threshold, full fan speed to prevent overheat
87	At T.Limit	100%	Max cooling, risk of thermal throttling
>87	Exceeded	100% + Throttle	Emergency cooling + GPU throttling initiated

Note: The system anticipates thermal load and ramps up fans early, preventing overheating during sudden spikes in GPU usage.

Resolution

The SDR preservation behavior is by design. The --overwrite_sdr flag is the correct and intended mechanism for applying new thermal control parameters when the fan curve implementation changes between BMC firmware versions on Supermicro servers. If you do not want to reflash the BMC firmware with --overwrite_sdr flag, you can try clear the SDR using ipmitool. In my case, I've used the following command to clear the SDR.

ipmitool -I lanplus -H <BMC_IP> -U <USERNAME> -P '<PASSWORD>' raw 0x30 0x44

Note: In a production environment, you may first check with the server hardware vendor before executing these commands. It may also vary between different vendors.

After clearing the SDR and performing a server power cycle, the thermal diagnostic warnings were no longer observed, and all dcgmi diag -r 3 tests passed successfully with the fan mode set to Optimal.

# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 550.90.07                                      |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+

Identifying and resolving GPU thermal issues is critical to maintaining system stability and performance, especially under high-load scenarios like training jobs. Left unaddressed, thermal throttling can degrade performance, cause test failures, and even lead to hardware damage or job interruptions. Proactive thermal management ensures reliable operation and maximizes the efficiency of GPU-intensive workloads.

Hope it was useful. Cheers!

Sunday, January 25, 2026

Working with GPUs - Part3 - Using dcgmi

The NVIDIA Data Center GPU Manager (DCGM) is a lightweight agent that performs several functions like GPU behavior monitoring, health and diagnostics, policy management, etc. DCGM is the underlying framework, and when you install it, it runs a service called the Host Engine, which collects data, monitors health, and manages GPUs. DCGMI is simply the CLI tool (the interface) used to talk to the engine. If you want to know what the engine is seeing or if you want to tell the engine to do something, you use dcgmi.

Install DCGM

In my case, I am installing it on Ubuntu 22.04.2 host.
You can download the required binaries from this Nvidia repository.
If there are previous versions of the package installed, you can follow this documentation and remove them.
For example, I am installing version 4.5.2, which is compatible with cuda 13.0.
Download the following packages from the above mentioned repo:

datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb

Install them.

sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb

# Enable the dcgm service
systemctl --now enable nvidia-dcgm
systemctl start nvidia-dcgm

# Check if the service is active
systemctl is-active --quiet nvidia-dcgm

Verify the installed version using: dcgmi --version

# dcgmi --version

dcgmi  version: 4.5.2

List all GPUs discovered by the host engine: dcgmi discovery -l

DCGM Diagnostics

Diagnostics is a subsystem within DCGM designed to stress-test and validate the physical and logical integrity of the GPU. It is a suite of automated tests that push the GPU beyond its normal idle state to uncover hidden hardware defects, driver instabilities, or environmental issues (like poor cooling or failing power supplies). In production environments, this utility helps to assess cluster readiness levels before a workload is deployed on it. It supports multiple run levels as explained below.

Level 1: used for sanity Check which run before starting a container or job to ensure the GPU is "alive."
Level 2: used for analyzing/ examining failures and to get more context about it.
Level 3/4: for extensive hardware screening (e.g., checking for thermal throttling, bandwidth checks, etc.).

Here is how you can run a level 3 test: dcgmi diag -r 3

# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 580.105.08                                     |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Page Retirement/Row Remap: GPU 3 had uncorrec  |
|                           | table memory errors and row remapping failed.  |
|                           |  Run a field diagnostic on the GPU.            |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| diagnostic                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Stress  ------------+------------------------------------------------+
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_power            | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+---------------------------+------------------------------------------------+

# dcgmi diag -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 580.105.08                                     |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Page Retirement/Row Remap: GPU 3 had uncorrec  |
|                           | table memory errors and row remapping failed.  |
|                           |  Run a field diagnostic on the GPU.            |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| diagnostic                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| pulse_test                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Stress  ------------+------------------------------------------------+
| memtest                   | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_power            | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+---------------------------+------------------------------------------------+

Here you can see GPU 3 had uncorrectable memory errors and row remap failed.

The dcgmi diag utility consists of multiple plugins as detailed below. Based on the selected run levels, respective plugins will be used to conduct the tests.

Deployment: verifies the compute environment is ready to run CUDA applications and is able to load the NVML library.
Diagnostic: performs large matrix multiplications. This will stress the GPU by having it draw a large amount of power and provide a high-level of throughput for five minutes (by default). During this process, the GPU will be monitored for all standard errors like XIDs, temperature violations, uncorrectable memory errors, etc. as well as the correctness of data being written and read.
PCIe - GPU bandwidth: purpose of this plugin is to stress the communication from the host to the GPUs as well as among the GPUs on the system. It will use NvLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe.
GPU memory: It performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues.
Targeted power: This test’s core purpose is to sustain a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU has a large workload that is sustained throughout the test; the workload does not pulse.
Targeted stress: maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.
Memtest diagnostic: this is similar to memtest86, which will exercise GPU memory with various test patterns.
Pulse test: this is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.
NVbandwidth: performs bandwidth measurements on NVIDIA GPUs on a single host.
Memory bandwidth: It measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. During this process, the GPU will be monitored for memory errors, CUDA errors, and performance thresholds.

Note: It is highly recommended to run these diagnostic tests while the node is in maintenance mode or when no active workloads (such as training jobs or inference services) are running on the GPU. Attempting to run higher-level diagnostics (especially levels 3 and 4) on an active node is a recipe for trouble: the diagnostic tests will likely fail to get the resources they need, and the contention for compute engines and VRAM may cause your production workloads to crash.

References

docs.nvidia.com/datacenter/dcgm

Hope it was useful. Cheers!

Saturday, November 15, 2025

Working with GPUs - Part2 - Memory fault indicators

AI workloads rely heavily on GPU memory reliability and, memory faults can silently degrade performance long before a GPU outright fails. Understanding which signals truly indicate GPU memory issues, and how to act on them is essential for stable operations at scale. This post focuses on authoritative memory fault indicators on Nvidia H100 GPUs, how they differ, and how to use them together to make correct operational decisions.

HBM3 (High Bandwidth Memory 3) memory on H100 delivers massive bandwidth, but it operates under extreme thermal, electrical, and utilization stress. When memory reliability starts degrading:

Model training can fail intermittently
NCCL performance may collapse unexpectedly
Silent data corruption risks increase
Faulty GPUs can impact entire multi‑GPU jobs

Early detection lets you drain, reset, isolate, or RMA a GPU before a customer‑visible incident occurs.

Primary indicators

ECC errors

ECC (Error Correcting Code) detects and reports bit‑level memory errors in DRAM and SRAM. A H100 SXM GPU has multiple memory types like:

HBM3 (80 GB off-chip) - DRAM
Shared L2 cache (50 MB on the GPU die) - SRAM
L1 cache and shared memory within a SM - SRAM
Registers within a SM - SRAM

ECC error types:

Correctable Errors (CE)

Single‑bit errors fixed automatically
Indicate beginning of memory stress or aging

Uncorrectable Errors (UE)

Multi‑bit errors not recoverable
High risk of data corruption
Immediate GPU isolation required

nvidia-smi -q -d ECC

As per Nvidia documentation, regarding SRAM uncorrectable errors, the RMA criteria is met for events outlined below. Any of the following events will trigger the SRAM Threshold Exceeded flag:

More than 4 UCE Unique Count events within an address bank for parity protected SRAMs.
More than 2 UCE Unique Count events within an address bank for SECDED ECC protected SRAMs.

nvidia-smi --query-gpu=index,name,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total --format=csv

nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total --format=csv

Notes

Rising CE counts is an early warning. Monitor closely.
Any UE count > 0, then drain workload, reboot/ isolate GPU, and if possible, ideally you should use these GPUs which already has history of UCE events for running low-priority jobs.
RMA criteria is met when SRAM Threshold Exceeded flag is set to Yes.

Remapped Rows

Row Remapping is a hardware healing mechanism in H100 HBM3.

When the GPU identifies failing memory rows:

Faulty rows are permanently retired
Spare rows are mapped in their place

nvidia-smi -q -d ROW_REMAPPER

Notes

Remapped Rows Pending = Yes

GPU detected bad memory rows
Reset required to complete remap

Remapped Rows > 0

Hardware has already consumed spare memory
Strong early RMA signal

The RMA criteria is met when the row-remapping failure flag is set

Remapping Failure Occurred: Yes

Bank Remap Availability Histogram shows more insight

None > 0, means one or more banks has zero spare rows left and a row remap attempt that could not be completed and will fail, is an RMA indicator

We should ideally consider/ push for RMA proactively when we notice GPUs with memory banks under Low buckets, so that we don't wait until it fails row remapping!

Increasing count of row remapping and increasing count of banks under partial and low bank remap availability histogram is the earliest and strongest indicator of degrading HBM3 memory; often appearing before serious ECC failures.

Hope it was useful. Cheers!

Saturday, October 18, 2025

Working with GPUs - Part1 - Using nvidia-smi

GPUs are the backbone of modern AI and HPC clusters, and understanding their basic health and configuration is the first step toward reliable operations. In this first post of the series, we start with nvidia-smi - the primary tool for discovering Nvidia GPUs, validating drivers and CUDA versions, and performing essential health checks. These fundamentals form the baseline for monitoring, performance benchmarking, and troubleshooting GPU compute nodes at scale.

Verify version

nvidia-smi --version

List all GPUs

nvidia-smi -L

Current state of every GPU

nvidia-smi

Following are the key observations from the above output:

All 8 GPUs detected (NVIDIA H100 80GB HBM3). Confirms full hardware enumeration.
HBM3 memory present (80GB per GPU). Validates expected SKU (H100 SXM vs PCIe). This is important because SXM GPUs behave differently in power, cooling, and NVLink bandwidth; troubleshooting playbooks differ by form factor.
Driver version 550.90.07 with CUDA compatibility 12.4. This confirms a Hopper‑supported, production‑ready driver stack. Many issues (NCCL failures, DCGM errors, framework crashes) trace back to unsupported driver–CUDA combinations.
Persistence Mode: On. This avoids GPU driver reinitialization delays and flaky behavior between jobs. Turning this off in clusters can cause intermittent job start failures or longer warm‑up times.
Temperatures in 34–41 °C range at idle. This indicates healthy cooling and airflow. High idle temperatures point to heatsink issues, airflow obstructions, fan/BMC problems, or thermal paste degradation.
Performance State: P0 (highest performance). This shows GPUs are not power or thermally‑throttled. If GPUs remain in lower P‑states under load, suspect thermal limits, power caps, or firmware misconfigurations.
Power usage ~70–76 W with cap at 700 W. This confirms ample power headroom and no throttling. GPUs hitting the power cap during load may show reduced performance even when utilization appears high.
GPU utilization at 0% and no running processes. This confirms the node is idle and clean. Useful to rule out “ghost” workloads, leaked CUDA contexts, or stuck processes when diagnosing performance drops.
Memory usage ~1 MiB per GPU. Only driver bookkeeping allocations present. Any significant memory use at idle suggests leftover processes or failed container teardown.
Volatile Uncorrected ECC errors: 0. Confirms memory integrity. Any non‑zero uncorrected ECC errors are serious and usually justify isolating the GPU and starting RMA/vendor diagnostics.
MIG mode: Disabled. Ensures full GPU and NVLink bandwidth availability. MIG partitions can severely impact NCCL and large‑model training if enabled unintentionally.
Compute mode: Default. Allows multiple processes (expected in shared clusters). Exclusive modes can cause unexpected job failures or scheduling issues.
Fan: N/A (SXM platform). Normal for chassis‑controlled cooling. Fan values appearing unexpectedly may indicate incorrect sensor readings or platform misidentification.

Health metrics of all GPUs

nvidia-smi -q

This shows details like:

Serial Number
VBIOS Version
GPU Part Number
Utilization
ECC Errors
Temperature, etc.

Query GPU health metrics

Help: nvidia-smi --help-query-gpu

GPU memory usage and utilization: nvidia-smi --query-gpu=index,name,uuid,driver_version,memory.total,memory.used,utilization.gpu --format=csv

GPU temperature status: nvidia-smi --query-gpu=index,name,uuid,temperature.gpu,temperature.gpu.tlimit,temperature.memory --format=csv

GPU reset state: nvidia-smi --query-gpu=index,name,uuid,reset_status.reset_required,reset_status.drain_and_reset_recommended --format=csv

reset_status.reset_required - indicates whether the GPU must be reset to return to a clean operational state.
reset_status.drain_and_reset_recommended - Yes, indicates GPU/ node should be drained first, then reset. No, indicates reset can be done immediately.
Note: In production GPU clusters based on Kubernetes, the safest and recommended practice is to always drain the node before attempting GPU recovery. For H100 SXM systems, recovery is performed via node reboot, not individual GPU resets.

NVLink topology

nvidia-smi topo -m

Note: Any non‑NVLink GPU‑to‑GPU path on H100 SXM immediately explains poor NCCL performance and requires hardware correction.

nvidia-smi nvlink -s # shows per direction (Tx or Rx) bandwidth of all nvlinks of all GPUs

nvidia-smi nvlink -s -i 0 # shows per direction (Tx or Rx) bandwidth of all nvlinks of the GPU 0

Hope it was useful. Cheers!

Pages

Sunday, May 17, 2026

Overview

Architecture

HBM3 - High Bandwidth Memory

L2 cache

Unified shared memory + L1 cache

Registers

Gigathread engine

SM - Streaming Multiprocessor

TMA - Tensor Memory Accelerator

Tensor core

GPC - Graphics Processing Cluster

TPC - Texture Processing Unit

NVLink

References

Sunday, April 12, 2026

How do we identify if any GPUs has XID errors

DCGMI diag

DCGMI dmon

OS logs

Common XID errors in H100 NVL GPUs

References

Saturday, March 7, 2026

Saturday, February 28, 2026

The problem

Initial investigation

Root cause analysis

Resolution

Sunday, January 25, 2026

Install DCGM

DCGM Diagnostics

References

Saturday, November 15, 2025

Primary indicators

ECC errors

Remapped Rows

Saturday, October 18, 2025

Verify version

nvidia-smi --version

List all GPUs

Current state of every GPU

Health metrics of all GPUs

Query GPU health metrics

NVLink topology