
Sunday, April 12, 2026

Working with GPUs - Part5 - XID errors

If you are running large-scale AI training or LLM inference, you already know that managing a GPU cluster is less about "if" things break and more about "when" and "why". In this post, we’ll demystify NVIDIA XID errors, interpret them in the context of H100 NVLink systems, and outline a practical approach to triage and remediation.


XID (short for eXception ID) errors are diagnostic messages emitted by the NVIDIA kernel driver (NVRM) when a GPU encounters an abnormal condition or fault. While some point to minor software glitches, others signal catastrophic hardware failures. With the H100's High Bandwidth Memory (HBM3) and NVLink interconnects, understanding these errors is critical to minimizing downtime.

How do we identify whether any GPU has XID errors?


DCGMI diag


One way to identify XID errors is from DCGMI level 3 tests. If a critical or fatal hardware XID fires while the Level 3 tests are actively running (or if a sticky hardware error state was already present), the test will fail and output specific error strings. Here is an example:
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 550.90.07                                      |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Found 56954234 faulty memory elements o  |
|                           | n GPU 0 Run a field diagnostic on the GPU.     |
| Info                      | GPU 0 Allocated space for 137 output matricie  |
|                           | s from 75937126809 bytes available., GPU 0 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 0 GPU 0 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Detected 43 xid_errors for GPU 0         | < xid_error
| Info                      | GPU 0 GPU 0 power average: 161 W               |
| Info                      | GPU 1 GPU 1 power average: 170 W               |
| Info                      | GPU 2 GPU 2 power average: 164 W               |
| Info                      | GPU 3 GPU 3 power average: 158 W               |
| Info                      | GPU 4 GPU 4 power average: 154 W               |
| Info                      | GPU 5 GPU 5 power average: 169 W               |
| Info                      | GPU 6 GPU 6 power average: 158 W               |
| Info                      | GPU 7 GPU 7 power average: 159 W               |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+
Note that while DCGMI is exceptionally good at flagging structural hardware faults, dcgmi diag will generally not flag application-level or user-space errors, even if they generate XIDs.
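
When the diagnostic is run across many nodes, it can help to pull the XID warning out of the saved output automatically. The following is a minimal sketch, assuming the warning text matches the "Detected N xid_errors for GPU M" form shown in the sample above; the function name xid_warnings is hypothetical, and a message that the table wraps across two rows would be missed by this simple line-by-line match.

```python
import re

# Matches warning text such as "GPU 0 Detected 43 xid_errors for GPU 0".
# The wording is taken from the sample output above and may differ
# between DCGM versions.
XID_WARNING = re.compile(r"Detected\s+(\d+)\s+xid_errors\s+for\s+GPU\s+(\d+)")


def xid_warnings(diag_output: str) -> dict[int, int]:
    """Return {gpu_index: reported_value} for each xid_errors warning found."""
    found = {}
    for line in diag_output.splitlines():
        match = XID_WARNING.search(line)
        if match:
            value, gpu = int(match.group(1)), int(match.group(2))
            found[gpu] = value
    return found


sample = "| Warning                   | GPU 0 Detected 43 xid_errors for GPU 0         |"
print(xid_warnings(sample))  # {0: 43}
```

An empty result only means no such warning line was present in the text that was scanned; it is not proof that the GPU is healthy.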

DCGMI dmon


Here is another way to look for XID errors using dcgmi dmon.
# dcgmi dmon -e 230 --count 1
#Entity   XIDER
ID
GPU 7     0
GPU 6     0
GPU 5     0
GPU 4     0
GPU 3     0
GPU 2     0
GPU 1     0
GPU 0     43
  • -e 230 is the field ID that reports XID errors. The value shown under the XIDER column is the specific XID error code.
  • Note: A non‑zero value means that GPU has logged an Xid error; cross‑reference the specific Xid code with the kernel log and the documentation to understand the nature of the fault.
  • Ref: Field Identifiers — NVIDIA DCGM Documentation latest documentation
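
For fleet-wide checks, the dmon output above is easy to turn into a list of GPUs that need attention. This is a rough sketch, assuming the three-column "GPU <index> <value>" row layout shown in the sample; the function names are hypothetical and the parser ignores header rows and anything it does not recognise.

```python
def parse_dmon_xid(output: str) -> dict[int, int]:
    """Map GPU index -> XIDER value from `dcgmi dmon -e 230` text output."""
    values = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows look like "GPU 0     43"; skip headers and blank lines.
        if (
            len(parts) == 3
            and parts[0] == "GPU"
            and parts[1].isdigit()
            and parts[2].isdigit()
        ):
            values[int(parts[1])] = int(parts[2])
    return values


def gpus_with_xid(output: str) -> list[int]:
    """Return the sorted indices of GPUs whose XIDER value is non-zero."""
    return sorted(gpu for gpu, xid in parse_dmon_xid(output).items() if xid != 0)


sample = """#Entity   XIDER
ID
GPU 7     0
GPU 1     0
GPU 0     43
"""
print(gpus_with_xid(sample))  # [0]
```

In practice the text would come from capturing the command's standard output, for example with subprocess.run(..., capture_output=True, text=True).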

OS logs


You will also find these XID errors in the OS kernel log and syslog. Here are some examples:
# dmesg | grep -i xid
[  747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
[  747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
[  747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
[ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
[ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
[ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
[45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
[45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
[45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
# journalctl -k | grep -i xid
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
# grep -i xid /var/log/syslog
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: Starting xid error detection test...
May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: [MANDATORY] test_gpu_xid_errors: PASS, GPU XID error check passed. No errors found.
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
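
A case-insensitive grep for "xid" also matches unrelated lines (the drpcli test messages above, for instance), so a stricter match on the NVRM prefix is useful when summarising logs. The sketch below assumes the "NVRM: Xid (PCI:<address>): <code>, <detail>" layout seen in the examples; the regular expression and function names are illustrative and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Layout taken from the log samples above:
#   NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_line(line: str):
    """Return a dict with pci, xid and detail, or None for non-Xid lines."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail"),
    }


def count_xids(lines) -> Counter:
    """Count events per (PCI address, Xid code) pair."""
    counts = Counter()
    for line in lines:
        event = parse_xid_line(line)
        if event is not None:
            counts[(event["pci"], event["xid"])] += 1
    return counts


sample = [
    "[  747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', "
    "name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): "
    "Out Of Range Address",
    "[  747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009",
    "May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: Starting xid error detection test...",
]
for (pci, xid), n in sorted(count_xids(sample).items()):
    print(pci, xid, n)
```

Grouping by PCI address makes it easier to see whether repeated events come from a single device, as they do in the logs above, where every event is reported for PCI:0000:19:00.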

Common XID errors in H100 NVL GPUs


Application/CUDA errors: XID 11/25/32/37/69/80 are often caused by application bugs. These are typically recoverable after an application restart.

Memory/ECC errors: XID 48/64/94/95/140 are caused by GPU memory, ECC, or remapping-related errors or events. The immediate action is to reset the GPU; if the problem persists, contact your hardware vendor.

NVLink fabric fault: XID 74 indicates a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error. This event may indicate a hardware failure with the link itself or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an XID 74 simply because the link went down as a result. The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links. If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support.
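
The three groups above can be captured in a small lookup for a first-pass triage script. This is only a sketch of the grouping described in this post, not an official classification: the category names, suggested actions, and the triage function are assumptions made for illustration, and any code outside these lists should be looked up in the Xid catalog.

```python
# Groupings copied from the three paragraphs above.
APPLICATION_XIDS = {11, 25, 32, 37, 69, 80}
MEMORY_XIDS = {48, 64, 94, 95, 140}
NVLINK_XIDS = {74}


def triage(xid: int) -> tuple[str, str]:
    """Return (category, suggested first action) for an Xid code."""
    if xid in APPLICATION_XIDS:
        return ("application", "restart the application and review its code")
    if xid in MEMORY_XIDS:
        return ("memory", "reset the GPU; contact the vendor if it persists")
    if xid in NVLINK_XIDS:
        return ("nvlink", "reset the GPU or reboot the node; check nvidia-smi nvlink")
    # Not covered by this post; consult the full Xid catalog.
    return ("unlisted", "look up the code in the Xid catalog")


for code in (74, 94, 13):
    category, action = triage(code)
    print(f"XID {code}: {category} -> {action}")
```

Returning "unlisted" rather than guessing keeps the script from mislabeling codes such as 13 and 43, which appear in the log samples earlier but are not in any of the three groups.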

Here is the full list of XIDs, including their applicability across platforms (H100, B100, GB200, etc.): Analyzing Xid Errors with the Xid Catalog — XID Errors

References